@shawngraham
shawngraham / groq_ocr.py
Last active February 24, 2026 19:10
Use ocr_keyword_search.py when you have a folder of OCR'd text and a file of keywords; use groq_ocr.py when you need to get that text in the first place.
"""
groq_ocr.py
Processes newspaper images using Groq's vision API and extracts individual
articles to a CSV. Each row represents one article with associated page metadata.
Usage:
python groq_ocr.py --input_dir processed_output/images --output ocr_results.csv
Requirements:
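The preview above is cut off, and ocr_keyword_search.py itself is not shown here. As a rough illustration of the keyword-search step the description mentions (a folder of OCR'd text plus a file of keywords), a minimal sketch might look like the following. The function names, the one-keyword-per-line file layout, and the `.txt` extension are all assumptions made for illustration, not details taken from the gist.

```python
from pathlib import Path


def load_keywords(keyword_file):
    """Read one keyword per line, skipping blank lines (assumed file layout)."""
    lines = Path(keyword_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def search_folder(text_dir, keywords):
    """Return (filename, keyword, count) for each keyword found in each .txt file.

    Matching is a case-insensitive substring count, which is a deliberate
    simplification: OCR noise may call for fuzzy matching instead.
    """
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        for keyword in keywords:
            count = text.count(keyword.lower())
            if count:
                hits.append((path.name, keyword, count))
    return hits
```

Calling `search_folder("ocr_text", load_keywords("keywords.list"))` would return a list of tuples that can be written to CSV; both paths here are placeholders.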
@shawngraham
shawngraham / gemini 3 flash transcription
Created February 13, 2026 15:01
Aufbau Vol15no26 1July1949- Vol16no19 12May1950.pdf page 1
# AUFBAU
## RECONSTRUCTION
**AN AMERICAN WEEKLY PUBLISHED IN NEW YORK**
**by The New World Club, Inc., 209 West 48th Street, New York 19, N. Y. Phone: CIrcle 7-4462**
*Entered as second-class matter January 20, 1934, at the Post Office New York, N. Y. under Act of March 3, 1879*
**Vol. XV—No. 26 | NEW YORK, N. Y., FRIDAY, JULY 1, 1949 | Price 10¢**
***
> **Zunächst in "Aufbau":**
@shawngraham
shawngraham / prompt-for-archaeological-notebook-transcription.txt
Last active February 11, 2026 16:41
A prompt to use with gemma 3:27b for archaeological notebooks. I suspect different models will require some tweaking of the prompt.
**Role:** You are a precise archaeological document analyst specializing in the digitization of field notebooks and excavation catalogues.
**Task:**
1. Perform a spatial analysis of the document to distinguish between text blocks, artifact photographs/sketches, and marginalia.
2. Extract metadata and create a brief 2-3 sentence overview of the document's contents.
3. Transcribe the document EXACTLY as written into a valid YAML structure.
4. Extract archaeological entities into specific categories based only on explicit mentions.
**Critical Rules:**
- **Zero Hallucination:** Only include information directly visible in the image. If a word is illegible, mark it as `[illegible]`.
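The prompt preview ends here. Since the rule above asks the model to mark unreadable words as `[illegible]`, one way to use that marker downstream is to measure how much of a transcription it covers and flag pages for manual review. The sketch below is an assumption about such a post-processing step, not part of the prompt; the function name and the idea of a review threshold are hypothetical.

```python
import re

# Matches the marker the prompt asks the model to emit for unreadable words.
ILLEGIBLE = re.compile(r"\[illegible\]", re.IGNORECASE)


def illegible_ratio(transcription):
    """Return the share of whitespace-separated tokens that contain the marker count.

    Computed as (number of [illegible] markers) / (number of tokens).
    An empty transcription returns 0.0.
    """
    tokens = transcription.split()
    if not tokens:
        return 0.0
    return len(ILLEGIBLE.findall(transcription)) / len(tokens)
```

A page whose ratio exceeds some chosen cutoff (the cutoff is a judgment call) could then be queued for a human to check against the image.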
@shawngraham
shawngraham / pixplot-using-python3-12.ipynb
Last active January 9, 2026 17:34
A version of YaleDH's pixplot image-corpus similarity visualizer that runs on Python 3.12, extended to generate network nodes and edges for a similarity graph.
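The notebook itself does not render here, so the following is only a sketch of the general idea of turning image similarity into graph nodes and edges. It assumes each image already has a numeric feature vector and uses cosine similarity with a cutoff; the function names, the threshold value, and the choice of cosine similarity are illustrative assumptions, not the notebook's actual method.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_graph(vectors, threshold=0.8):
    """Return (nodes, edges) from a dict mapping image name -> feature vector.

    An undirected edge (source, target, weight) is kept when the cosine
    similarity of the two vectors is at least `threshold` (hypothetical default).
    """
    nodes = sorted(vectors)
    edges = []
    for source, target in combinations(nodes, 2):
        weight = cosine_similarity(vectors[source], vectors[target])
        if weight >= threshold:
            edges.append((source, target, round(weight, 4)))
    return nodes, edges
```

The resulting node and edge lists can be written out as two CSV files for a network tool. Comparing every pair is quadratic in the number of images, so a large corpus would need a nearest-neighbour index instead.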
index  fruit        quantity  color   store
0      apple        12        red     loblaws
1      banana       18        yellow  farm boy
2      grape        30        purple  freshco
3      cherry       4         red     iga
4      watermelon   2         green   farm boy
5      raspberries  23        red     iga
@shawngraham
shawngraham / archaeo_rag.ipynb
Created July 11, 2025 15:45
For use with https://shawngraham.github.io/homecooked-history/hm-generator-site/enhanced.html; talk to your archaeological contexts! Import this into Google Colab to run.
import mesa


class LetterAgent(mesa.Agent):
    def __init__(self, model):
        super().__init__(model)
        self.letters_sent = 0
        self.letters_received = 0

    def step(self):
        print(f"Hi, I am agent {self.unique_id}.")
@shawngraham
shawngraham / search.py
Created January 6, 2025 13:51
get images by motif from p-lod
%%capture
!python3 -m pip install git+https://github.com/p-lod/plodlib
!pip install requests_cache
!pip install rdflib
import plodlib
import json
import pandas as pd
from string import Template
import rdflib as rdf
@shawngraham
shawngraham / results.txt
Created December 20, 2024 19:22
retraining ModernBERT to identify archaeological metadata
Input Text:
This archive presents appendices B-I and supplementary material resulting from the programme of archaeological works undertaken during the construction scheme to widen the A1 trunk road between Dishforth and Leeming Bar in North Yorkshire. The Iron Age to early medieval evidence from Healam Bridge, along with other evidence for Roman activity along the route is published in two volumes
Extracted Entities:
Input Text:
This collection comprises images, spreadsheets, reports, vector graphics, and scanned site records and drawings from archaeological recording by Archaeological Research Services at Lower Radbourne Deserted Medieval Village, Warwickshire. The work was undertaken between April and December 2021. Area C32070 was dominated by intercutting features predominantly dated to two broad phases, prehistoric and medieval. The prehistoric features were represented by a large ring ditch, potentially dating to the Early Bronze Age, four smaller potential Bronze Age ring ditches and a series of inte