Production AI Automation Notes #7: PDF cite verification — making LLM extraction auditable with per-fragment bounding boxes
Last tested: May 2026. ledongthuc/pdf v0.0.0-2025, draftcat v0.1.
Follow @renezander030 for the rest of this series on production patterns for AI agents that touch real systems.
When an LLM extracts an invoice total, a contract clause, or a due-diligence claim from a PDF, you have two options: trust the model, or verify it. PDF cite verification gives you a deterministic step between extraction and trust. Every claim the model emits carries a <cite> tag, and a pure-Go resolver confirms that tag points to actual text on the actual page, with a bounding box you can render.
This entry walks through the two-action pipeline shipped in draftcat: pdf_extract and pdf_verify_cite, plus the four-layer fuzzy match that makes resolution tolerant of model paraphrasing without going so loose that hallucinations slip through.
This is part 7 of Production AI Automation Notes:
- #1 Agent Approval Gates
- #2 Token Budgets (publishing alongside)
- #3 Agentic Task System
- #4 Driving CapCut from LLM agents
- #5 SQLite Dedup + Crash Safety (publishing alongside)
- #6 Prompt-Injection Defense (publishing alongside)
- #7 PDF cite verification (this entry)
| step | input | output | failure mode |
|---|---|---|---|
| pdf_extract | vars.path or data.pdf_path | pdf_doc, pdf_text, pdf_filename | open error, max_pages exceeded |
| ai step | pdf_text + cite-emitting prompt | ai_raw with <cite file="X" page="N">claim</cite> | model omits cite tags |
| pdf_verify_cite | pdf_doc + ai_raw | citations (file, page, text, resolved, span), citations_ok | unresolved cite when fail_on_unresolved=true aborts the run |
Three production cases where this is non-optional:
- Invoice line-item extraction. The model says "total: $13,695.00." If that number does not appear on page 1 of the invoice, you have a hallucination that will land in your AP system.
- Contract review. A summary that says "termination requires 30 days written notice" is fiction if the clause is not on the cited page.
- Due-diligence summaries. A risk flag attributed to "page 12" needs to actually be on page 12 or the auditor's report is a liability.
Cite verification turns LLM extraction from "trust me" into "here is the bounding box on the rendered page." A UI overlay highlights the exact x/y rectangle the model is referencing. A reviewer clicks through every claim in seconds. A failed resolution kills the pipeline before the bad data leaves the deterministic layer.
The pattern in draftcat is three steps: deterministic parse, AI extraction with a cite-emitting prompt, deterministic verification. No model touches the verification logic. That is the whole point.
name: invoice-extract
steps:
  - name: parse
    type: action
    action: pdf_extract
    vars:
      path: ./invoices/acme-2026-05.pdf
  - name: extract
    type: ai
    skill: invoice_line_items
  - name: verify
    type: action
    action: pdf_verify_cite
    vars:
      fail_on_unresolved: "true"

Both pdf_extract and pdf_verify_cite are pure Go. No API call, no network, no token cost. They run in milliseconds and they cannot hallucinate.
pdf_extract opens the file with ledongthuc/pdf, walks every page, and emits a structured PDFDoc with per-fragment bounding boxes. Coordinates use top-left origin (Y flipped from raw PDF) so a browser overlay can draw rectangles directly.
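The Y flip mentioned above can be sketched in a few lines. This is a minimal illustration, not draftcat's code: the Rect type and flipToTopLeft are hypothetical names, and the sketch assumes the raw Y value marks the bottom edge of the fragment, measured up from the bottom of the page.

```go
package main

import "fmt"

// Rect is a hypothetical box in page units (PDF points).
type Rect struct{ X, Y, W, H float64 }

// flipToTopLeft converts a box whose Y is measured upward from the bottom of
// the page (the usual PDF convention) into one whose Y is measured downward
// from the top. Assumption: r.Y is the bottom edge of the box.
func flipToTopLeft(r Rect, pageHeight float64) Rect {
	return Rect{X: r.X, Y: pageHeight - r.Y - r.H, W: r.W, H: r.H}
}

func main() {
	// US Letter is 612 x 792 points. The fragment is 12pt high and its
	// bottom edge sits 700pt above the bottom of the page.
	got := flipToTopLeft(Rect{X: 72, Y: 700, W: 100, H: 12}, 792)
	fmt.Printf("%+v\n", got) // {X:72 Y:80 W:100 H:12}
}
```

After the flip, X, Y, W, and H map directly onto the left, top, width, and height of an absolutely positioned overlay element, scaled by whatever factor the page is rendered at.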
type PDFDoc struct {
	Filename string    `json:"filename"`
	Pages    []PDFPage `json:"pages"`
}

type PDFPage struct {
	Num    int        `json:"num"`
	Width  float64    `json:"width"`
	Height float64    `json:"height"`
	Text   string     `json:"text"`
	Items  []TextItem `json:"items"`
}

type TextItem struct {
	Text     string  `json:"text"`
	X        float64 `json:"x"`
	Y        float64 `json:"y"`
	W        float64 `json:"w"`
	H        float64 `json:"h"`
	FontName string  `json:"font_name"`
	FontSize float64 `json:"font_size"`
}

The dispatcher writes three keys into pipeline data: pdf_doc (the full struct, used by the verifier), pdf_text (formatted with page markers for the LLM), and pdf_filename (used by the verifier to reject cross-document cites).
case "pdf_extract":
	path := step.Vars["path"]
	if path == "" {
		if v, ok := data["pdf_path"].(string); ok {
			path = v
		}
	}
	if path == "" {
		return fmt.Errorf("[step:%s] pdf_extract requires path var or data[pdf_path]", step.Name)
	}
	doc, err := pdfParser.Extract(path)
	if err != nil {
		return fmt.Errorf("[step:%s] pdf extract failed: %w", step.Name, err)
	}
	data["pdf_doc"] = doc
	data["pdf_filename"] = doc.Filename
	data["pdf_text"] = FormatPDFForPrompt(doc)

FormatPDFForPrompt inserts --- Page N --- markers so the model knows what page number to put in its cite tags. Per-page text is capped at 4000 chars to keep token budgets sane.
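A formatter along these lines can be sketched as follows. This is an illustration rather than draftcat's FormatPDFForPrompt: the Page type and formatForPrompt are hypothetical stand-ins, the sample text is made up, and counting the 4000-character cap in runes rather than bytes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Page is a trimmed-down, hypothetical stand-in for PDFPage.
type Page struct {
	Num  int
	Text string
}

// maxCharsPerPage is the per-page cap stated in the entry.
const maxCharsPerPage = 4000

// formatForPrompt writes one "--- Page N ---" marker per page followed by
// that page's text, truncated to the cap.
func formatForPrompt(pages []Page) string {
	var b strings.Builder
	for _, p := range pages {
		text := p.Text
		// Truncate on rune boundaries so multi-byte characters are not split.
		if r := []rune(text); len(r) > maxCharsPerPage {
			text = string(r[:maxCharsPerPage])
		}
		fmt.Fprintf(&b, "--- Page %d ---\n%s\n\n", p.Num, text)
	}
	return b.String()
}

func main() {
	fmt.Print(formatForPrompt([]Page{
		{Num: 1, Text: "Invoice (sample text)"},
		{Num: 2, Text: "Total $13,695.00"},
	}))
}
```

One consequence of any per-page cap: text past the cutoff never reaches the model, so claims from the tail of a long page cannot be cited even though the verifier could have resolved them.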
The model never sees raw fragment coordinates. It sees prompt-formatted text with page markers, and it is instructed to emit one cite tag per claim. Example skill prompt:
You are an invoice line-item extractor. Read the document below and emit
one bullet per line item.
For every numeric value you cite (subtotal, total, line item, tax),
wrap the verbatim string in:
<cite file="{{pdf_filename}}" page="N">EXACT TEXT</cite>
Rules:
- "EXACT TEXT" must be copied character-for-character from the page.
- "N" is the page number from the --- Page N --- marker.
- Do not paraphrase numbers. Do not add a $ sign that is not there.
- If a claim cannot be cited, omit the claim.
Document:
{{pdf_text}}
The "EXACT TEXT" rule is load-bearing. The verifier's four-layer match tolerates whitespace and punctuation differences, but the closer the model stays to verbatim, the higher the resolution rate.
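How the two placeholders in the skill prompt get filled is not shown in this entry. One plausible sketch uses strings.NewReplacer from the standard library; renderPrompt is a hypothetical helper, and the template string is a shortened version of the prompt above.

```go
package main

import (
	"fmt"
	"strings"
)

// renderPrompt fills the two placeholders used by the skill prompt.
// Hypothetical helper: draftcat's real substitution mechanism may differ.
func renderPrompt(tmpl, filename, pdfText string) string {
	return strings.NewReplacer(
		"{{pdf_filename}}", filename,
		"{{pdf_text}}", pdfText,
	).Replace(tmpl)
}

func main() {
	tmpl := "Wrap values in <cite file=\"{{pdf_filename}}\" page=\"N\">EXACT TEXT</cite>\n" +
		"Document:\n{{pdf_text}}"
	fmt.Println(renderPrompt(tmpl, "acme-2026-05.pdf", "--- Page 1 ---\nTotal $13,695.00"))
}
```

A Replacer makes a single pass over the template and does not rescan substituted text, so a PDF that happens to contain the literal string {{pdf_filename}} is passed through unchanged instead of being expanded a second time.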
The verifier extracts every cite tag from ai_raw with one regex, then resolves each one against pdf_doc using a four-layer fuzzy match.
The regex:
var citeTagRe = regexp.MustCompile(`<cite file="([^"]+)" page="(\d+)">([^<]+)</cite>`)

The dispatcher loop:
case "pdf_verify_cite":
	doc, ok := data["pdf_doc"].(*PDFDoc)
	if !ok {
		return fmt.Errorf("[step:%s] pdf_verify_cite needs an earlier pdf_extract step", step.Name)
	}
	raw, _ := data["ai_raw"].(string)
	cites := citeTagRe.FindAllStringSubmatch(raw, -1)
	var unresolved []string
	results := make([]map[string]interface{}, 0, len(cites))
	for _, m := range cites {
		file, pageStr, text := m[1], m[2], m[3]
		// pageStr matched \d+, so Atoi can only fail on overflow.
		page, _ := strconv.Atoi(pageStr)
		entry := map[string]interface{}{
			"file": file, "page": page, "text": text, "resolved": false,
		}
		// Reject cites that name a different document.
		if file != doc.Filename {
			unresolved = append(unresolved, fmt.Sprintf("%s p%d (wrong file)", file, page))
			results = append(results, entry)
			continue
		}
		// Resolve the cited text against the cited page only.
		if span, found := pdfParser.FindSpan(doc, page, text); found {
			entry["resolved"] = true
			entry["span"] = span
		} else {
			unresolved = append(unresolved, fmt.Sprintf("%s p%d: %q", file, page, text))
		}
		results = append(results, entry)
	}
	data["citations"] = results
	data["citations_ok"] = len(unresolved) == 0
	// Abort only when the pipeline opted in to strict mode.
	if len(unresolved) > 0 && step.Vars["fail_on_unresolved"] == "true" {
		return fmt.Errorf("[step:%s] unresolved citations: %s",
			step.Name, strings.Join(unresolved, "; "))
	}

FindSpan tries four normalisations in order and returns on the first hit:
- exact: strings.ToLower. Verbatim match modulo case.
- whitespace: collapse runs of whitespace to single space. Handles models that emit Total $13,695.00 when the PDF has tabs.
- currency: strip $, €, £, and commas. Handles 13695.00 vs $13,695.00.
- alphanum: keep only letters and digits. Last-resort match.
Each layer returns the union bounding box of the matched fragments plus the strategy name (exact, whitespace, currency, alphanum) so callers can downweight loose matches when scoring confidence.
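The ladder can be sketched at the string level as below. This is an illustration, not draftcat's FindSpan: it matches against a page's text instead of its fragments, returns only the strategy name (no bounding box), and assumes each layer builds on the one before it. All function names and the sample page text are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

var wsRun = regexp.MustCompile(`\s+`)

// Normalisers, strictest first. Assumption: each layer includes the previous one.
func normExact(s string) string { return strings.ToLower(s) }

func normWhitespace(s string) string {
	return strings.TrimSpace(wsRun.ReplaceAllString(normExact(s), " "))
}

func normCurrency(s string) string {
	return strings.NewReplacer("$", "", "€", "", "£", "", ",", "").Replace(normWhitespace(s))
}

func normAlphanum(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return r
		}
		return -1 // a negative rune tells strings.Map to drop the character
	}, normExact(s))
}

// matchStrategy returns the name of the first layer under which needle
// occurs in pageText, trying layers from strictest to loosest.
func matchStrategy(pageText, needle string) (string, bool) {
	layers := []struct {
		name string
		norm func(string) string
	}{
		{"exact", normExact},
		{"whitespace", normWhitespace},
		{"currency", normCurrency},
		{"alphanum", normAlphanum},
	}
	for _, l := range layers {
		n := l.norm(needle)
		// An empty needle would match anything, so it is never accepted.
		if n != "" && strings.Contains(l.norm(pageText), n) {
			return l.name, true
		}
	}
	return "", false
}

func main() {
	page := "Subtotal\t$12,450.00\nTotal\t$13,695.00" // sample text with tabs
	needles := []string{"$13,695.00", "Total $13,695.00", "13695.00", "Total $13695 00", "$99.00"}
	for _, needle := range needles {
		strat, ok := matchStrategy(page, needle)
		fmt.Printf("%-20q %-10s %v\n", needle, strat, ok)
	}
}
```

On the sample page the first four needles resolve as exact, whitespace, currency, and alphanum respectively, and $99.00 does not resolve at any layer. Ordering the layers from strict to loose is what lets the reported strategy act as a confidence signal.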
The four-layer ladder is covered by TestFindSpan_Layered in pdf_test.go:
cases := []struct{ name, needle, wantStrat string }{
	{"exact verbatim", "$13,695.00", "exact"},
	{"whitespace-flexible", "Total $13,695.00", "whitespace"},
	{"currency stripped", "13695.00", "currency"},
	{"alphanumeric only", "Total $13695 00", "alphanum"},
}

One pipeline variable decides whether unresolved cites kill the run or just log them:
vars:
  fail_on_unresolved: "true" # compliance: any miss aborts

vars:
  fail_on_unresolved: "false" # triage: log misses, keep going

Use true for any pipeline whose output ships to a downstream system: AP, CRM, contract DB. A single unresolved cite means the model fabricated a claim and the deterministic layer caught it. Better to fail loud than write fiction to a system of record.
Use false for exploratory pipelines. A paralegal scanning 200 contracts for "any clauses worth a closer look" wants unresolved cites flagged, not the run killed. Default is false. Production pipelines should set it explicitly either way.
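One case neither setting covers, going by the dispatcher loop shown earlier: if ai_raw contains no cite tag that the regex can parse, the unresolved list stays empty and citations_ok comes out true. A minimal guard, assuming you want to flag that case yourself, could look like this. checkCiteCoverage is a hypothetical helper and the sample strings are made up; the regex is the one from this entry.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as citeTagRe in this entry.
var citeTagRe = regexp.MustCompile(`<cite file="([^"]+)" page="(\d+)">([^<]+)</cite>`)

// checkCiteCoverage returns how many well-formed cite tags the regex parsed
// and how many "<cite" openings appear in total. A gap between the two means
// some tags were malformed and silently skipped by the regex.
func checkCiteCoverage(aiRaw string) (parsed, opened int) {
	parsed = len(citeTagRe.FindAllStringSubmatch(aiRaw, -1))
	opened = strings.Count(aiRaw, "<cite")
	return parsed, opened
}

func main() {
	samples := []string{
		`Total: <cite file="acme.pdf" page="1">$13,695.00</cite>`,
		`Total: <cite page="1" file="acme.pdf">$13,695.00</cite>`, // attributes swapped
		`Total: $13,695.00`, // no cite tag at all
	}
	for _, s := range samples {
		parsed, opened := checkCiteCoverage(s)
		switch {
		case opened == 0:
			fmt.Println("no cite tags: nothing was verified")
		case parsed < opened:
			fmt.Printf("%d of %d cite tags malformed\n", opened-parsed, opened)
		default:
			fmt.Printf("%d cite tags parsed\n", parsed)
		}
	}
}
```

Treating either of the first two outcomes as a failure in strict pipelines keeps an empty citation list from passing for a clean one.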
Honest limits, because the verifier is not a hallucination oracle:
- In-context hallucinations. The model can cite real text out of context. If page 3 has boilerplate "termination notice: 30 days" and the operative clause on page 17 says "90 days," the page-3 cite resolves happily. Cite verification proves text-on-page, not semantic correctness.
- Scanned PDFs. ledongthuc/pdf reads embedded text. A scanned image-PDF produces zero items per page. Pipe through ocrmypdf or equivalent first, then run pdf_extract on the OCR output. Quality of the verification follows quality of the OCR.
- Tables and multi-column layouts. Reading order in PDFs is publisher-dependent. A two-column page may interleave columns in pg.Text, breaking whitespace-layer matches even when the alphanum layer rescues them. Tables with cell-by-cell extraction often need a structured parser like tabula upstream.
- Loose alphanum matches. The alphanum layer strips everything except letters and digits. Total 13695 00 matches both $13,695.00 and any coincidental sequence on a different line. The Strategy field in Span lets you reject alphanum hits in high-stakes pipelines.
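Two of the limits above lend themselves to cheap guards. The sketch below is illustrative only: the Citation type, strictEnough, and looksScanned are hypothetical names, and the flattened Strategy field stands in for the Strategy value carried by Span.

```go
package main

import "fmt"

// Citation is a hypothetical, flattened view of one verifier result.
type Citation struct {
	Text     string
	Resolved bool
	Strategy string // "exact", "whitespace", "currency", or "alphanum"
}

// strictEnough accepts a citation only if it resolved under an allowed strategy.
func strictEnough(c Citation, allowed map[string]bool) bool {
	return c.Resolved && allowed[c.Strategy]
}

// looksScanned reports whether every page came back with zero text items,
// the signature of an image-only PDF that needs OCR first.
func looksScanned(itemsPerPage []int) bool {
	if len(itemsPerPage) == 0 {
		return false // no pages at all: nothing to judge
	}
	for _, n := range itemsPerPage {
		if n > 0 {
			return false
		}
	}
	return true
}

func main() {
	// High-stakes policy: alphanum hits are not trusted.
	allowed := map[string]bool{"exact": true, "whitespace": true, "currency": true}
	cites := []Citation{
		{Text: "$13,695.00", Resolved: true, Strategy: "exact"},
		{Text: "Total 13695 00", Resolved: true, Strategy: "alphanum"},
	}
	for _, c := range cites {
		fmt.Printf("%q accepted=%v\n", c.Text, strictEnough(c, allowed))
	}
	fmt.Println("scanned:", looksScanned([]int{0, 0, 0}))
}
```

Running the scanned check right after extraction stops a pipeline before it spends tokens on an empty pdf_text, and the strategy allowlist lets a compliance pipeline stay strict while a triage pipeline keeps the loose matches.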
draftcat ships:
- pdf.go: PDFParser, PDFDoc / PDFPage / TextItem / Span, FindSpan with the four-layer ladder, FormatPDFForPrompt
- main.go: pdf_extract and pdf_verify_cite dispatcher cases, citeTagRe
- validate.go: validKnownActions registration so draftcat validate accepts these actions
- pdf_test.go: TestFindSpan_Layered and TestCiteTagRegex covering the match ladder and the regex
MIT license. Go module: github.com/renezander030/draftcat. Build with go build, run a pipeline with ./draftcat run pipelines/invoice-extract.yaml.
If you have shipped cite verification in production, drop a comment with what you learned:
- Are you using tesseract or another OCR for scanned-PDF preprocessing? Which threshold settings survived contact with real-world inputs?
- What fuzzy-match thresholds did you settle on? Did you go beyond exact / whitespace / currency / alphanum, or simpler?
- RAGAS, which integrates with LangChain, ships a faithfulness eval. Have you compared its scoring to a deterministic cite-resolver on the same corpus?
- For tables, did you pre-extract with tabula / camelot / pdfplumber before handing the cell grid to the LLM, or did you train the prompt to cite cell coordinates?
- #1 Agent Approval Gates: draft, validate, approve, dispatch, audit
- #2 Token Budgets: per-step and per-day caps, pre-flight checks
- #3 Agentic Task System: Qdrant + Ollama for skill recall
- #4 Driving CapCut from LLM agents: JSON-driven video composition
- #5 SQLite Dedup + Crash Safety: content-hash dedup, atomic writes
- #6 Prompt-Injection Defense: input quarantine, allowlist tools
- #7 PDF cite verification (this entry)
- skill_format reference: the YAML skill schema draftcat uses
2026-05-25: initial publish