Production AI Automation Notes #7: PDF cite verification — making LLM extraction auditable with per-fragment bounding boxes

Last tested: May 2026. ledongthuc/pdf v0.0.0-2025, draftcat v0.1.

Follow @renezander030 for the rest of this series on production patterns for AI agents that touch real systems.

When an LLM extracts an invoice total, a contract clause, or a due-diligence claim from a PDF, you have two options: trust the model, or verify it. PDF cite verification gives you a deterministic step between extraction and trust. Every claim the model emits carries a <cite> tag, and a pure-Go resolver confirms that tag points to actual text on the actual page, with a bounding box you can render.

This entry walks through the two-action pipeline shipped in draftcat: pdf_extract and pdf_verify_cite, plus the four-layer fuzzy match that makes resolution tolerant of model paraphrasing without going so loose that hallucinations slip through.

Series

This is part 7 of Production AI Automation Notes:

#1 Agent Approval Gates
#2 Token Budgets (publishing alongside)
#3 Agentic Task System
#4 Driving CapCut from LLM agents
#5 SQLite Dedup + Crash Safety (publishing alongside)
#6 Prompt-Injection Defense (publishing alongside)
#7 PDF cite verification (this entry)

TL;DR cheat sheet

| step | input | output | failure mode |
| --- | --- | --- | --- |
| `pdf_extract` | `vars.path` or `data.pdf_path` | `pdf_doc`, `pdf_text`, `pdf_filename` | open error, max_pages exceeded |
| `ai` step | `pdf_text` + cite-emitting prompt | `ai_raw` with `<cite file="X" page="N">claim</cite>` | model omits cite tags |
| `pdf_verify_cite` | `pdf_doc` + `ai_raw` | `citations` (file, page, text, resolved, span), `citations_ok` | unresolved cite when `fail_on_unresolved=true` aborts the run |

Why PDF cite verification matters

Three production cases where this is non-optional:

Invoice line-item extraction. The model says "total: $13,695.00." If that number does not appear on page 1 of the invoice, you have a hallucination that will land in your AP system.
Contract review. A summary that says "termination requires 30 days written notice" is fiction if the clause is not on the cited page.
Due-diligence summaries. A risk flag attributed to "page 12" needs to actually be on page 12 or the auditor's report is a liability.

Cite verification turns LLM extraction from "trust me" into "here is the bounding box on the rendered page." A UI overlay highlights the exact x/y rectangle the model is referencing, so a reviewer can click through every claim quickly. With fail_on_unresolved set to "true", a failed resolution aborts the pipeline before the bad data leaves the deterministic layer.

The two-action pipeline

The pattern in draftcat is three steps, with the two deterministic actions wrapped around one AI step: deterministic parse, AI extraction with a cite-emitting prompt, deterministic verification. No model touches the verification logic. That is the whole point.

name: invoice-extract
steps:
  - name: parse
    type: action
    action: pdf_extract
    vars:
      path: ./invoices/acme-2026-05.pdf

  - name: extract
    type: ai
    skill: invoice_line_items

  - name: verify
    type: action
    action: pdf_verify_cite
    vars:
      fail_on_unresolved: "true"

Both pdf_extract and pdf_verify_cite are pure Go. No API call, no network, no token cost. They run in milliseconds and they cannot hallucinate.

Step 1: pdf_extract

pdf_extract opens the file with ledongthuc/pdf, walks every page, and emits a structured PDFDoc with per-fragment bounding boxes. Coordinates use top-left origin (Y flipped from raw PDF) so a browser overlay can draw rectangles directly.

type PDFDoc struct {
    Filename string    `json:"filename"`
    Pages    []PDFPage `json:"pages"`
}

type PDFPage struct {
    Num    int        `json:"num"`
    Width  float64    `json:"width"`
    Height float64    `json:"height"`
    Text   string     `json:"text"`
    Items  []TextItem `json:"items"`
}

type TextItem struct {
    Text     string  `json:"text"`
    X        float64 `json:"x"`
    Y        float64 `json:"y"`
    W        float64 `json:"w"`
    H        float64 `json:"h"`
    FontName string  `json:"font_name"`
    FontSize float64 `json:"font_size"`
}
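The Y flip and the overlay mapping can be sketched in a few lines. This is a minimal illustration, not draftcat's code: `Rect`, `flipY`, and `scale` are hypothetical names, and it assumes the raw PDF y value is the bottom edge of the fragment's box, measured from the bottom of the page.

```go
package main

import "fmt"

// Rect is a rectangle with a top-left origin, matching the X/Y/W/H
// convention described for TextItem. The type itself is hypothetical.
type Rect struct{ X, Y, W, H float64 }

// flipY converts a box whose y is measured upward from the bottom of the
// page (the raw PDF convention) into a top-left-origin rectangle.
// Assumption: yBottom is the bottom edge of the fragment's box.
func flipY(pageHeight, x, yBottom, w, h float64) Rect {
	return Rect{X: x, Y: pageHeight - yBottom - h, W: w, H: h}
}

// scale maps page units to pixels for an overlay, given the pixel width
// at which the page image is rendered.
func scale(r Rect, pageWidth, renderedWidth float64) Rect {
	k := renderedWidth / pageWidth
	return Rect{X: r.X * k, Y: r.Y * k, W: r.W * k, H: r.H * k}
}

func main() {
	// A 60x12 fragment sitting 700 units above the bottom of a 612x792 page.
	r := flipY(792, 72, 700, 60, 12)
	fmt.Printf("%+v\n", r)
	// The same fragment on a page image rendered 1224 pixels wide.
	fmt.Printf("%+v\n", scale(r, 612, 1224))
}
```

Because the stored coordinates already use a top-left origin, a browser overlay only needs the scale step; the flip happens once, at extraction time.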

The dispatcher writes three keys into pipeline data: pdf_doc (the full struct, used by the verifier), pdf_text (formatted with page markers for the LLM), and pdf_filename (used by the verifier to reject cross-document cites).

case "pdf_extract":
    path := step.Vars["path"]
    if path == "" {
        if v, ok := data["pdf_path"].(string); ok { path = v }
    }
    if path == "" {
        return fmt.Errorf("[step:%s] pdf_extract requires path var or data[pdf_path]", step.Name)
    }
    doc, err := pdfParser.Extract(path)
    if err != nil {
        return fmt.Errorf("[step:%s] pdf extract failed: %w", step.Name, err)
    }
    data["pdf_doc"] = doc
    data["pdf_filename"] = doc.Filename
    data["pdf_text"] = FormatPDFForPrompt(doc)

FormatPDFForPrompt inserts --- Page N --- markers so the model knows what page number to put in its cite tags. Per-page text is capped at 4000 chars to keep token budgets sane.
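A rough sketch of what such a formatter could look like, under stated assumptions: `formatForPrompt` and `page` are hypothetical stand-ins for FormatPDFForPrompt and PDFPage, the exact marker spacing is a guess, and the cap is counted in runes here (whether draftcat counts bytes or runes is not stated).

```go
package main

import (
	"fmt"
	"strings"
)

// page is a pared-down stand-in for PDFPage (hypothetical).
type page struct {
	Num  int
	Text string
}

// maxPageChars mirrors the 4000-character per-page cap described above.
const maxPageChars = 4000

// formatForPrompt writes one "--- Page N ---" marker per page followed by
// that page's text, truncated to maxPageChars runes so a multi-byte
// character is never split.
func formatForPrompt(pages []page) string {
	var b strings.Builder
	for _, p := range pages {
		text := p.Text
		if r := []rune(text); len(r) > maxPageChars {
			text = string(r[:maxPageChars])
		}
		fmt.Fprintf(&b, "--- Page %d ---\n%s\n\n", p.Num, text)
	}
	return b.String()
}

func main() {
	out := formatForPrompt([]page{
		{Num: 1, Text: "Invoice 42"},
		{Num: 2, Text: "Total\t$13,695.00"},
	})
	fmt.Print(out)
}
```

One consequence of any per-page cap: text past the cut-off is never shown to the model, so it cannot be cited, even though the verifier still holds the full page in pdf_doc.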

Step 2: the cite-emitting prompt

The model never sees raw fragment coordinates. It sees prompt-formatted text with page markers, and it is instructed to emit one cite tag per claim. Example skill prompt:

You are an invoice line-item extractor. Read the document below and emit
one bullet per line item.

For every numeric value you cite (subtotal, total, line item, tax),
wrap the verbatim string in:
<cite file="{{pdf_filename}}" page="N">EXACT TEXT</cite>

Rules:
- "EXACT TEXT" must be copied character-for-character from the page.
- "N" is the page number from the --- Page N --- marker.
- Do not paraphrase numbers. Do not add a $ sign that is not there.
- If a claim cannot be cited, omit the claim.

Document:
{{pdf_text}}

The "EXACT TEXT" rule is load-bearing. The verifier's four-layer match tolerates differences in case, whitespace, currency symbols, and punctuation, but the closer the model stays to verbatim, the higher the resolution rate and the stricter the layer that resolves it.

Step 3: pdf_verify_cite

The verifier extracts every cite tag from ai_raw with one regex, then resolves each one against pdf_doc using a four-layer fuzzy match.

The regex:

var citeTagRe = regexp.MustCompile(`<cite file="([^"]+)" page="(\d+)">([^<]+)</cite>`)
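The pattern is strict about tag shape, which is worth seeing on a concrete input. The sketch below reuses the regex as written; `cite` and `parseCites` are hypothetical helpers, and the sample output is invented for illustration. Tags with reversed attribute order or single quotes do not match, so they are skipped silently rather than reported as unresolved.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var citeTagRe = regexp.MustCompile(`<cite file="([^"]+)" page="(\d+)">([^<]+)</cite>`)

// cite is a hypothetical typed view of one regex match.
type cite struct {
	File string
	Page int
	Text string
}

// parseCites returns every tag the regex accepts, in order of appearance.
func parseCites(raw string) []cite {
	var out []cite
	for _, m := range citeTagRe.FindAllStringSubmatch(raw, -1) {
		page, err := strconv.Atoi(m[2])
		if err != nil {
			continue // \d+ guarantees digits; this only triggers on overflow
		}
		out = append(out, cite{File: m[1], Page: page, Text: m[3]})
	}
	return out
}

func main() {
	raw := `- Total: <cite file="acme.pdf" page="1">$13,695.00</cite>
- Tax: <cite page="1" file="acme.pdf">$1,245.00</cite>
- Terms: <cite file='acme.pdf' page='2'>Net 30</cite>`
	// Only the first tag has the attribute order and quoting the regex expects.
	for _, c := range parseCites(raw) {
		fmt.Printf("%s p%d %q\n", c.File, c.Page, c.Text)
	}
}
```

This is why the prompt spells out the tag format literally: a model that drifts from it produces claims the verifier never sees.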

The dispatcher loop:

case "pdf_verify_cite":
    doc, ok := data["pdf_doc"].(*PDFDoc)
    if !ok {
        return fmt.Errorf("[step:%s] pdf_verify_cite needs an earlier pdf_extract step", step.Name)
    }
    raw, _ := data["ai_raw"].(string)
    cites := citeTagRe.FindAllStringSubmatch(raw, -1)
    var unresolved []string
    results := make([]map[string]interface{}, 0, len(cites))
    for _, m := range cites {
        file, pageStr, text := m[1], m[2], m[3]
        page, _ := strconv.Atoi(pageStr)
        entry := map[string]interface{}{
            "file": file, "page": page, "text": text, "resolved": false,
        }
        if file != doc.Filename {
            unresolved = append(unresolved, fmt.Sprintf("%s p%d (wrong file)", file, page))
            results = append(results, entry)
            continue
        }
        if span, found := pdfParser.FindSpan(doc, page, text); found {
            entry["resolved"] = true
            entry["span"] = span
        } else {
            unresolved = append(unresolved, fmt.Sprintf("%s p%d: %q", file, page, text))
        }
        results = append(results, entry)
    }
    data["citations"] = results
    data["citations_ok"] = len(unresolved) == 0
    if len(unresolved) > 0 && step.Vars["fail_on_unresolved"] == "true" {
        return fmt.Errorf("[step:%s] unresolved citations: %s",
            step.Name, strings.Join(unresolved, "; "))
    }
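One property of the loop above: citations_ok is computed as len(unresolved) == 0, so an ai_raw with no cite tags at all yields citations_ok = true and never trips fail_on_unresolved. The cheat sheet lists "model omits cite tags" as the failure mode of the ai step, so a pipeline may want an explicit guard. A minimal sketch, where `checkCitations` and the `requireCites` flag are hypothetical and not part of draftcat:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNoCitations marks an AI output with zero cite tags (hypothetical).
var errNoCitations = errors.New("ai output contained no cite tags")

// checkCitations layers a "must cite something" rule on top of the
// fail_on_unresolved behaviour shown in the dispatcher.
func checkCitations(total int, unresolved []string, failOnUnresolved, requireCites bool) error {
	if total == 0 && requireCites {
		return errNoCitations
	}
	if len(unresolved) > 0 && failOnUnresolved {
		return fmt.Errorf("unresolved citations: %s", strings.Join(unresolved, "; "))
	}
	return nil
}

func main() {
	// No tags at all: passes silently unless requireCites is set.
	fmt.Println(checkCitations(0, nil, true, false))
	fmt.Println(checkCitations(0, nil, true, true))
	// One unresolved tag with fail_on_unresolved enabled.
	fmt.Println(checkCitations(2, []string{`acme.pdf p1: "$99.00"`}, true, true))
}
```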

FindSpan tries four normalisations in order and returns on the first hit:

exact: strings.ToLower. Verbatim match modulo case.
whitespace: collapse runs of whitespace to single space. Handles models that emit Total $13,695.00 when the PDF has tabs.
currency: strip $, €, £, and commas. Handles 13695.00 vs $13,695.00.
alphanum: keep only letters and digits. Last-resort match.

Each layer returns the union bounding box of the matched fragments plus the strategy name (exact, whitespace, currency, alphanum) so callers can downweight loose matches when scoring confidence.
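The union bounding box is the smallest rectangle that covers every matched fragment. A sketch of that computation, assuming top-left-origin boxes as in TextItem; `Box` and `union` are hypothetical names, not draftcat's Span code.

```go
package main

import (
	"fmt"
	"math"
)

// Box uses the same top-left-origin X/Y/W/H convention as TextItem.
type Box struct{ X, Y, W, H float64 }

// union returns the smallest box covering all inputs.
// The second result is false when there is nothing to cover.
func union(boxes []Box) (Box, bool) {
	if len(boxes) == 0 {
		return Box{}, false
	}
	minX, minY := boxes[0].X, boxes[0].Y
	maxX, maxY := boxes[0].X+boxes[0].W, boxes[0].Y+boxes[0].H
	for _, b := range boxes[1:] {
		minX = math.Min(minX, b.X)
		minY = math.Min(minY, b.Y)
		maxX = math.Max(maxX, b.X+b.W)
		maxY = math.Max(maxY, b.Y+b.H)
	}
	return Box{X: minX, Y: minY, W: maxX - minX, H: maxY - minY}, true
}

func main() {
	// "Total" and "$13,695.00" as two fragments on the same text line.
	u, ok := union([]Box{{X: 72, Y: 100, W: 30, H: 12}, {X: 110, Y: 100, W: 58, H: 12}})
	fmt.Println(u, ok)
}
```

A match that spans two lines produces a box covering both lines in full, which is usually acceptable for a review overlay but worth knowing when a highlight looks larger than the quoted text.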

The four-layer ladder is covered by TestFindSpan_Layered in pdf_test.go:

cases := []struct{ name, needle, wantStrat string }{
    {"exact verbatim", "$13,695.00", "exact"},
    {"whitespace-flexible", "Total   $13,695.00", "whitespace"},
    {"currency stripped", "13695.00", "currency"},
    {"alphanumeric only", "Total $13695 00", "alphanum"},
}
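To make the ladder concrete, here is a minimal sketch that yields the four strategies from those test cases for one sample page string. It works on plain text only and does not compute bounding boxes. The names `ladder` and `matchStrategy`, the sample text, and the choice to lowercase in every layer are assumptions, not draftcat's FindSpan.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// layer pairs a strategy name with the normalisation applied to both
// the page text and the cited text before comparing.
type layer struct {
	name string
	norm func(string) string
}

func collapseSpace(s string) string { return strings.Join(strings.Fields(s), " ") }

var currencyStripper = strings.NewReplacer("$", "", "€", "", "£", "", ",", "")

func alnumOnly(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return r
		}
		return -1 // drop everything else
	}, s)
}

// ladder is ordered strictest first; the first layer that matches wins.
var ladder = []layer{
	{"exact", strings.ToLower},
	{"whitespace", func(s string) string { return collapseSpace(strings.ToLower(s)) }},
	{"currency", func(s string) string { return collapseSpace(currencyStripper.Replace(strings.ToLower(s))) }},
	{"alphanum", func(s string) string { return alnumOnly(strings.ToLower(s)) }},
}

// matchStrategy reports the first layer under which needle occurs in pageText.
func matchStrategy(pageText, needle string) (string, bool) {
	for _, l := range ladder {
		n := l.norm(needle)
		if n == "" {
			continue // an empty needle would match anything
		}
		if strings.Contains(l.norm(pageText), n) {
			return l.name, true
		}
	}
	return "", false
}

func main() {
	pageText := "Invoice 42\nTotal\t$13,695.00\nThank you" // assumed sample page
	for _, needle := range []string{"$13,695.00", "Total   $13,695.00", "13695.00", "Total $13695 00", "$99,999.00"} {
		strat, ok := matchStrategy(pageText, needle)
		fmt.Printf("%-22q resolved=%-5v strategy=%s\n", needle, ok, strat)
	}
}
```

Ordering the layers from strict to loose means the returned strategy name is always the tightest normalisation that still matched, which is what makes it usable as a confidence signal.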

fail_on_unresolved: the boolean that matters

One variable on the verify step decides whether unresolved cites kill the run or are only recorded. The value is a string, and the dispatcher treats only the literal "true" as enabled:

vars:
  fail_on_unresolved: "true"   # compliance: any miss aborts

vars:
  fail_on_unresolved: "false"  # triage: log misses, keep going

Use true for any pipeline whose output ships to a downstream system: AP, CRM, contract DB. A single unresolved cite means at least one claim could not be traced to text on the cited page, whether the model fabricated it, cited the wrong page, or paraphrased past what the match ladder tolerates. Better to fail loud than write fiction to a system of record.

Use false for exploratory pipelines. A paralegal scanning 200 contracts for "any clauses worth a closer look" wants unresolved cites flagged, not the run killed. Default is false. Production pipelines should set it explicitly either way.
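Because the dispatcher compares the variable against the literal string "true", values such as "True" or "yes" behave as false without any warning. A stricter parse is one way to surface that. This is a sketch only: `parseFailOnUnresolved` is a hypothetical helper, and it assumes step vars are a map[string]string.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFailOnUnresolved returns the documented default (false) when the
// variable is absent or empty, and an error for values that are not a
// recognisable boolean, instead of silently treating them as false.
func parseFailOnUnresolved(vars map[string]string) (bool, error) {
	raw, present := vars["fail_on_unresolved"]
	if !present || raw == "" {
		return false, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("fail_on_unresolved: invalid boolean %q", raw)
	}
	return v, nil
}

func main() {
	for _, vars := range []map[string]string{
		{"fail_on_unresolved": "true"},
		{"fail_on_unresolved": "True"},
		{"fail_on_unresolved": "yes"},
		{},
	} {
		v, err := parseFailOnUnresolved(vars)
		fmt.Println(v, err)
	}
}
```

Rejecting an unparseable value at load time fits the compliance use case better than a silent downgrade to triage mode.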

What this doesn't catch

Honest limits, because the verifier is not a hallucination oracle:

In-context hallucinations. The model can cite real text out of context. If page 3 has boilerplate "termination notice: 30 days" and the operative clause on page 17 says "90 days," the page-3 cite resolves happily. Cite verification proves text-on-page, not semantic correctness.
Scanned PDFs. ledongthuc/pdf reads embedded text. A scanned image-PDF produces zero items per page. Pipe through ocrmypdf or equivalent first, then run pdf_extract on the OCR output. Quality of the verification follows quality of the OCR.
Tables and multi-column layouts. Reading order in PDFs is publisher-dependent. A two-column page may interleave columns in the page's Text field, breaking whitespace-layer matches even when the alphanum layer rescues them. Tables that need cell-by-cell extraction often call for a structured parser like tabula upstream.
Loose alphanum matches. The alphanum layer strips everything except letters and digits. Total 13695 00 matches both $13,695.00 and any coincidental sequence on a different line. The Strategy field in Span lets you reject alphanum hits in high-stakes pipelines.
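Two of the limits above can be turned into cheap guards: flag pages with no embedded text before the model runs, and flag citations that only resolved through the alphanum layer. The sketch below uses simplified stand-in types; `Span` is reduced to the Strategy field named above, and `Citation`, `pagesWithoutText`, and `looseMatches` are hypothetical.

```go
package main

import "fmt"

// Pared-down stand-ins for the draftcat types (assumptions).
type TextItem struct{ Text string }

type PDFPage struct {
	Num   int
	Items []TextItem
}

// Span carries only the strategy name here.
type Span struct{ Strategy string }

// Citation is a typed view of one entry in the citations result.
type Citation struct {
	Text     string
	Resolved bool
	Span     Span
}

// pagesWithoutText lists pages that produced zero text items, the
// signature of a scanned page that still needs OCR.
func pagesWithoutText(pages []PDFPage) []int {
	var empty []int
	for _, p := range pages {
		if len(p.Items) == 0 {
			empty = append(empty, p.Num)
		}
	}
	return empty
}

// looseMatches returns citations that resolved only via the alphanum layer.
func looseMatches(cites []Citation) []Citation {
	var loose []Citation
	for _, c := range cites {
		if c.Resolved && c.Span.Strategy == "alphanum" {
			loose = append(loose, c)
		}
	}
	return loose
}

func main() {
	pages := []PDFPage{{Num: 1, Items: []TextItem{{Text: "Total"}}}, {Num: 2}}
	fmt.Println("pages needing OCR:", pagesWithoutText(pages))

	cites := []Citation{
		{Text: "$13,695.00", Resolved: true, Span: Span{Strategy: "exact"}},
		{Text: "Total 13695 00", Resolved: true, Span: Span{Strategy: "alphanum"}},
	}
	fmt.Println("loose matches:", len(looseMatches(cites)))
}
```

Neither guard addresses the first limit: a cite that resolves exactly can still be quoted out of context, and that check stays with a human reviewer.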

Reference implementation

draftcat ships:

pdf.go: PDFParser, PDFDoc / PDFPage / TextItem / Span, FindSpan with the four-layer ladder, FormatPDFForPrompt
main.go: pdf_extract and pdf_verify_cite dispatcher cases, citeTagRe
validate.go: validKnownActions registration so draftcat validate accepts these actions
pdf_test.go: TestFindSpan_Layered and TestCiteTagRegex covering the match ladder and the regex

MIT license. Go module: github.com/renezander030/draftcat. Build with go build, run a pipeline with ./draftcat run pipelines/invoice-extract.yaml.

Reader contributions

If you have shipped cite verification in production, drop a comment with what you learned:

Are you using tesseract or another OCR for scanned-PDF preprocessing? Which threshold settings survived contact with real-world inputs?
What fuzzy-match thresholds did you settle on? Did you go beyond exact / whitespace / currency / alphanum, or simpler?
RAGAS, the open-source evaluation library that integrates with LangChain, ships a faithfulness metric. Have you compared its scoring to a deterministic cite-resolver on the same corpus?
For tables, did you pre-extract with tabula / camelot / pdfplumber before handing the cell grid to the LLM, or did you train the prompt to cite cell coordinates?

Series index

#1 Agent Approval Gates: draft, validate, approve, dispatch, audit
#2 Token Budgets: per-step and per-day caps, pre-flight checks
#3 Agentic Task System: Qdrant + Ollama for skill recall
#4 Driving CapCut from LLM agents: JSON-driven video composition
#5 SQLite Dedup + Crash Safety: content-hash dedup, atomic writes
#6 Prompt-Injection Defense: input quarantine, allowlist tools
#7 PDF cite verification (this entry)
skill_format reference: the YAML skill schema draftcat uses

This is Production AI Automation Notes #7. Follow @renezander030 for the next entry.

Changelog

2026-05-25: initial publish
