@jpatel3
Last active March 19, 2026 19:15
pdfOracle - Static PDF to Fillable Form Converter - Project Plan
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>pdfOracle - Updated Project Plan</title>
<style>
:root {
--bg: #0d1117;
--surface: #161b22;
--border: #30363d;
--text: #e6edf3;
--text-muted: #8b949e;
--accent: #58a6ff;
--accent-green: #3fb950;
--accent-orange: #d29922;
--accent-red: #f85149;
--accent-purple: #bc8cff;
--code-bg: #1c2128;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
background: var(--bg);
color: var(--text);
line-height: 1.6;
padding: 2rem;
max-width: 1100px;
margin: 0 auto;
}
h1 {
font-size: 2.2rem;
margin-bottom: 0.25rem;
background: linear-gradient(135deg, var(--accent), var(--accent-purple));
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
}
.subtitle {
color: var(--text-muted);
font-size: 1.1rem;
margin-bottom: 2rem;
border-bottom: 1px solid var(--border);
padding-bottom: 1.5rem;
}
h2 {
font-size: 1.5rem;
margin: 2.5rem 0 1rem 0;
color: var(--accent);
display: flex;
align-items: center;
gap: 0.5rem;
flex-wrap: wrap;
}
h2 .badge {
font-size: 0.7rem;
padding: 2px 8px;
border-radius: 12px;
font-weight: 600;
text-transform: uppercase;
letter-spacing: 0.5px;
}
.badge-green { background: rgba(63,185,80,0.15); color: var(--accent-green); }
.badge-orange { background: rgba(210,153,34,0.15); color: var(--accent-orange); }
.badge-purple { background: rgba(188,140,255,0.15); color: var(--accent-purple); }
.badge-red { background: rgba(248,81,73,0.15); color: var(--accent-red); }
.badge-new { background: rgba(248,81,73,0.2); color: var(--accent-red); font-weight: 800; }
h3 {
font-size: 1.15rem;
margin: 1.5rem 0 0.75rem 0;
color: var(--text);
}
p, li { color: var(--text); margin-bottom: 0.5rem; }
.card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.5rem;
margin: 1rem 0;
}
.card-highlight { border-left: 3px solid var(--accent); }
.card-green { border-left: 3px solid var(--accent-green); }
.card-orange { border-left: 3px solid var(--accent-orange); }
.card-red { border-left: 3px solid var(--accent-red); }
table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0;
font-size: 0.9rem;
}
th {
background: var(--code-bg);
color: var(--accent);
text-align: left;
padding: 10px 14px;
border: 1px solid var(--border);
font-weight: 600;
}
td {
padding: 10px 14px;
border: 1px solid var(--border);
vertical-align: top;
}
tr:nth-child(even) td { background: rgba(22,27,34,0.5); }
code {
background: var(--code-bg);
padding: 2px 6px;
border-radius: 4px;
font-family: 'SF Mono', 'Fira Code', monospace;
font-size: 0.85em;
color: var(--accent-orange);
}
pre {
background: var(--code-bg);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.25rem;
overflow-x: auto;
font-family: 'SF Mono', 'Fira Code', monospace;
font-size: 0.85rem;
line-height: 1.5;
margin: 1rem 0;
color: var(--text-muted);
}
.architecture-diagram {
background: var(--code-bg);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.5rem;
margin: 1.5rem 0;
overflow-x: auto;
text-align: center;
}
.flow-box {
display: inline-block;
background: var(--surface);
border: 2px solid var(--accent);
border-radius: 8px;
padding: 12px 20px;
margin: 6px;
font-weight: 600;
font-size: 0.9rem;
}
.flow-box-secondary {
border-color: var(--accent-purple);
font-size: 0.85rem;
}
.flow-box-new {
border-color: var(--accent-red);
box-shadow: 0 0 12px rgba(248,81,73,0.2);
}
.flow-arrow {
display: block;
color: var(--text-muted);
font-size: 1.5rem;
margin: 4px 0;
}
.flow-row {
display: flex;
justify-content: center;
gap: 12px;
flex-wrap: wrap;
margin: 6px 0;
}
.phase-header {
display: flex;
justify-content: space-between;
align-items: baseline;
flex-wrap: wrap;
gap: 0.5rem;
}
.timeline {
color: var(--text-muted);
font-size: 0.85rem;
font-style: italic;
}
.file-tree {
font-family: 'SF Mono', 'Fira Code', monospace;
font-size: 0.85rem;
line-height: 1.8;
color: var(--text-muted);
}
.file-tree .dir { color: var(--accent); font-weight: 600; }
.file-tree .file { color: var(--text); }
.file-tree .file-new { color: var(--accent-red); font-weight: 600; }
.file-tree .comment { color: var(--text-muted); font-style: italic; }
ul { padding-left: 1.5rem; margin: 0.5rem 0; }
li { margin-bottom: 0.35rem; }
.grid-2 {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1rem;
}
@media (max-width: 768px) {
.grid-2 { grid-template-columns: 1fr; }
body { padding: 1rem; }
h1 { font-size: 1.6rem; }
}
.cost-tag {
display: inline-block;
background: rgba(63,185,80,0.15);
color: var(--accent-green);
padding: 2px 10px;
border-radius: 12px;
font-size: 0.8rem;
font-weight: 700;
}
.new-tag {
display: inline-block;
background: rgba(248,81,73,0.2);
color: var(--accent-red);
padding: 1px 6px;
border-radius: 4px;
font-size: 0.7rem;
font-weight: 700;
text-transform: uppercase;
letter-spacing: 0.5px;
vertical-align: middle;
margin-left: 4px;
}
.approval-box {
background: linear-gradient(135deg, rgba(88,166,255,0.1), rgba(188,140,255,0.1));
border: 2px solid var(--accent);
border-radius: 12px;
padding: 2rem;
text-align: center;
margin: 3rem 0 2rem 0;
}
.approval-box h2 {
justify-content: center;
margin-top: 0;
}
.btn {
display: inline-block;
padding: 10px 28px;
border-radius: 8px;
font-weight: 600;
font-size: 1rem;
text-decoration: none;
margin: 0.5rem;
cursor: default;
}
.btn-approve { background: var(--accent-green); color: #000; }
.btn-revise { background: var(--border); color: var(--text); }
.tag-row { display: flex; gap: 0.5rem; flex-wrap: wrap; margin: 0.5rem 0; }
.tag {
font-size: 0.75rem;
padding: 2px 8px;
border-radius: 4px;
background: var(--code-bg);
border: 1px solid var(--border);
color: var(--text-muted);
}
.comparison-better { color: var(--accent-green); font-weight: 600; }
.comparison-same { color: var(--text-muted); }
.comparison-na { color: var(--text-muted); font-style: italic; }
.workflow-box {
background: var(--code-bg);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1rem 1.5rem;
margin: 0.75rem 0;
font-family: 'SF Mono', 'Fira Code', monospace;
font-size: 0.85rem;
color: var(--text-muted);
}
.workflow-box strong { color: var(--text); }
.workflow-box .highlight { color: var(--accent-red); font-weight: 600; }
.context-grid {
display: grid;
grid-template-columns: 1fr 1fr 1fr;
gap: 1rem;
margin: 1rem 0;
}
@media (max-width: 900px) {
.context-grid { grid-template-columns: 1fr; }
}
.context-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.25rem;
}
.context-card h4 {
color: var(--accent);
margin-bottom: 0.5rem;
font-size: 0.95rem;
}
.context-card p {
font-size: 0.85rem;
color: var(--text-muted);
}
.mapping-flow {
display: flex;
align-items: center;
justify-content: center;
gap: 8px;
flex-wrap: wrap;
margin: 1rem 0;
padding: 1rem;
background: var(--code-bg);
border-radius: 8px;
border: 1px solid var(--border);
}
.mapping-flow .step {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 6px;
padding: 8px 14px;
font-size: 0.8rem;
text-align: center;
}
.mapping-flow .arrow {
color: var(--text-muted);
font-size: 1.2rem;
}
.mapping-flow .step-highlight {
border-color: var(--accent-red);
background: rgba(248,81,73,0.08);
}
</style>
</head>
<body>
<h1>pdfOracle</h1>
<p class="subtitle">
Static PDF &rarr; Fillable Form Converter <strong>with Fiserv XP2 / MISMO Variable Mapping</strong><br>
<small>In-House LiquidOffice Replacement &mdash; Open Source, Zero Token Cost</small>
</p>
<!-- ───────────────── OVERVIEW ───────────────── -->
<div class="card card-highlight">
<h3>What This Does</h3>
<p>Replaces the <strong>LiquidOffice Form Designer</strong> workflow: takes static PDFs from banks, auto-detects fillable fields, <strong>maps each field to Fiserv XP2 / MISMO standard variables</strong>, and returns the mapped fillable PDF + audit manifest to the institution.</p>
<div class="tag-row">
<span class="tag">Field Auto-Detection</span>
<span class="tag">XP2 Lending Mapping</span>
<span class="tag">XP2 MSP Mapping</span>
<span class="tag">MISMO Standard</span>
<span class="tag">Operator Review UI</span>
<span class="tag">Mapping Manifest Export</span>
<span class="tag">100% Format Preservation</span>
<span class="cost-tag">$0 Token Cost</span>
</div>
</div>
<!-- ───────────────── WORKFLOW COMPARISON ───────────────── -->
<h2>Workflow: Current vs. Target</h2>
<div class="workflow-box">
<strong>Current (LiquidOffice):</strong><br>
Bank sends PDF &rarr; LiquidOffice Form Designer &rarr; <strong>Manual</strong> field creation &rarr; <strong>Manual</strong> variable mapping (XP2/MISMO) &rarr; Return fillable PDF
</div>
<div class="workflow-box" style="border-color: var(--accent-green);">
<strong>Target (pdfOracle):</strong><br>
Bank sends PDF &rarr; pdfOracle <span class="highlight">auto-detects</span> fields &rarr; <span class="highlight">auto-suggests</span> variable mappings &rarr; Operator <strong>reviews/approves</strong> &rarr; Export fillable PDF + mapping manifest &rarr; Return to bank
</div>
<!-- ───────────────── INDUSTRY CONTEXT ───────────────── -->
<h2>Industry Context <span class="badge badge-new">Updated</span></h2>
<div class="context-grid">
<div class="context-card">
<h4>Fiserv XP2</h4>
<p>Credit union core banking platform (.NET + DB2). Field specs in WSDL (SOAP/XML) &mdash; <strong>NDA-protected</strong>, provided to licensed customers. Your org should have the XP2 XML guide with all data elements.</p>
</div>
<div class="context-card">
<h4>MISMO Standard</h4>
<p>Canonical industry standard for mortgage/lending field naming. <strong>UpperCamelCase</strong> convention: <code>BorrowerFullName</code>, <code>LoanAmount</code>. Suffixes: <code>Indicator</code>=bool, <code>Amount</code>=money, <code>Date</code>=date, <code>Type</code>=enum.</p>
</div>
<div class="context-card">
<h4>LiquidOffice (Replacing)</h4>
<p>OpenText product, still active (v25.2) but aging UI. ~&pound;36K/license. We replace the <strong>form design + variable mapping</strong> workflow only &mdash; NOT the BPM/routing.</p>
</div>
</div>
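<p>The MISMO suffix conventions above lend themselves to a simple lookup. The following is a minimal sketch, assuming a hypothetical helper named <code>infer_type</code> (not part of any MISMO tooling) that falls back to <code>text</code> when no suffix applies:</p>

```python
# Hypothetical helper: infer a form-field data type from a MISMO-style
# UpperCamelCase name, using only the four suffix conventions listed above.
SUFFIX_TYPES = {
    "Indicator": "bool",
    "Amount": "money",
    "Date": "date",
    "Type": "enum",
}


def infer_type(variable: str, default: str = "text") -> str:
    """Return the data type implied by the variable's suffix."""
    for suffix, data_type in SUFFIX_TYPES.items():
        if variable.endswith(suffix):
            return data_type
    return default


for name in ("LoanAmount", "ApplicationReceivedDate", "BorrowerFullName"):
    print(name, "->", infer_type(name))
```

<p>A type inferred this way could be cross-checked against the detected field type before a mapping is proposed.</p>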
<!-- ───────────────── ARCHITECTURE ───────────────── -->
<h2>Architecture <span class="badge badge-new">Updated</span></h2>
<div class="architecture-diagram">
<div class="flow-box">Bank Sends Static PDF</div>
<span class="flow-arrow">&darr;</span>
<div class="flow-box" style="border-color: var(--accent-green);">1. PDF Parser (PyMuPDF)<br><small style="color:var(--text-muted)">Text spans, lines, rects, fonts, coordinates</small></div>
<span class="flow-arrow">&darr;</span>
<div class="flow-row">
<div class="flow-box flow-box-secondary">Underline<br>Detector</div>
<div class="flow-box flow-box-secondary">Box / Rect<br>Detector</div>
<div class="flow-box flow-box-secondary">Placeholder<br>Text Detector</div>
<div class="flow-box flow-box-secondary">Table Grid<br>Detector</div>
</div>
<span class="flow-arrow">&darr;</span>
<div class="flow-box" style="border-color: var(--accent-orange);">3. Field Classifier + Label Associator<br><small style="color:var(--text-muted)">Type: text | checkbox | date | signature | number</small></div>
<span class="flow-arrow">&darr;</span>
<div class="flow-box flow-box-new">
4. Variable Mapping Engine<span class="new-tag">NEW</span><br>
<small style="color:var(--text-muted)">
"Applicant Name" &rarr; <code style="font-size:0.8em">BorrowerFullName</code> (MISMO)<br>
"Date" &rarr; <code style="font-size:0.8em">ApplicationReceivedDate</code> (MISMO)<br>
"Rs.___" &rarr; <code style="font-size:0.8em">DepositAmount</code> (XP2)<br>
Registry + fuzzy matching &mdash; no AI tokens
</small>
</div>
<span class="flow-arrow">&darr;</span>
<div class="flow-box flow-box-new">
5. Operator Review<span class="new-tag">NEW</span><br>
<small style="color:var(--text-muted)">Approve / edit each mapping &bull; Confidence color-coding</small>
</div>
<span class="flow-arrow">&darr;</span>
<div class="flow-box" style="border-color: var(--accent-green);">6. AcroForm Widget Writer (PyMuPDF)<br><small style="color:var(--text-muted)">field_name = mapped variable &bull; Annotation layer only</small></div>
<span class="flow-arrow">&darr;</span>
<div class="flow-row">
<div class="flow-box">Fillable PDF<br><small style="color:var(--text-muted)">Original format preserved<br>Fields named with spec vars</small></div>
<div class="flow-box flow-box-new">Mapping Manifest<span class="new-tag">NEW</span><br><small style="color:var(--text-muted)">JSON/CSV audit trail<br>field &rarr; variable &rarr; spec</small></div>
</div>
</div>
<!-- ───────────────── MAPPING ENGINE ───────────────── -->
<h2>Variable Mapping Engine <span class="badge badge-red">Core Differentiator</span></h2>
<p>The key value-add isn't just "make fillable" &mdash; it's the <strong>field-to-variable mapping layer</strong> against financial specs. No AI required.</p>
<div class="mapping-flow">
<div class="step">PDF Label<br><code>"please insert name"</code></div>
<span class="arrow">&rarr;</span>
<div class="step">Normalize<br><code>"applicant name"</code></div>
<span class="arrow">&rarr;</span>
<div class="step step-highlight">Fuzzy Match<br>against registry</div>
<span class="arrow">&rarr;</span>
<div class="step">Spec Variable<br><code>BorrowerFullName</code></div>
<span class="arrow">&rarr;</span>
<div class="step">Operator<br>Review</div>
</div>
<div class="card card-red">
<h3>How Mapping Works (Zero AI)</h3>
<table>
<tr><th>Step</th><th>Method</th><th>Example</th></tr>
<tr><td>1. Exact match</td><td>Label = known alias verbatim</td><td>"borrower name" &rarr; <code>BorrowerFullName</code></td></tr>
<tr><td>2. Alias match</td><td>Normalized label matches alias</td><td>"Name of Applicant" &rarr; <code>BorrowerFullName</code></td></tr>
<tr><td>3. Fuzzy match</td><td><code>rapidfuzz</code> <code>token_sort_ratio</code> score &gt;80 (0&ndash;100 scale)</td><td>"Applicant's Full Name" &rarr; <code>BorrowerFullName</code></td></tr>
<tr><td>4. Keyword + type</td><td>Label keywords + field type</td><td>"date" + type=date &rarr; <code>ApplicationReceivedDate</code></td></tr>
<tr><td>5. Context clue</td><td>Surrounding page text</td><td>"business of ___" &rarr; <code>NatureOfBusinessDescription</code></td></tr>
</table>
</div>
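<p>The first three steps of the table can be sketched as a cascade. This is an illustrative sketch only: the tiny <code>ALIASES</code> table, the <code>normalize</code> rules, and the function names are assumptions, and the score uses the standard library's <code>difflib</code> as a stand-in for <code>rapidfuzz</code> so the example has no third-party dependency. Steps 4 and 5 (keyword + type, context clue) are not shown.</p>

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative registry: alias -> spec variable (aliases taken from the plan).
ALIASES = {
    "borrower name": "BorrowerFullName",
    "applicant name": "BorrowerFullName",
    "name of applicant": "BorrowerFullName",
    "application date": "ApplicationReceivedDate",
    "deposit amount": "DepositAmount",
}


def normalize(label: str) -> str:
    """Lowercase, drop possessives and punctuation, collapse whitespace."""
    label = label.lower().replace("'s", "")
    label = re.sub(r"[^a-z0-9 ]+", " ", label)
    return " ".join(label.split())


def token_sort_score(a: str, b: str) -> float:
    """Stdlib stand-in for a token-sort ratio, on a 0-100 scale."""
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return 100.0 * SequenceMatcher(None, a_sorted, b_sorted).ratio()


def map_label(label: str, threshold: float = 80.0):
    """Return (variable, method, score), or (None, 'unmapped', best_score)."""
    if label in ALIASES:                       # 1. exact match
        return ALIASES[label], "exact", 100.0
    norm = normalize(label)
    if norm in ALIASES:                        # 2. alias match after normalizing
        return ALIASES[norm], "alias", 100.0
    best_alias = max(ALIASES, key=lambda alias: token_sort_score(norm, alias))
    score = token_sort_score(norm, best_alias)
    if score > threshold:                      # 3. fuzzy match
        return ALIASES[best_alias], "fuzzy", score
    return None, "unmapped", score


for text in ("borrower name", "Name of Applicant", "Applicant's Full Name"):
    print(text, "->", map_label(text)[:2])
```

<p>Returning the method alongside the score lets the review UI show why a mapping was proposed, which supports the audit-trail goal.</p>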
<h3>Mapping Registry Format</h3>
<pre>
# mappings/mismo_lending.yaml
spec: MISMO
version: "3.4"
mappings:
  - variable: BorrowerFullName
    type: text
    container: "PARTY > INDIVIDUAL > NAME"
    aliases:
      - "applicant name"
      - "borrower name"
      - "please insert name"
      - "full name"
  - variable: ApplicationReceivedDate
    type: date
    container: "LOAN > LOAN_DETAIL"
    aliases:
      - "date"
      - "application date"
      - "please insert date"
  - variable: BusinessName
    type: text
    container: "PARTY > LEGAL_ENTITY"
    aliases:
      - "business name"
      - "company name"
      - "M/s"
  - variable: DepositAmount
    type: number
    container: "ASSET > ASSET_DETAIL"
    aliases:
      - "depositing a sum"
      - "deposit amount"
      - "Rs."
</pre>
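<p>Once loaded, a registry like the one above can be flattened into an alias lookup. The sketch below assumes the dict mirrors what <code>yaml.safe_load()</code> would return for that file (trimmed to two entries); <code>build_alias_index</code> is a hypothetical name, and rejecting an alias claimed by two variables is a suggested safeguard rather than something the plan specifies.</p>

```python
# Assumption: this dict mirrors what yaml.safe_load() would return for the
# registry file above (trimmed to two entries for brevity).
REGISTRY = {
    "spec": "MISMO",
    "version": "3.4",
    "mappings": [
        {
            "variable": "BorrowerFullName",
            "type": "text",
            "container": "PARTY > INDIVIDUAL > NAME",
            "aliases": ["applicant name", "borrower name", "full name"],
        },
        {
            "variable": "BusinessName",
            "type": "text",
            "container": "PARTY > LEGAL_ENTITY",
            "aliases": ["business name", "company name", "M/s"],
        },
    ],
}


def build_alias_index(registry: dict) -> dict[str, str]:
    """Map each lowercased alias to its variable; reject ambiguous aliases."""
    index: dict[str, str] = {}
    for entry in registry["mappings"]:
        for alias in entry["aliases"]:
            key = alias.strip().lower()
            owner = index.setdefault(key, entry["variable"])
            if owner != entry["variable"]:
                raise ValueError(
                    f"alias {key!r} maps to both {owner} and {entry['variable']}"
                )
    return index


index = build_alias_index(REGISTRY)
print(index["m/s"])
```

<p>Failing fast on duplicate aliases keeps the exact-match step deterministic, since each label can resolve to only one variable per spec.</p>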
<h3>What You Need to Provide</h3>
<div class="card card-orange">
<table>
<tr><th>Item</th><th>Why</th><th>Format</th></tr>
<tr><td>XP2 WSDL file</td><td>Extract all XP2 field names and types</td><td>XML</td></tr>
<tr><td>XP2 XML guide</td><td>Field descriptions and business rules</td><td>PDF/doc</td></tr>
<tr><td>MISMO LDD workbook (if available)</td><td>Standard field dictionary</td><td>Excel</td></tr>
<tr><td>Existing LiquidOffice form exports</td><td>Seed registry with proven label&rarr;variable pairs</td><td>XML/CSV</td></tr>
<tr><td>Any existing mapping spreadsheets</td><td>Common mappings your team already uses</td><td>Excel/CSV</td></tr>
</table>
</div>
<!-- ───────────────── TECH STACK ───────────────── -->
<h2>Tech Stack</h2>
<table>
<tr><th>Component</th><th>Choice</th><th>License</th><th>Why</th></tr>
<tr><td>Language</td><td><code>Python 3.11+</code></td><td>&mdash;</td><td>Best PDF library ecosystem</td></tr>
<tr><td>PDF Read/Write</td><td><code>PyMuPDF (fitz)</code></td><td>AGPL</td><td>Single lib for extraction + widget writing</td></tr>
<tr><td>Fuzzy Matching</td><td><code>rapidfuzz</code><span class="new-tag">NEW</span></td><td>MIT</td><td>Label &rarr; variable fuzzy matching</td></tr>
<tr><td>Mapping Registry</td><td><code>YAML + PyYAML</code><span class="new-tag">NEW</span></td><td>MIT</td><td>Human-editable, versionable mapping files</td></tr>
<tr><td>ML Detection</td><td><code>CommonForms/FFDNet</code></td><td>MIT</td><td>YOLO11-based; reported by its authors to outperform Adobe Acrobat's field detection; runs locally</td></tr>
<tr><td>Web Framework</td><td><code>FastAPI</code></td><td>MIT</td><td>Async file upload, auto API docs</td></tr>
<tr><td>Frontend</td><td><code>Vanilla HTML/JS + pdf.js</code></td><td>Apache 2.0</td><td>PDF preview + mapping review UI</td></tr>
<tr><td>WSDL Parser</td><td><code>lxml</code><span class="new-tag">NEW</span></td><td>BSD</td><td>Import XP2 field specs from WSDL</td></tr>
<tr><td>Packaging</td><td><code>Docker</code></td><td>&mdash;</td><td>Reproducible deployment</td></tr>
</table>
<!-- ───────────────── PHASES ───────────────── -->
<!-- PHASE 1 -->
<h2>Phase 1 &mdash; Core Engine + Variable Mapping <span class="badge badge-green">No AI</span> <span class="badge badge-new">Updated</span></h2>
<div class="phase-header">
<p><strong>Goal:</strong> CLI + library that detects fields AND maps to spec variables</p>
<span class="timeline">Week 1&ndash;2</span>
</div>
<div class="card">
<h3>Project Structure</h3>
<div class="file-tree">
<span class="dir">pdfOracle/</span><br>
&nbsp;&nbsp;<span class="file">pyproject.toml</span> &nbsp; <span class="file">Dockerfile</span><br>
&nbsp;&nbsp;<span class="dir">src/pdforacle/</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">cli.py</span> <span class="comment"># CLI entry point</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">parser.py</span> <span class="comment"># PDF extraction</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">detector.py</span> <span class="comment"># Field detection rules</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">classifier.py</span> <span class="comment"># Field type classification</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">labeler.py</span> <span class="comment"># Label-to-field association</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file-new">mapper.py</span> <span class="comment"># ★ Variable mapping engine</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file-new">exporter.py</span> <span class="comment"># ★ Mapping manifest export</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">writer.py</span> <span class="comment"># AcroForm widget writer</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">models.py</span> <span class="comment"># Data classes</span><br>
&nbsp;&nbsp;<span class="dir" style="color:var(--accent-red)">mappings/</span> <span class="comment"># ★ Variable mapping registries</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file-new">mismo_lending.yaml</span> <span class="comment"># MISMO standard</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file-new">mismo_servicing.yaml</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file-new">xp2_lending.yaml</span> <span class="comment"># From your WSDL</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file-new">xp2_msp.yaml</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="dir" style="color:var(--accent-red)">custom/</span> <span class="comment"># Per-institution overrides</span><br>
&nbsp;&nbsp;<span class="dir">tests/</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="file">test_parser.py</span> / <span class="file">test_detector.py</span> / <span class="file-new">test_mapper.py</span> / <span class="file">test_writer.py</span><br>
</div>
</div>
<h3>Detection Rules (unchanged)</h3>
<table>
<tr><th>Visual Pattern</th><th>Detection Rule</th><th>Field Type</th></tr>
<tr><td>Horizontal line near text</td><td>Line width &gt;50pt, height &lt;2pt</td><td>Text Input</td></tr>
<tr><td>Consecutive <code>______</code> chars</td><td>3+ underscore characters</td><td>Text Input</td></tr>
<tr><td>Empty rectangle</td><td>No fill, aspect &gt;3:1</td><td>Text Input</td></tr>
<tr><td>Small square (8&ndash;14pt)</td><td>Aspect ~1:1, width 8&ndash;14pt</td><td>Checkbox</td></tr>
<tr><td>Table grid cells</td><td>Line intersections</td><td>Text Inputs</td></tr>
<tr><td><code>(Please insert...)</code></td><td>Parenthetical hint</td><td>Label hint</td></tr>
</table>
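<p>The geometric rules in the table translate fairly directly into code. The sketch below is an assumption-laden illustration: <code>Shape</code> is a hypothetical model (not a PyMuPDF type), all sizes are in PDF points, and the &plusmn;10% tolerance used for "aspect ~1:1" is a guess that would need tuning on real forms.</p>

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shape:
    """A drawn primitive in PDF points (hypothetical model, not PyMuPDF's)."""
    width: float
    height: float
    filled: bool = False


UNDERSCORE_RUN = re.compile(r"_{3,}")  # 3+ consecutive underscores


def classify_shape(shape: Shape) -> Optional[str]:
    """Apply the geometric rules from the table; None means 'not a field'."""
    if shape.width > 50 and 2 > shape.height:
        return "text"                      # horizontal rule next to a label
    if shape.filled or shape.height == 0:
        return None
    aspect = shape.width / shape.height
    if 8 <= shape.width <= 14 and 0.9 <= aspect <= 1.1:
        return "checkbox"                  # small, roughly square box
    if aspect > 3:
        return "text"                      # wide empty rectangle
    return None


def has_blank(text: str) -> bool:
    """True when a text span contains a fill-in blank like 'Name: ____'."""
    return UNDERSCORE_RUN.search(text) is not None


print(classify_shape(Shape(120, 0.5)), classify_shape(Shape(10, 10)))
```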
<h3>CLI Usage <span class="badge badge-new">Updated</span></h3>
<pre>
# Detect + map + convert (MISMO spec)
pdforacle convert input.pdf -o output.pdf --spec mismo
# Preview detected fields + proposed mappings
pdforacle convert input.pdf --preview --spec xp2_lending
# Export mapping manifest only
pdforacle map input.pdf --spec mismo --format json -o manifest.json
# Use custom institution mappings
pdforacle convert input.pdf --spec custom/example_bank
# List available specs
pdforacle specs list
</pre>
<h3>Mapping Manifest Output <span class="new-tag">NEW</span></h3>
<pre>
{
  "source_pdf": "LCC_Account_Opening_Application.pdf",
  "spec": "mismo",
  "spec_version": "3.4",
  "mapped_fields": [
    {
      "pdf_field_name": "BorrowerFullName",
      "label": "please insert name",
      "variable": "BorrowerFullName",
      "container": "PARTY > INDIVIDUAL > NAME",
      "field_type": "text",
      "page": 1,
      "mapping_confidence": 0.92
    }
  ],
  "mapping_stats": {
    "total_fields": 8,
    "auto_mapped": 6,
    "needs_review": 2
  }
}
</pre>
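<p>The <code>mapping_stats</code> block can be derived from the per-field confidences rather than tracked separately. A minimal sketch, assuming a hypothetical <code>build_manifest</code> helper and a review threshold of 0.80 (the plan does not fix this value); the second sample field is invented for illustration:</p>

```python
import json

REVIEW_THRESHOLD = 0.80  # assumption: below this, a mapping needs operator review


def build_manifest(source_pdf: str, spec: str, spec_version: str,
                   fields: list[dict]) -> dict:
    """Assemble the manifest and derive mapping_stats from field confidence."""
    auto = [
        f for f in fields
        if f["variable"] and f["mapping_confidence"] >= REVIEW_THRESHOLD
    ]
    return {
        "source_pdf": source_pdf,
        "spec": spec,
        "spec_version": spec_version,
        "mapped_fields": fields,
        "mapping_stats": {
            "total_fields": len(fields),
            "auto_mapped": len(auto),
            "needs_review": len(fields) - len(auto),
        },
    }


fields = [
    {"pdf_field_name": "BorrowerFullName", "label": "please insert name",
     "variable": "BorrowerFullName", "field_type": "text", "page": 1,
     "mapping_confidence": 0.92},
    # Hypothetical unmapped field, included to exercise the review count.
    {"pdf_field_name": "field_2", "label": "ref no", "variable": None,
     "field_type": "text", "page": 1, "mapping_confidence": 0.0},
]
manifest = build_manifest("LCC_Account_Opening_Application.pdf", "mismo", "3.4", fields)
print(json.dumps(manifest["mapping_stats"]))
```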
<!-- PHASE 2 -->
<h2>Phase 2 &mdash; ML-Assisted Detection <span class="badge badge-orange">Local Model &bull; Still $0</span></h2>
<div class="phase-header">
<p><strong>Goal:</strong> Boost detection accuracy on diverse/messy forms</p>
<span class="timeline">Week 3</span>
</div>
<div class="card card-orange">
<ul>
<li><code>ml_detector.py</code> &mdash; wraps CommonForms/FFDNet (YOLO11)</li>
<li><code>ensemble.py</code> &mdash; combines heuristic + ML, snaps to PDF primitives</li>
<li>Runs <strong>100% locally</strong> (CPU ~5-10s/page, GPU &lt;1s)</li>
</ul>
<table>
<tr><th>Scenario</th><th>Confidence</th><th>Action</th></tr>
<tr><td>Heuristic + ML agree</td><td style="color:var(--accent-green)">High</td><td>Use directly</td></tr>
<tr><td>Only one detects</td><td style="color:var(--accent-orange)">Medium</td><td>Use, flag for review</td></tr>
<tr><td>They disagree</td><td style="color:var(--accent-red)">Low</td><td>Flag for manual review</td></tr>
</table>
</div>
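<p>One way to realize the confidence table is to pair heuristic and ML detections by box overlap. The sketch below rests on assumptions the plan does not spell out: "agree" is taken to mean an intersection-over-union of at least 0.5 with the same field type, "disagree" means overlapping boxes with different types, and the names <code>Box</code>, <code>iou</code>, and <code>ensemble</code> are hypothetical.</p>

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    kind: str  # "text", "checkbox", ...


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def ensemble(heuristic: list[Box], ml: list[Box], min_iou: float = 0.5):
    """Pair overlapping detections and grade each result high/medium/low."""
    results, used = [], set()
    for h in heuristic:
        match = next((i for i, m in enumerate(ml)
                      if i not in used and iou(h, m) >= min_iou), None)
        if match is None:
            results.append((h, "medium"))          # only the heuristic fired
            continue
        used.add(match)
        agree = ml[match].kind == h.kind
        # Keep the heuristic box: it was derived from actual PDF primitives.
        results.append((h, "high" if agree else "low"))
    # ML detections that no heuristic box overlapped.
    results += [(m, "medium") for i, m in enumerate(ml) if i not in used]
    return results
```

<p>Keeping the heuristic box when both detectors fire is one reading of "snaps to PDF primitives"; a fuller version might snap ML-only boxes to the nearest line or rectangle as well.</p>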
<!-- PHASE 3 -->
<h2>Phase 3 &mdash; Web UI + Mapping Review <span class="badge badge-purple">Operator Workflow</span> <span class="badge badge-new">Updated</span></h2>
<div class="phase-header">
<p><strong>Goal:</strong> Browser-based interface with mapping review/approval workflow</p>
<span class="timeline">Week 4</span>
</div>
<div class="card">
<h3>API Endpoints</h3>
<table>
<tr><th>Method</th><th>Endpoint</th><th>Description</th></tr>
<tr><td><code>POST</code></td><td>/api/upload</td><td>Upload PDF, returns job ID</td></tr>
<tr><td><code>GET</code></td><td>/api/detect/{id}</td><td>Detected fields as JSON</td></tr>
<tr><td><code>GET</code></td><td>/api/mappings/{id}</td><td><span class="new-tag">NEW</span> Proposed variable mappings</td></tr>
<tr><td><code>PUT</code></td><td>/api/mappings/{id}</td><td><span class="new-tag">NEW</span> Operator approves/edits mappings</td></tr>
<tr><td><code>POST</code></td><td>/api/convert/{id}</td><td>Generate fillable PDF with mapped vars</td></tr>
<tr><td><code>GET</code></td><td>/api/download/{id}</td><td>Download fillable PDF</td></tr>
<tr><td><code>GET</code></td><td>/api/manifest/{id}</td><td><span class="new-tag">NEW</span> Download mapping manifest</td></tr>
<tr><td><code>GET</code></td><td>/api/specs</td><td><span class="new-tag">NEW</span> List available mapping specs</td></tr>
</table>
<h3>Mapping Review UI <span class="new-tag">NEW</span></h3>
<ul>
<li>PDF preview with detected fields highlighted</li>
<li><strong>Mapping review panel</strong> for each field:
<ul>
<li>Field label (from PDF) &rarr; Proposed variable (from spec)</li>
<li>Confidence score (color: green / yellow / red)</li>
<li>Dropdown to select alternative mappings</li>
<li>Search box to find any spec variable manually</li>
</ul>
</li>
<li><strong>Spec selector:</strong> MISMO, XP2 Lending, XP2 MSP, custom</li>
<li>Approve all / approve individually</li>
<li>Download fillable PDF + mapping manifest</li>
<li>Corrections saved back to registry for future improvement</li>
</ul>
</div>
<!-- PHASE 4 -->
<h2>Phase 4 &mdash; Batch Processing + Learning <span class="badge badge-purple">Scale</span> <span class="badge badge-new">Updated</span></h2>
<div class="phase-header">
<p><strong>Goal:</strong> Process multiple PDFs, learn from operator corrections</p>
<span class="timeline">Week 5</span>
</div>
<div class="card">
<ul>
<li>Batch upload (<code>POST /api/batch</code>)</li>
<li>Async queue + webhook callbacks</li>
<li>API key auth for external integrations</li>
<li><span class="new-tag">NEW</span> <strong>Mapping memory:</strong> operator corrections feed back into registry</li>
<li><span class="new-tag">NEW</span> <strong>Institution profiles:</strong> per-bank mapping preferences</li>
<li>Processing stats + mapping accuracy dashboard</li>
</ul>
</div>
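<p>Mapping memory and institution profiles could interact roughly as follows. This is a sketch under assumptions: the registry is simplified to a <code>variable &rarr; aliases</code> dict, the function names are hypothetical, and the policy of writing every correction to the institution's overrides while only adding unclaimed aliases to the shared registry is one possible design, not the plan's stated behavior. <code>CoBorrowerFullName</code> is used purely as an illustrative variable name.</p>

```python
def record_correction(registry: dict, overrides: dict, institution: str,
                      label: str, variable: str) -> None:
    """Store an operator-approved label -> variable pair.

    The pair always goes into the institution's override table; the shared
    registry only gains the alias when no variable already claims it.
    """
    key = " ".join(label.lower().split())
    overrides.setdefault(institution, {})[key] = variable
    owners = [v for v, aliases in registry.items() if key in aliases]
    if not owners:
        registry.setdefault(variable, []).append(key)


def lookup(registry: dict, overrides: dict, institution: str, label: str):
    """Institution overrides win over the shared registry."""
    key = " ".join(label.lower().split())
    if key in overrides.get(institution, {}):
        return overrides[institution][key]
    return next((v for v, aliases in registry.items() if key in aliases), None)


registry = {"BorrowerFullName": ["applicant name"], "BusinessName": ["business name"]}
overrides: dict = {}

# A new alias is learned globally; a conflicting one stays institution-local.
record_correction(registry, overrides, "example_bank", "Name of Firm", "BusinessName")
record_correction(registry, overrides, "example_bank", "Applicant Name", "CoBorrowerFullName")
print(lookup(registry, overrides, "example_bank", "applicant name"))
print(lookup(registry, overrides, "other_bank", "applicant name"))
```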
<!-- ───────────────── DECISIONS ───────────────── -->
<h2>Key Design Decisions</h2>
<table>
<tr><th>Decision</th><th>Choice</th><th>Rationale</th></tr>
<tr>
<td>Field names = mapped variables</td>
<td><span class="new-tag">NEW</span> Yes</td>
<td>AcroForm field name IS the spec variable (e.g., <code>BorrowerFullName</code>). This is what the bank consumes.</td>
</tr>
<tr>
<td>Registry-based mapping (no AI)</td>
<td><span class="new-tag">NEW</span> Yes</td>
<td>Fuzzy matching against curated registry is deterministic, auditable, and $0. AI is overkill here.</td>
</tr>
<tr>
<td>Multi-spec support</td>
<td><span class="new-tag">NEW</span> Yes</td>
<td>Same form can be mapped against MISMO, XP2 Lending, XP2 MSP, or custom specs.</td>
</tr>
<tr>
<td>Operator-in-the-loop</td>
<td><span class="new-tag">NEW</span> Yes</td>
<td>Auto-mapping suggests; human approves. Critical for financial forms where accuracy is non-negotiable.</td>
</tr>
<tr>
<td>Mapping manifest</td>
<td><span class="new-tag">NEW</span> Yes</td>
<td>JSON/CSV export of every field&rarr;variable mapping for audit trail and regulatory compliance.</td>
</tr>
<tr>
<td>Format preservation</td>
<td>Annotation layer only</td>
<td>Original PDF content never modified. 100% format preservation.</td>
</tr>
<tr>
<td>LiquidOffice scope</td>
<td>Partial replacement</td>
<td>Replace form design + mapping workflow. NOT replacing BPM/routing.</td>
</tr>
</table>
<!-- ───────────────── COMPARISON ───────────────── -->
<h2>LiquidOffice vs. pdfOracle</h2>
<table>
<tr><th>Capability</th><th>LiquidOffice</th><th>pdfOracle</th></tr>
<tr><td>Auto-detect form fields</td><td>No (manual)</td><td class="comparison-better">Yes (heuristic + ML)</td></tr>
<tr><td>Variable mapping</td><td>Manual only</td><td class="comparison-better">Auto-suggest + review</td></tr>
<tr><td>Multi-spec support</td><td>Limited</td><td class="comparison-better">MISMO, XP2, custom YAML</td></tr>
<tr><td>Mapping audit trail</td><td>Limited</td><td class="comparison-better">JSON/CSV manifest</td></tr>
<tr><td>Format preservation</td><td>Yes</td><td class="comparison-same">Yes (annotation layer)</td></tr>
<tr><td>Batch processing</td><td>No</td><td class="comparison-better">Yes (Phase 4)</td></tr>
<tr><td>Learning from corrections</td><td>No</td><td class="comparison-better">Yes (registry feedback)</td></tr>
<tr><td>Per-institution profiles</td><td>No</td><td class="comparison-better">Yes (Phase 4)</td></tr>
<tr><td>License cost</td><td>~&pound;36K/license</td><td class="comparison-better">$0 (open source)</td></tr>
<tr><td>Token / API cost</td><td>N/A</td><td class="comparison-better">$0</td></tr>
<tr><td>BPM / Workflow routing</td><td>Yes</td><td class="comparison-na">Out of scope</td></tr>
<tr><td>Web form publishing</td><td>Yes</td><td class="comparison-na">Out of scope</td></tr>
</table>
<!-- ───────────────── METRICS ───────────────── -->
<h2>Success Criteria</h2>
<div class="grid-2">
<div class="card card-green">
<h3>Performance Targets</h3>
<table>
<tr><td>Format preservation</td><td><strong>100%</strong></td></tr>
<tr><td>Field detection (structured)</td><td>&gt;85% recall, &gt;90% precision</td></tr>
<tr><td>Field detection (with ML)</td><td>&gt;90% recall, &gt;90% precision</td></tr>
<tr><td>Variable mapping (seeded)</td><td>&gt;70% auto-mapped at &gt;80% confidence</td></tr>
<tr><td>Variable mapping (after feedback)</td><td>&gt;90% auto-mapped</td></tr>
<tr><td>Speed</td><td>&lt;2s / page (CPU, heuristic pipeline; Phase 2 ML detection adds ~5&ndash;10s/page on CPU)</td></tr>
<tr><td>Token cost</td><td><strong>$0</strong></td></tr>
</table>
</div>
<div class="card">
<h3>Dependencies</h3>
<pre style="margin:0;border:none;padding:0.5rem;">
pymupdf &gt;= 1.24.0 # PDF engine
click &gt;= 8.0 # CLI
pydantic &gt;= 2.0 # Data models
rapidfuzz &gt;= 3.0 # Fuzzy matching ★
pyyaml &gt;= 6.0 # Mapping registry ★
# Phase 2 (optional)
commonforms &gt;= 0.1 # ML detection
torch &gt;= 2.0 # PyTorch
# Phase 3 (optional)
fastapi &gt;= 0.110 # Web server
uvicorn &gt;= 0.29 # ASGI
# Registry builders
openpyxl &gt;= 3.1 # MISMO LDD Excel ★
lxml &gt;= 5.0 # XP2 WSDL XML ★</pre>
</div>
</div>
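<p>The detection targets above are stated as precision and recall, which can be checked per form once detections are compared against a hand-labeled ground truth. A small worked sketch with invented counts (40 real fields, 38 detections, 35 of them correct):</p>

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical form: 40 real fields; the detector reports 38 boxes, 35 correct.
p, r = precision_recall(true_positives=35, false_positives=3, false_negatives=5)
print(f"precision={p:.3f} recall={r:.3f}")

# Structured-form target from the table: >90% precision, >85% recall.
meets_structured_target = p > 0.90 and r > 0.85
```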
<!-- ───────────────── RISKS ───────────────── -->
<h2>Risks &amp; Mitigations</h2>
<table>
<tr><th>Risk</th><th>Mitigation</th></tr>
<tr><td>XP2 specs are NDA-protected</td><td>Tool imports from WSDL your org already has. We don't distribute specs.</td></tr>
<tr><td>MISMO LDD requires membership</td><td>Start with Fannie Mae/Freddie Mac public ULDD samples. Upgrade later.</td></tr>
<tr><td>Low mapping accuracy on first run</td><td>Seed from existing LiquidOffice exports. Operator review catches gaps.</td></tr>
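<tr><td>PyMuPDF AGPL license</td><td>Acceptable for internal-only use. If the tool is distributed or offered to outside users over a network, AGPL source-sharing obligations apply: buy a commercial PyMuPDF license or swap to pikepdf+ReportLab.</td></tr>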
<tr><td>Scanned PDFs</td><td>Optional Tesseract OCR preprocessing.</td></tr>
<tr><td>Regulatory audit requirements</td><td>Mapping manifest provides full audit trail.</td></tr>
</table>
<!-- ───────────────── SCOPE ───────────────── -->
<h2>Out of Scope</h2>
<div class="card">
<ul>
<li>Does NOT modify or reflow original PDF content</li>
<li>Does NOT replace BPM/workflow routing (keep existing tool for that)</li>
<li>Does NOT distribute Fiserv NDA-protected specs (imports from your copy)</li>
<li>Does NOT handle scanned PDFs unless the optional OCR add-on is enabled</li>
<li>Does NOT require internet or API keys &mdash; fully offline capable</li>
</ul>
</div>
<!-- ───────────────── APPROVAL ───────────────── -->
<div class="approval-box">
<h2>Ready to Build?</h2>
<p style="color: var(--text-muted); margin-bottom: 1rem;">
Plan updated with Fiserv XP2 / MISMO variable mapping and LiquidOffice replacement context.<br>
Phase 1 builds the core engine + mapping. Awaiting your approval.
</p>
<span class="btn btn-approve">Approve &amp; Build Phase 1</span>
<span class="btn btn-revise">Request Changes</span>
</div>
<p style="text-align:center; color:var(--text-muted); font-size:0.8rem; margin-top: 2rem;">
pdfOracle &mdash; Updated Plan v2 &mdash; 2026-03-19
</p>
</body>
</html>
