Last active
March 19, 2026 19:15
pdfOracle - Static PDF to Fillable Form Converter - Project Plan
| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>pdfOracle - Updated Project Plan</title> | |
| <style> | |
| :root { | |
| --bg: #0d1117; | |
| --surface: #161b22; | |
| --border: #30363d; | |
| --text: #e6edf3; | |
| --text-muted: #8b949e; | |
| --accent: #58a6ff; | |
| --accent-green: #3fb950; | |
| --accent-orange: #d29922; | |
| --accent-red: #f85149; | |
| --accent-purple: #bc8cff; | |
| --code-bg: #1c2128; | |
| } | |
| * { margin: 0; padding: 0; box-sizing: border-box; } | |
| body { | |
| font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; | |
| background: var(--bg); | |
| color: var(--text); | |
| line-height: 1.6; | |
| padding: 2rem; | |
| max-width: 1100px; | |
| margin: 0 auto; | |
| } | |
| h1 { | |
| font-size: 2.2rem; | |
| margin-bottom: 0.25rem; | |
| background: linear-gradient(135deg, var(--accent), var(--accent-purple)); | |
| -webkit-background-clip: text; | |
| -webkit-text-fill-color: transparent; | |
| } | |
| .subtitle { | |
| color: var(--text-muted); | |
| font-size: 1.1rem; | |
| margin-bottom: 2rem; | |
| border-bottom: 1px solid var(--border); | |
| padding-bottom: 1.5rem; | |
| } | |
| h2 { | |
| font-size: 1.5rem; | |
| margin: 2.5rem 0 1rem 0; | |
| color: var(--accent); | |
| display: flex; | |
| align-items: center; | |
| gap: 0.5rem; | |
| flex-wrap: wrap; | |
| } | |
| h2 .badge { | |
| font-size: 0.7rem; | |
| padding: 2px 8px; | |
| border-radius: 12px; | |
| font-weight: 600; | |
| text-transform: uppercase; | |
| letter-spacing: 0.5px; | |
| } | |
| .badge-green { background: rgba(63,185,80,0.15); color: var(--accent-green); } | |
| .badge-orange { background: rgba(210,153,34,0.15); color: var(--accent-orange); } | |
| .badge-purple { background: rgba(188,140,255,0.15); color: var(--accent-purple); } | |
| .badge-red { background: rgba(248,81,73,0.15); color: var(--accent-red); } | |
| .badge-new { background: rgba(248,81,73,0.2); color: var(--accent-red); font-weight: 800; } | |
| h3 { | |
| font-size: 1.15rem; | |
| margin: 1.5rem 0 0.75rem 0; | |
| color: var(--text); | |
| } | |
| p, li { color: var(--text); margin-bottom: 0.5rem; } | |
| .card { | |
| background: var(--surface); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1.5rem; | |
| margin: 1rem 0; | |
| } | |
| .card-highlight { border-left: 3px solid var(--accent); } | |
| .card-green { border-left: 3px solid var(--accent-green); } | |
| .card-orange { border-left: 3px solid var(--accent-orange); } | |
| .card-red { border-left: 3px solid var(--accent-red); } | |
| table { | |
| width: 100%; | |
| border-collapse: collapse; | |
| margin: 1rem 0; | |
| font-size: 0.9rem; | |
| } | |
| th { | |
| background: var(--code-bg); | |
| color: var(--accent); | |
| text-align: left; | |
| padding: 10px 14px; | |
| border: 1px solid var(--border); | |
| font-weight: 600; | |
| } | |
| td { | |
| padding: 10px 14px; | |
| border: 1px solid var(--border); | |
| vertical-align: top; | |
| } | |
| tr:nth-child(even) td { background: rgba(22,27,34,0.5); } | |
| code { | |
| background: var(--code-bg); | |
| padding: 2px 6px; | |
| border-radius: 4px; | |
| font-family: 'SF Mono', 'Fira Code', monospace; | |
| font-size: 0.85em; | |
| color: var(--accent-orange); | |
| } | |
| pre { | |
| background: var(--code-bg); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1.25rem; | |
| overflow-x: auto; | |
| font-family: 'SF Mono', 'Fira Code', monospace; | |
| font-size: 0.85rem; | |
| line-height: 1.5; | |
| margin: 1rem 0; | |
| color: var(--text-muted); | |
| } | |
| .architecture-diagram { | |
| background: var(--code-bg); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1.5rem; | |
| margin: 1.5rem 0; | |
| overflow-x: auto; | |
| text-align: center; | |
| } | |
| .flow-box { | |
| display: inline-block; | |
| background: var(--surface); | |
| border: 2px solid var(--accent); | |
| border-radius: 8px; | |
| padding: 12px 20px; | |
| margin: 6px; | |
| font-weight: 600; | |
| font-size: 0.9rem; | |
| } | |
| .flow-box-secondary { | |
| border-color: var(--accent-purple); | |
| font-size: 0.85rem; | |
| } | |
| .flow-box-new { | |
| border-color: var(--accent-red); | |
| box-shadow: 0 0 12px rgba(248,81,73,0.2); | |
| } | |
| .flow-arrow { | |
| display: block; | |
| color: var(--text-muted); | |
| font-size: 1.5rem; | |
| margin: 4px 0; | |
| } | |
| .flow-row { | |
| display: flex; | |
| justify-content: center; | |
| gap: 12px; | |
| flex-wrap: wrap; | |
| margin: 6px 0; | |
| } | |
| .phase-header { | |
| display: flex; | |
| justify-content: space-between; | |
| align-items: baseline; | |
| flex-wrap: wrap; | |
| gap: 0.5rem; | |
| } | |
| .timeline { | |
| color: var(--text-muted); | |
| font-size: 0.85rem; | |
| font-style: italic; | |
| } | |
| .file-tree { | |
| font-family: 'SF Mono', 'Fira Code', monospace; | |
| font-size: 0.85rem; | |
| line-height: 1.8; | |
| color: var(--text-muted); | |
| } | |
| .file-tree .dir { color: var(--accent); font-weight: 600; } | |
| .file-tree .file { color: var(--text); } | |
| .file-tree .file-new { color: var(--accent-red); font-weight: 600; } | |
| .file-tree .comment { color: var(--text-muted); font-style: italic; } | |
| ul { padding-left: 1.5rem; margin: 0.5rem 0; } | |
| li { margin-bottom: 0.35rem; } | |
| .grid-2 { | |
| display: grid; | |
| grid-template-columns: 1fr 1fr; | |
| gap: 1rem; | |
| } | |
| @media (max-width: 768px) { | |
| .grid-2 { grid-template-columns: 1fr; } | |
| body { padding: 1rem; } | |
| h1 { font-size: 1.6rem; } | |
| } | |
| .cost-tag { | |
| display: inline-block; | |
| background: rgba(63,185,80,0.15); | |
| color: var(--accent-green); | |
| padding: 2px 10px; | |
| border-radius: 12px; | |
| font-size: 0.8rem; | |
| font-weight: 700; | |
| } | |
| .new-tag { | |
| display: inline-block; | |
| background: rgba(248,81,73,0.2); | |
| color: var(--accent-red); | |
| padding: 1px 6px; | |
| border-radius: 4px; | |
| font-size: 0.7rem; | |
| font-weight: 700; | |
| text-transform: uppercase; | |
| letter-spacing: 0.5px; | |
| vertical-align: middle; | |
| margin-left: 4px; | |
| } | |
| .approval-box { | |
| background: linear-gradient(135deg, rgba(88,166,255,0.1), rgba(188,140,255,0.1)); | |
| border: 2px solid var(--accent); | |
| border-radius: 12px; | |
| padding: 2rem; | |
| text-align: center; | |
| margin: 3rem 0 2rem 0; | |
| } | |
| .approval-box h2 { | |
| justify-content: center; | |
| margin-top: 0; | |
| } | |
| .btn { | |
| display: inline-block; | |
| padding: 10px 28px; | |
| border-radius: 8px; | |
| font-weight: 600; | |
| font-size: 1rem; | |
| text-decoration: none; | |
| margin: 0.5rem; | |
| cursor: default; | |
| } | |
| .btn-approve { background: var(--accent-green); color: #000; } | |
| .btn-revise { background: var(--border); color: var(--text); } | |
| .tag-row { display: flex; gap: 0.5rem; flex-wrap: wrap; margin: 0.5rem 0; } | |
| .tag { | |
| font-size: 0.75rem; | |
| padding: 2px 8px; | |
| border-radius: 4px; | |
| background: var(--code-bg); | |
| border: 1px solid var(--border); | |
| color: var(--text-muted); | |
| } | |
| .comparison-better { color: var(--accent-green); font-weight: 600; } | |
| .comparison-same { color: var(--text-muted); } | |
| .comparison-na { color: var(--text-muted); font-style: italic; } | |
| .workflow-box { | |
| background: var(--code-bg); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1rem 1.5rem; | |
| margin: 0.75rem 0; | |
| font-family: 'SF Mono', 'Fira Code', monospace; | |
| font-size: 0.85rem; | |
| color: var(--text-muted); | |
| } | |
| .workflow-box strong { color: var(--text); } | |
| .workflow-box .highlight { color: var(--accent-red); font-weight: 600; } | |
| .context-grid { | |
| display: grid; | |
| grid-template-columns: 1fr 1fr 1fr; | |
| gap: 1rem; | |
| margin: 1rem 0; | |
| } | |
| @media (max-width: 900px) { | |
| .context-grid { grid-template-columns: 1fr; } | |
| } | |
| .context-card { | |
| background: var(--surface); | |
| border: 1px solid var(--border); | |
| border-radius: 8px; | |
| padding: 1.25rem; | |
| } | |
| .context-card h4 { | |
| color: var(--accent); | |
| margin-bottom: 0.5rem; | |
| font-size: 0.95rem; | |
| } | |
| .context-card p { | |
| font-size: 0.85rem; | |
| color: var(--text-muted); | |
| } | |
| .mapping-flow { | |
| display: flex; | |
| align-items: center; | |
| justify-content: center; | |
| gap: 8px; | |
| flex-wrap: wrap; | |
| margin: 1rem 0; | |
| padding: 1rem; | |
| background: var(--code-bg); | |
| border-radius: 8px; | |
| border: 1px solid var(--border); | |
| } | |
| .mapping-flow .step { | |
| background: var(--surface); | |
| border: 1px solid var(--border); | |
| border-radius: 6px; | |
| padding: 8px 14px; | |
| font-size: 0.8rem; | |
| text-align: center; | |
| } | |
| .mapping-flow .arrow { | |
| color: var(--text-muted); | |
| font-size: 1.2rem; | |
| } | |
| .mapping-flow .step-highlight { | |
| border-color: var(--accent-red); | |
| background: rgba(248,81,73,0.08); | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <h1>pdfOracle</h1> | |
| <p class="subtitle"> | |
| Static PDF → Fillable Form Converter <strong>with Fiserv XP2 / MISMO Variable Mapping</strong><br> | |
| <small>In-House LiquidOffice Replacement — Open Source, Zero Token Cost</small> | |
| </p> | |
| <!-- ───────────────── OVERVIEW ───────────────── --> | |
| <div class="card card-highlight"> | |
| <h3>What This Does</h3> | |
| <p>Replaces the <strong>LiquidOffice Form Designer</strong> workflow: takes static PDFs from banks, auto-detects fillable fields, <strong>maps each field to Fiserv XP2 / MISMO standard variables</strong>, and returns the mapped fillable PDF + audit manifest to the institution.</p> | |
| <div class="tag-row"> | |
| <span class="tag">Field Auto-Detection</span> | |
| <span class="tag">XP2 Lending Mapping</span> | |
| <span class="tag">XP2 MSP Mapping</span> | |
| <span class="tag">MISMO Standard</span> | |
| <span class="tag">Operator Review UI</span> | |
| <span class="tag">Mapping Manifest Export</span> | |
| <span class="tag">100% Format Preservation</span> | |
| <span class="cost-tag">$0 Token Cost</span> | |
| </div> | |
| </div> | |
| <!-- ───────────────── WORKFLOW COMPARISON ───────────────── --> | |
| <h2>Workflow: Current vs. Target</h2> | |
| <div class="workflow-box"> | |
| <strong>Current (LiquidOffice):</strong><br> | |
| Bank sends PDF → LiquidOffice Form Designer → <strong>Manual</strong> field creation → <strong>Manual</strong> variable mapping (XP2/MISMO) → Return fillable PDF | |
| </div> | |
| <div class="workflow-box" style="border-color: var(--accent-green);"> | |
| <strong>Target (pdfOracle):</strong><br> | |
| Bank sends PDF → pdfOracle <span class="highlight">auto-detects</span> fields → <span class="highlight">auto-suggests</span> variable mappings → Operator <strong>reviews/approves</strong> → Export fillable PDF + mapping manifest → Return to bank | |
| </div> | |
| <!-- ───────────────── INDUSTRY CONTEXT ───────────────── --> | |
| <h2>Industry Context <span class="badge badge-new">Updated</span></h2> | |
| <div class="context-grid"> | |
| <div class="context-card"> | |
| <h4>Fiserv XP2</h4> | |
| <p>Credit union core banking platform (.NET + DB2). Field specs in WSDL (SOAP/XML) — <strong>NDA-protected</strong>, provided to licensed customers. Your org should have the XP2 XML guide with all data elements.</p> | |
| </div> | |
| <div class="context-card"> | |
| <h4>MISMO Standard</h4> | |
| <p>Canonical industry standard for mortgage/lending field naming. <strong>UpperCamelCase</strong> convention: <code>BorrowerFullName</code>, <code>LoanAmount</code>. Suffixes: <code>Indicator</code>=bool, <code>Amount</code>=money, <code>Date</code>=date, <code>Type</code>=enum.</p> | |
| </div> | |
| <div class="context-card"> | |
| <h4>LiquidOffice (Replacing)</h4> | |
| <p>OpenText product, still active (v25.2) but with an aging UI. ~£36K/license. We replace the <strong>form design + variable mapping</strong> workflow only, NOT the BPM/routing.</p> | |
| </div> | |
| </div> | |
| <!-- ───────────────── ARCHITECTURE ───────────────── --> | |
| <h2>Architecture <span class="badge badge-new">Updated</span></h2> | |
| <div class="architecture-diagram"> | |
| <div class="flow-box">Bank Sends Static PDF</div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-box" style="border-color: var(--accent-green);">1. PDF Parser (PyMuPDF)<br><small style="color:var(--text-muted)">Text spans, lines, rects, fonts, coordinates</small></div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-row"> | |
| <div class="flow-box flow-box-secondary">Underline<br>Detector</div> | |
| <div class="flow-box flow-box-secondary">Box / Rect<br>Detector</div> | |
| <div class="flow-box flow-box-secondary">Placeholder<br>Text Detector</div> | |
| <div class="flow-box flow-box-secondary">Table Grid<br>Detector</div> | |
| </div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-box" style="border-color: var(--accent-orange);">3. Field Classifier + Label Associator<br><small style="color:var(--text-muted)">Type: text | checkbox | date | signature | number</small></div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-box flow-box-new"> | |
| 4. Variable Mapping Engine<span class="new-tag">NEW</span><br> | |
| <small style="color:var(--text-muted)"> | |
| "Applicant Name" → <code style="font-size:0.8em">BorrowerFullName</code> (MISMO)<br> | |
| "Date" → <code style="font-size:0.8em">ApplicationReceivedDate</code> (MISMO)<br> | |
| "Rs.___" → <code style="font-size:0.8em">DepositAmount</code> (XP2)<br> | |
| Registry + fuzzy matching — no AI tokens | |
| </small> | |
| </div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-box flow-box-new"> | |
| 5. Operator Review<span class="new-tag">NEW</span><br> | |
| <small style="color:var(--text-muted)">Approve / edit each mapping • Confidence color-coding</small> | |
| </div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-box" style="border-color: var(--accent-green);">6. AcroForm Widget Writer (PyMuPDF)<br><small style="color:var(--text-muted)">field_name = mapped variable • Annotation layer only</small></div> | |
| <span class="flow-arrow">↓</span> | |
| <div class="flow-row"> | |
| <div class="flow-box">Fillable PDF<br><small style="color:var(--text-muted)">Original format preserved<br>Fields named with spec vars</small></div> | |
| <div class="flow-box flow-box-new">Mapping Manifest<span class="new-tag">NEW</span><br><small style="color:var(--text-muted)">JSON/CSV audit trail<br>field → variable → spec</small></div> | |
| </div> | |
| </div> | |
| <!-- ───────────────── MAPPING ENGINE ───────────────── --> | |
| <h2>Variable Mapping Engine <span class="badge badge-red">Core Differentiator</span></h2> | |
| <p>The key value-add isn't just "make fillable" — it's the <strong>field-to-variable mapping layer</strong> against financial specs. No AI required.</p> | |
| <div class="mapping-flow"> | |
| <div class="step">PDF Label<br><code>"please insert name"</code></div> | |
| <span class="arrow">→</span> | |
| <div class="step">Normalize<br><code>"applicant name"</code></div> | |
| <span class="arrow">→</span> | |
| <div class="step step-highlight">Fuzzy Match<br>against registry</div> | |
| <span class="arrow">→</span> | |
| <div class="step">Spec Variable<br><code>BorrowerFullName</code></div> | |
| <span class="arrow">→</span> | |
| <div class="step">Operator<br>Review</div> | |
| </div> | |
| <div class="card card-red"> | |
| <h3>How Mapping Works (Zero AI)</h3> | |
| <table> | |
| <tr><th>Step</th><th>Method</th><th>Example</th></tr> | |
| <tr><td>1. Exact match</td><td>Label = known alias verbatim</td><td>"borrower name" → <code>BorrowerFullName</code></td></tr> | |
| <tr><td>2. Alias match</td><td>Normalized label matches alias</td><td>"Name of Applicant" → <code>BorrowerFullName</code></td></tr> | |
| <tr><td>3. Fuzzy match</td><td><code>rapidfuzz</code> token_sort_ratio >80%</td><td>"Applicant's Full Name" → <code>BorrowerFullName</code></td></tr> | |
| <tr><td>4. Keyword + type</td><td>Label keywords + field type</td><td>"date" + type=date → <code>ApplicationReceivedDate</code></td></tr> | |
| <tr><td>5. Context clue</td><td>Surrounding page text</td><td>"business of ___" → <code>NatureOfBusinessDescription</code></td></tr> | |
| </table> | |
| </div> | |
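| <p>The first three steps of the table can be sketched as a simple cascade. This is a minimal illustration, not the planned <code>mapper.py</code>: the <code>ALIASES</code> dictionary, <code>normalize</code>, and <code>map_label</code> are hypothetical names, and the standard-library <code>difflib</code> stands in for <code>rapidfuzz</code>'s <code>token_sort_ratio</code> so the snippet runs without extra dependencies. Scores from the two libraries will differ slightly.</p> | |

```python
import re
from difflib import SequenceMatcher

# Hypothetical in-memory registry: alias -> spec variable (mirrors the YAML aliases).
ALIASES = {
    "applicant name": "BorrowerFullName",
    "borrower name": "BorrowerFullName",
    "full name": "BorrowerFullName",
    "application date": "ApplicationReceivedDate",
    "deposit amount": "DepositAmount",
}


def normalize(label: str) -> str:
    """Lowercase, drop possessives and punctuation, collapse whitespace."""
    label = label.lower().replace("'s", "")
    label = re.sub(r"[^a-z0-9 ]+", " ", label)
    return " ".join(label.split())


def token_sort(label: str) -> str:
    """Sort tokens so word order does not affect the comparison."""
    return " ".join(sorted(label.split()))


def similarity(a: str, b: str) -> float:
    """Stand-in for rapidfuzz's token_sort_ratio, scaled 0-100 (assumption)."""
    return 100.0 * SequenceMatcher(None, token_sort(a), token_sort(b)).ratio()


def map_label(label: str, threshold: float = 80.0):
    """Return (variable, method, score), or (None, 'unmapped', best_score)."""
    if label in ALIASES:  # step 1: verbatim alias
        return ALIASES[label], "exact", 100.0
    norm = normalize(label)
    if norm in ALIASES:  # step 2: alias after normalization
        return ALIASES[norm], "alias", 100.0
    # step 3: best fuzzy candidate, accepted only above the threshold
    best_alias = max(ALIASES, key=lambda alias: similarity(norm, alias))
    score = similarity(norm, best_alias)
    if score > threshold:
        return ALIASES[best_alias], "fuzzy", score
    return None, "unmapped", score
```

| <p>Returning the method and score alongside the variable lets the review UI color-code each suggestion and lets the manifest record how a mapping was reached. Steps 4 and 5 (keyword + type, context clue) are omitted from this sketch.</p> | |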
| <h3>Mapping Registry Format</h3> | |
| <pre> | |
| # mappings/mismo_lending.yaml | |
| spec: MISMO | |
| version: "3.4" | |
| mappings: | |
| - variable: BorrowerFullName | |
| type: text | |
| container: "PARTY > INDIVIDUAL > NAME" | |
| aliases: | |
| - "applicant name" | |
| - "borrower name" | |
| - "please insert name" | |
| - "full name" | |
| - variable: ApplicationReceivedDate | |
| type: date | |
| container: "LOAN > LOAN_DETAIL" | |
| aliases: | |
| - "date" | |
| - "application date" | |
| - "please insert date" | |
| - variable: BusinessName | |
| type: text | |
| container: "PARTY > LEGAL_ENTITY" | |
| aliases: | |
| - "business name" | |
| - "company name" | |
| - "M/s" | |
| - variable: DepositAmount | |
| type: number | |
| container: "ASSET > ASSET_DETAIL" | |
| aliases: | |
| - "depositing a sum" | |
| - "deposit amount" | |
| - "Rs." | |
| </pre> | |
| <h3>What You Need to Provide</h3> | |
| <div class="card card-orange"> | |
| <table> | |
| <tr><th>Item</th><th>Why</th><th>Format</th></tr> | |
| <tr><td>XP2 WSDL file</td><td>Extract all XP2 field names and types</td><td>XML</td></tr> | |
| <tr><td>XP2 XML guide</td><td>Field descriptions and business rules</td><td>PDF/doc</td></tr> | |
| <tr><td>MISMO LDD workbook (if available)</td><td>Standard field dictionary</td><td>Excel</td></tr> | |
| <tr><td>Existing LiquidOffice form exports</td><td>Seed registry with proven label→variable pairs</td><td>XML/CSV</td></tr> | |
| <tr><td>Any existing mapping spreadsheets</td><td>Common mappings your team already uses</td><td>Excel/CSV</td></tr> | |
| </table> | |
| </div> | |
| <!-- ───────────────── TECH STACK ───────────────── --> | |
| <h2>Tech Stack</h2> | |
| <table> | |
| <tr><th>Component</th><th>Choice</th><th>License</th><th>Why</th></tr> | |
| <tr><td>Language</td><td><code>Python 3.11+</code></td><td>—</td><td>Best PDF library ecosystem</td></tr> | |
| <tr><td>PDF Read/Write</td><td><code>PyMuPDF (fitz)</code></td><td>AGPL</td><td>Single lib for extraction + widget writing</td></tr> | |
| <tr><td>Fuzzy Matching</td><td><code>rapidfuzz</code><span class="new-tag">NEW</span></td><td>MIT</td><td>Label → variable fuzzy matching</td></tr> | |
| <tr><td>Mapping Registry</td><td><code>YAML + PyYAML</code><span class="new-tag">NEW</span></td><td>MIT</td><td>Human-editable, versionable mapping files</td></tr> | |
| <tr><td>ML Detection</td><td><code>CommonForms/FFDNet</code></td><td>MIT</td><td>YOLO11-based detector; reported to outperform Adobe's automatic form-field detection; runs locally</td></tr> | |
| <tr><td>Web Framework</td><td><code>FastAPI</code></td><td>MIT</td><td>Async file upload, auto API docs</td></tr> | |
| <tr><td>Frontend</td><td><code>Vanilla HTML/JS + pdf.js</code></td><td>Apache 2.0</td><td>PDF preview + mapping review UI</td></tr> | |
| <tr><td>WSDL Parser</td><td><code>lxml</code><span class="new-tag">NEW</span></td><td>BSD</td><td>Import XP2 field specs from WSDL</td></tr> | |
| <tr><td>Packaging</td><td><code>Docker</code></td><td>—</td><td>Reproducible deployment</td></tr> | |
| </table> | |
| <!-- ───────────────── PHASES ───────────────── --> | |
| <!-- PHASE 1 --> | |
| <h2>Phase 1 — Core Engine + Variable Mapping <span class="badge badge-green">No AI</span> <span class="badge badge-new">Updated</span></h2> | |
| <div class="phase-header"> | |
| <p><strong>Goal:</strong> CLI + library that detects fields AND maps them to spec variables</p> | |
| <span class="timeline">Week 1–2</span> | |
| </div> | |
| <div class="card"> | |
| <h3>Project Structure</h3> | |
| <div class="file-tree"> | |
| <span class="dir">pdfOracle/</span><br> | |
| <span class="file">pyproject.toml</span> <span class="file">Dockerfile</span><br> | |
| <span class="dir">src/pdforacle/</span><br> | |
| <span class="file">cli.py</span> <span class="comment"># CLI entry point</span><br> | |
| <span class="file">parser.py</span> <span class="comment"># PDF extraction</span><br> | |
| <span class="file">detector.py</span> <span class="comment"># Field detection rules</span><br> | |
| <span class="file">classifier.py</span> <span class="comment"># Field type classification</span><br> | |
| <span class="file">labeler.py</span> <span class="comment"># Label-to-field association</span><br> | |
| <span class="file-new">mapper.py</span> <span class="comment"># ★ Variable mapping engine</span><br> | |
| <span class="file-new">exporter.py</span> <span class="comment"># ★ Mapping manifest export</span><br> | |
| <span class="file">writer.py</span> <span class="comment"># AcroForm widget writer</span><br> | |
| <span class="file">models.py</span> <span class="comment"># Data classes</span><br> | |
| <span class="dir" style="color:var(--accent-red)">mappings/</span> <span class="comment"># ★ Variable mapping registries</span><br> | |
| <span class="file-new">mismo_lending.yaml</span> <span class="comment"># MISMO standard</span><br> | |
| <span class="file-new">mismo_servicing.yaml</span><br> | |
| <span class="file-new">xp2_lending.yaml</span> <span class="comment"># From your WSDL</span><br> | |
| <span class="file-new">xp2_msp.yaml</span><br> | |
| <span class="dir" style="color:var(--accent-red)">custom/</span> <span class="comment"># Per-institution overrides</span><br> | |
| <span class="dir">tests/</span><br> | |
| <span class="file">test_parser.py</span> / <span class="file">test_detector.py</span> / <span class="file-new">test_mapper.py</span> / <span class="file">test_writer.py</span><br> | |
| </div> | |
| </div> | |
| <h3>Detection Rules (unchanged)</h3> | |
| <table> | |
| <tr><th>Visual Pattern</th><th>Detection Rule</th><th>Field Type</th></tr> | |
| <tr><td>Horizontal line near text</td><td>Line width &gt;50pt, height &lt;2pt</td><td>Text Input</td></tr> | |
| <tr><td>Consecutive <code>______</code> chars</td><td>3+ underscore characters</td><td>Text Input</td></tr> | |
| <tr><td>Empty rectangle</td><td>No fill, aspect >3:1</td><td>Text Input</td></tr> | |
| <tr><td>Small square (8–14pt)</td><td>Aspect ~1:1, width 8–14pt</td><td>Checkbox</td></tr> | |
| <tr><td>Table grid cells</td><td>Line intersections</td><td>Text Inputs</td></tr> | |
| <tr><td><code>(Please insert...)</code></td><td>Parenthetical hint</td><td>Label hint</td></tr> | |
| </table> | |
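| <p>The geometric rules above translate fairly directly into code. The following is a rough sketch under stated assumptions: <code>Primitive</code>, <code>classify_primitive</code>, and <code>has_blank</code> are hypothetical names rather than the planned <code>detector.py</code> API, and the 0.8 to 1.25 aspect band is one possible reading of "~1:1". Table-grid and parenthetical-hint rules are not covered.</p> | |

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Primitive:
    """A drawn shape from the page, sized in PDF points (hypothetical model)."""
    width: float
    height: float
    filled: bool = False


def classify_primitive(p: Primitive) -> Optional[str]:
    """Apply the geometric rules from the table; None means 'not a field'."""
    # Horizontal rule: long and very thin.
    if p.width > 50 and p.height < 2:
        return "text"
    # Small, roughly square box: checkbox (aspect band is an assumption).
    if 8 <= p.width <= 14 and p.height > 0 and 0.8 <= p.width / p.height <= 1.25:
        return "checkbox"
    # Wide, unfilled rectangle: text input.
    if not p.filled and p.height > 0 and p.width / p.height > 3:
        return "text"
    return None


UNDERSCORE_RUN = re.compile(r"_{3,}")


def has_blank(text: str) -> bool:
    """True when a text span contains a run of three or more underscores."""
    return UNDERSCORE_RUN.search(text) is not None
```

| <p>Checking the thin-line rule first keeps zero-height strokes away from the aspect-ratio division used by the later rules.</p> | |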
| <h3>CLI Usage <span class="badge badge-new">Updated</span></h3> | |
| <pre> | |
| # Detect + map + convert (MISMO spec) | |
| pdforacle convert input.pdf -o output.pdf --spec mismo | |
| # Preview detected fields + proposed mappings | |
| pdforacle convert input.pdf --preview --spec xp2_lending | |
| # Export mapping manifest only | |
| pdforacle map input.pdf --spec mismo --format json -o manifest.json | |
| # Use custom institution mappings | |
| pdforacle convert input.pdf --spec custom/example_bank | |
| # List available specs | |
| pdforacle specs list | |
| </pre> | |
| <h3>Mapping Manifest Output <span class="new-tag">NEW</span></h3> | |
| <pre> | |
| { | |
| "source_pdf": "LCC_Account_Opening_Application.pdf", | |
| "spec": "mismo", | |
| "spec_version": "3.4", | |
| "mapped_fields": [ | |
| { | |
| "pdf_field_name": "BorrowerFullName", | |
| "label": "please insert name", | |
| "variable": "BorrowerFullName", | |
| "container": "PARTY > INDIVIDUAL > NAME", | |
| "field_type": "text", | |
| "page": 1, | |
| "mapping_confidence": 0.92 | |
| } | |
| ], | |
| "mapping_stats": { | |
| "total_fields": 8, | |
| "auto_mapped": 6, | |
| "needs_review": 2 | |
| } | |
| } | |
| </pre> | |
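| <p>As an illustration of how the <code>mapping_stats</code> block could be derived, the sketch below assembles a manifest from a list of field dictionaries. The function name <code>build_manifest</code> and the 0.80 review cut-off are assumptions (the cut-off is borrowed from the "&gt;80% confidence" success target); the plan does not specify either.</p> | |

```python
import json

REVIEW_THRESHOLD = 0.80  # assumed cut-off, taken from the ">80% confidence" target


def build_manifest(source_pdf, spec, spec_version, fields):
    """Assemble a manifest dict; unmapped or low-confidence fields need review."""
    needs_review = [
        f for f in fields
        if f.get("variable") is None
        or f.get("mapping_confidence", 0.0) <= REVIEW_THRESHOLD
    ]
    return {
        "source_pdf": source_pdf,
        "spec": spec,
        "spec_version": spec_version,
        # Only fields that received a proposed variable are listed as mapped.
        "mapped_fields": [f for f in fields if f.get("variable") is not None],
        "mapping_stats": {
            "total_fields": len(fields),
            "auto_mapped": len(fields) - len(needs_review),
            "needs_review": len(needs_review),
        },
    }


if __name__ == "__main__":
    demo = build_manifest(
        "LCC_Account_Opening_Application.pdf", "mismo", "3.4",
        [{"variable": "BorrowerFullName", "mapping_confidence": 0.92}],
    )
    print(json.dumps(demo, indent=2))
```

| <p>Deriving the counts from the field list, rather than tracking them separately, keeps the stats consistent with the entries an auditor actually sees.</p> | |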
| <!-- PHASE 2 --> | |
| <h2>Phase 2 — ML-Assisted Detection <span class="badge badge-orange">Local Model • Still $0</span></h2> | |
| <div class="phase-header"> | |
| <p><strong>Goal:</strong> Boost detection accuracy on diverse/messy forms</p> | |
| <span class="timeline">Week 3</span> | |
| </div> | |
| <div class="card card-orange"> | |
| <ul> | |
| <li><code>ml_detector.py</code> — wraps CommonForms/FFDNet (YOLO11)</li> | |
| <li><code>ensemble.py</code> — combines heuristic + ML, snaps to PDF primitives</li> | |
| <li>Runs <strong>100% locally</strong> (CPU ~5–10 s/page, GPU &lt;1 s/page)</li> | |
| </ul> | |
| <table> | |
| <tr><th>Scenario</th><th>Confidence</th><th>Action</th></tr> | |
| <tr><td>Heuristic + ML agree</td><td style="color:var(--accent-green)">High</td><td>Use directly</td></tr> | |
| <tr><td>Only one detects</td><td style="color:var(--accent-orange)">Medium</td><td>Use, flag for review</td></tr> | |
| <tr><td>They disagree</td><td style="color:var(--accent-red)">Low</td><td>Flag for manual review</td></tr> | |
| </table> | |
| </div> | |
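| <p>One way to realize the agreement table is to pair heuristic and ML detections by bounding-box overlap. The sketch below is an assumption-heavy illustration, not the planned <code>ensemble.py</code>: <code>Box</code>, <code>iou</code>, and <code>ensemble</code> are hypothetical names, the 0.5 overlap threshold is arbitrary, and "disagree" is interpreted here as two overlapping detections with different field types.</p> | |

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned field candidate in page coordinates (hypothetical model)."""
    x0: float
    y0: float
    x1: float
    y1: float
    field_type: str


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union


def ensemble(heuristic, ml, min_iou=0.5):
    """Pair detections by overlap and label each with a confidence tier."""
    results, matched = [], set()
    for h in heuristic:
        best = max(ml, key=lambda m: iou(h, m), default=None)
        if best is not None and iou(h, best) >= min_iou:
            matched.add(best)
            # Both detectors found the field: agree on type -> high, else low.
            tier = "high" if best.field_type == h.field_type else "low"
            results.append((h, tier))
        else:
            results.append((h, "medium"))  # heuristic only
    # ML detections that no heuristic box claimed.
    results += [(m, "medium") for m in ml if m not in matched]
    return results
```

| <p>When both detectors fire, this sketch keeps the heuristic box, since its geometry comes from the PDF's own drawing primitives; that is one simple reading of "snaps to PDF primitives".</p> | |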
| <!-- PHASE 3 --> | |
| <h2>Phase 3 — Web UI + Mapping Review <span class="badge badge-purple">Operator Workflow</span> <span class="badge badge-new">Updated</span></h2> | |
| <div class="phase-header"> | |
| <p><strong>Goal:</strong> Browser-based interface with mapping review/approval workflow</p> | |
| <span class="timeline">Week 4</span> | |
| </div> | |
| <div class="card"> | |
| <h3>API Endpoints</h3> | |
| <table> | |
| <tr><th>Method</th><th>Endpoint</th><th>Description</th></tr> | |
| <tr><td><code>POST</code></td><td>/api/upload</td><td>Upload PDF, returns job ID</td></tr> | |
| <tr><td><code>GET</code></td><td>/api/detect/{id}</td><td>Detected fields as JSON</td></tr> | |
| <tr><td><code>GET</code></td><td>/api/mappings/{id}</td><td><span class="new-tag">NEW</span> Proposed variable mappings</td></tr> | |
| <tr><td><code>PUT</code></td><td>/api/mappings/{id}</td><td><span class="new-tag">NEW</span> Operator approves/edits mappings</td></tr> | |
| <tr><td><code>POST</code></td><td>/api/convert/{id}</td><td>Generate fillable PDF with mapped vars</td></tr> | |
| <tr><td><code>GET</code></td><td>/api/download/{id}</td><td>Download fillable PDF</td></tr> | |
| <tr><td><code>GET</code></td><td>/api/manifest/{id}</td><td><span class="new-tag">NEW</span> Download mapping manifest</td></tr> | |
| <tr><td><code>GET</code></td><td>/api/specs</td><td><span class="new-tag">NEW</span> List available mapping specs</td></tr> | |
| </table> | |
| <h3>Mapping Review UI <span class="new-tag">NEW</span></h3> | |
| <ul> | |
| <li>PDF preview with detected fields highlighted</li> | |
| <li><strong>Mapping review panel</strong> for each field: | |
| <ul> | |
| <li>Field label (from PDF) → Proposed variable (from spec)</li> | |
| <li>Confidence score (color: green / yellow / red)</li> | |
| <li>Dropdown to select alternative mappings</li> | |
| <li>Search box to find any spec variable manually</li> | |
| </ul> | |
| </li> | |
| <li><strong>Spec selector:</strong> MISMO, XP2 Lending, XP2 MSP, custom</li> | |
| <li>Approve all / approve individually</li> | |
| <li>Download fillable PDF + mapping manifest</li> | |
| <li>Corrections saved back to registry for future improvement</li> | |
| </ul> | |
| </div> | |
| <!-- PHASE 4 --> | |
| <h2>Phase 4 — Batch Processing + Learning <span class="badge badge-purple">Scale</span> <span class="badge badge-new">Updated</span></h2> | |
| <div class="phase-header"> | |
| <p><strong>Goal:</strong> Process multiple PDFs, learn from operator corrections</p> | |
| <span class="timeline">Week 5</span> | |
| </div> | |
| <div class="card"> | |
| <ul> | |
| <li>Batch upload (<code>POST /api/batch</code>)</li> | |
| <li>Async queue + webhook callbacks</li> | |
| <li>API key auth for external integrations</li> | |
| <li><span class="new-tag">NEW</span> <strong>Mapping memory:</strong> operator corrections feed back into registry</li> | |
| <li><span class="new-tag">NEW</span> <strong>Institution profiles:</strong> per-bank mapping preferences</li> | |
| <li>Processing stats + mapping accuracy dashboard</li> | |
| </ul> | |
| </div> | |
| <!-- ───────────────── DECISIONS ───────────────── --> | |
| <h2>Key Design Decisions</h2> | |
| <table> | |
| <tr><th>Decision</th><th>Choice</th><th>Rationale</th></tr> | |
| <tr> | |
| <td>Field names = mapped variables</td> | |
| <td><span class="new-tag">NEW</span> Yes</td> | |
| <td>AcroForm field name IS the spec variable (e.g., <code>BorrowerFullName</code>). This is what the bank consumes.</td> | |
| </tr> | |
| <tr> | |
| <td>Registry-based mapping (no AI)</td> | |
| <td><span class="new-tag">NEW</span> Yes</td> | |
| <td>Fuzzy matching against a curated registry is deterministic, auditable, and incurs no token cost; an AI model is unnecessary for this step.</td> | |
| </tr> | |
| <tr> | |
| <td>Multi-spec support</td> | |
| <td><span class="new-tag">NEW</span> Yes</td> | |
| <td>Same form can be mapped against MISMO, XP2 Lending, XP2 MSP, or custom specs.</td> | |
| </tr> | |
| <tr> | |
| <td>Operator-in-the-loop</td> | |
| <td><span class="new-tag">NEW</span> Yes</td> | |
| <td>Auto-mapping suggests; human approves. Critical for financial forms where accuracy is non-negotiable.</td> | |
| </tr> | |
| <tr> | |
| <td>Mapping manifest</td> | |
| <td><span class="new-tag">NEW</span> Yes</td> | |
| <td>JSON/CSV export of every field→variable mapping for audit trail and regulatory compliance.</td> | |
| </tr> | |
| <tr> | |
| <td>Format preservation</td> | |
| <td>Annotation layer only</td> | |
| <td>Original PDF content is never modified; form widgets are added on the annotation layer only. 100% format preservation.</td> | |
| </tr> | |
| <tr> | |
| <td>LiquidOffice scope</td> | |
| <td>Partial replacement</td> | |
| <td>Replace form design + mapping workflow. NOT replacing BPM/routing.</td> | |
| </tr> | |
| </table> | |
| <!-- ───────────────── COMPARISON ───────────────── --> | |
| <h2>LiquidOffice vs. pdfOracle</h2> | |
| <table> | |
| <tr><th>Capability</th><th>LiquidOffice</th><th>pdfOracle</th></tr> | |
| <tr><td>Auto-detect form fields</td><td>No (manual)</td><td class="comparison-better">Yes (heuristic + ML)</td></tr> | |
| <tr><td>Variable mapping</td><td>Manual only</td><td class="comparison-better">Auto-suggest + review</td></tr> | |
| <tr><td>Multi-spec support</td><td>Limited</td><td class="comparison-better">MISMO, XP2, custom YAML</td></tr> | |
| <tr><td>Mapping audit trail</td><td>Limited</td><td class="comparison-better">JSON/CSV manifest</td></tr> | |
| <tr><td>Format preservation</td><td>Yes</td><td class="comparison-same">Yes (annotation layer)</td></tr> | |
| <tr><td>Batch processing</td><td>No</td><td class="comparison-better">Yes (Phase 4)</td></tr> | |
| <tr><td>Learning from corrections</td><td>No</td><td class="comparison-better">Yes (registry feedback)</td></tr> | |
| <tr><td>Per-institution profiles</td><td>No</td><td class="comparison-better">Yes (Phase 4)</td></tr> | |
| <tr><td>License cost</td><td>~£36K/license</td><td class="comparison-better">$0 (open source)</td></tr> | |
| <tr><td>Token / API cost</td><td>N/A</td><td class="comparison-better">$0</td></tr> | |
| <tr><td>BPM / Workflow routing</td><td>Yes</td><td class="comparison-na">Out of scope</td></tr> | |
| <tr><td>Web form publishing</td><td>Yes</td><td class="comparison-na">Out of scope</td></tr> | |
| </table> | |
| <!-- ───────────────── METRICS ───────────────── --> | |
| <h2>Success Criteria</h2> | |
| <div class="grid-2"> | |
| <div class="card card-green"> | |
| <h3>Performance Targets</h3> | |
| <table> | |
| <tr><td>Format preservation</td><td><strong>100%</strong></td></tr> | |
| <tr><td>Field detection (structured)</td><td>>85% recall, >90% precision</td></tr> | |
| <tr><td>Field detection (with ML)</td><td>>90% recall, >90% precision</td></tr> | |
| <tr><td>Variable mapping (seeded)</td><td>>70% auto-mapped at >80% confidence</td></tr> | |
| <tr><td>Variable mapping (after feedback)</td><td>>90% auto-mapped</td></tr> | |
| <tr><td>Speed (heuristic pipeline)</td><td>&lt;2s / page (CPU)</td></tr> | |
| <tr><td>Token cost</td><td><strong>$0</strong></td></tr> | |
| </table> | |
| </div> | |
| <div class="card"> | |
| <h3>Dependencies</h3> | |
| <pre style="margin:0;border:none;padding:0.5rem;"> | |
| pymupdf >= 1.24.0 # PDF engine | |
| click >= 8.0 # CLI | |
| pydantic >= 2.0 # Data models | |
| rapidfuzz >= 3.0 # Fuzzy matching ★ | |
| pyyaml >= 6.0 # Mapping registry ★ | |
| # Phase 2 (optional) | |
| commonforms >= 0.1 # ML detection | |
| torch >= 2.0 # PyTorch | |
| # Phase 3 (optional) | |
| fastapi >= 0.110 # Web server | |
| uvicorn >= 0.29 # ASGI | |
| # Registry builders | |
| openpyxl >= 3.1 # MISMO LDD Excel ★ | |
| lxml >= 5.0 # XP2 WSDL XML ★</pre> | |
| </div> | |
| </div> | |
| <!-- ───────────────── RISKS ───────────────── --> | |
| <h2>Risks & Mitigations</h2> | |
| <table> | |
| <tr><th>Risk</th><th>Mitigation</th></tr> | |
| <tr><td>XP2 specs are NDA-protected</td><td>Tool imports from WSDL your org already has. We don't distribute specs.</td></tr> | |
| <tr><td>MISMO LDD requires membership</td><td>Start with Fannie Mae/Freddie Mac public ULDD samples. Upgrade later.</td></tr> | |
| <tr><td>Low mapping accuracy on first run</td><td>Seed from existing LiquidOffice exports. Operator review catches gaps.</td></tr> | |
| <tr><td>PyMuPDF AGPL license</td><td>Fine for internal tools. Swap to pikepdf+ReportLab if distributing.</td></tr> | |
| <tr><td>Scanned PDFs</td><td>Optional Tesseract OCR preprocessing.</td></tr> | |
| <tr><td>Regulatory audit requirements</td><td>Mapping manifest provides full audit trail.</td></tr> | |
| </table> | |
| <!-- ───────────────── SCOPE ───────────────── --> | |
| <h2>Out of Scope</h2> | |
| <div class="card"> | |
| <ul> | |
| <li>Does NOT modify or reflow original PDF content</li> | |
| <li>Does NOT replace BPM/workflow routing (keep existing tool for that)</li> | |
| <li>Does NOT distribute Fiserv NDA-protected specs (imports from your copy)</li> | |
| <li>Does NOT handle scanned PDFs unless the optional OCR add-on is enabled</li> | |
| <li>Does NOT require internet access or third-party API keys; fully offline capable</li> | |
| </ul> | |
| </div> | |
| <!-- ───────────────── APPROVAL ───────────────── --> | |
| <div class="approval-box"> | |
| <h2>Ready to Build?</h2> | |
| <p style="color: var(--text-muted); margin-bottom: 1rem;"> | |
| Plan updated with Fiserv XP2 / MISMO variable mapping and LiquidOffice replacement context.<br> | |
| Phase 1 builds the core engine + mapping. Awaiting your approval. | |
| </p> | |
| <span class="btn btn-approve">Approve & Build Phase 1</span> | |
| <span class="btn btn-revise">Request Changes</span> | |
| </div> | |
| <p style="text-align:center; color:var(--text-muted); font-size:0.8rem; margin-top: 2rem;"> | |
| pdfOracle — Updated Plan v2 — 2026-03-19 | |
| </p> | |
| </body> | |
| </html> |