Created July 10, 2025 02:54
Verbatim PDF to Markdown Conversion using LLM (ChatGPT, Claude etc.)
After multiple attempts at a verbatim PDF-to-Markdown conversion, the following prompt worked well.
Feel free to add or suggest improvements.
`````
**Objective:** Convert the attached PDF to Markdown **verbatim** without summarizing, omitting, or altering any content. Follow this exact workflow:

#### **Phase 1: Document Analysis**
1. **Read Entire PDF**
   - Process all pages sequentially. Do **not** skip pages or sections.
   - Preserve every paragraph, table row, and code snippet - **no exceptions**.
2. **Structure Identification**
   - Map chapters/sections to heading levels (`#` → `###`).
   - Tag special elements with metadata:
     ```markdown
     <!-- [TABLE] 4-column financial data -->
     <!-- [CODE] Unidentified language (lines 12-25) -->
     ```
3. **Noise Filtering**
   - Auto-ignore repeating footers/headers (e.g., "Page 3 of 23").
   - If uncertain, keep content but flag: `<!-- CHECK: Possible footer -->`.

#### **Phase 2: Strict 1:1 Conversion**
1. **Text Formatting**
   - Bold/italics → `**text**`/`*text*`
   - Lists: Maintain exact indentation (even if inconsistent in PDF).
2. **Tables**
   - Convert **all rows** - never truncate.
   - Use pipe syntax with alignment hints:
     ```markdown
     | Header 1 | Header 2 |
     |----------|----------|
     | Row 1 | Data | <!-- Preserve empty cells! -->
     ```
3. **Code Blocks**
   - Minimum 3-line backtick fences with line breaks:
     ````markdown
     ```python
     def example(): # Never join split code lines!
         pass
     ```
     ````
   - For unidentified languages:
     ```
     [UNKNOWN_LANGUAGE]
     fn obscure_code() { ... }
     ```
4. **Images/Figures**
   - Placeholder + filename: ``

#### **Phase 3: Validation**
1. **Anti-Summarization Checks**
   - Compare word count of original PDF paragraphs vs. Markdown output.
   - If >5% discrepancy, revert and flag: `<!-- LENGTH MISMATCH: p.14 paragraph 2 -->`.
2. **Ambiguity Protocol**
   - For unclear formatting:
     ```markdown
     <!-- RAW_PDF_EXTRACT_START -->
     [Strange spacing]
     <!-- RAW_PDF_EXTRACT_END -->
     ```

#### **Phase 4: Delivery**
- Output **raw Markdown only** (no JSON/XML wrappers).
- Include this completion token: `<!-- CONVERSION_COMPLETE_VERBATIM -->`.

**Failure Modes That Void Approval:**
- Any summarized/joined paragraphs
- Truncated tables or code
- Unmarked language guesses
`````
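
The Phase 3 word-count check and the Phase 4 completion token can also be verified outside the LLM, since a model's self-reported validation may not be reliable. The following is a minimal Python sketch, not part of the original prompt. The helper names (`word_count`, `length_mismatch`, `has_completion_token`) are hypothetical, and it assumes you already have the original paragraph text extracted from the PDF by some other tool.

```python
import re

# Completion token requested in Phase 4 of the prompt.
COMPLETION_TOKEN = "<!-- CONVERSION_COMPLETE_VERBATIM -->"

# HTML comments (flags such as <!-- CHECK: ... -->) are not part of the source text.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def word_count(text: str) -> int:
    """Count whitespace-separated words, ignoring HTML comments."""
    return len(COMMENT_RE.sub(" ", text).split())


def length_mismatch(pdf_paragraph: str, md_paragraph: str,
                    tolerance: float = 0.05) -> bool:
    """Return True when the word counts differ by more than `tolerance`.

    The 5% default mirrors the threshold in Phase 3 of the prompt.
    """
    original = word_count(pdf_paragraph)
    converted = word_count(md_paragraph)
    if original == 0:
        return converted != 0
    return abs(original - converted) / original > tolerance


def has_completion_token(markdown: str) -> bool:
    """Check that the output contains the Phase 4 completion token."""
    return COMPLETION_TOKEN in markdown
```

Word counts are only a rough proxy: Markdown list markers and table pipes add tokens that the PDF text does not contain, so a flagged paragraph is a prompt for manual review rather than proof of summarization.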