@jSayal
Created July 10, 2025 02:54
Verbatim PDF to Markdown Conversion using LLM (ChatGPT, Claude etc.)
After multiple attempts at a verbatim conversion of a PDF to a Markdown file, the following prompt worked well.
Feel free to add or suggest improvements.
`````
**Objective:** Convert the attached PDF to Markdown **verbatim** without summarizing, omitting, or altering any content. Follow this exact workflow:
#### **Phase 1: Document Analysis**
1. **Read Entire PDF**
   - Process all pages sequentially. Do **not** skip pages or sections.
   - Preserve every paragraph, table row, and code snippet - **no exceptions**.
2. **Structure Identification**
   - Map chapters/sections to heading levels (`#` → `###`).
   - Tag special elements with metadata:
   ```markdown
   <!-- [TABLE] 4-column financial data -->
   <!-- [CODE] Unidentified language (lines 12-25) -->
   ```
3. **Noise Filtering**
   - Auto-ignore repeating footers/headers (e.g., "Page 3 of 23").
   - If uncertain, keep content but flag: `<!-- CHECK: Possible footer -->`.
#### **Phase 2: Strict 1:1 Conversion**
1. **Text Formatting**
   - Bold/italics → `**text**`/`*text*`
   - Lists: Maintain exact indentation (even if inconsistent in PDF).
2. **Tables**
   - Convert **all rows** - never truncate.
   - Use pipe syntax with alignment hints:
   ```markdown
   | Header 1 | Header 2 |
   |:---------|:---------|
   | Row 1 | Data | <!-- Preserve empty cells! -->
   ```
3. **Code Blocks**
   - Use fences of at least three backticks, each on its own line:
   ````markdown
   ```python
   def example():  # Never join split code lines!
       pass
   ```
   ````
   - For unidentified languages:
   ```
   [UNKNOWN_LANGUAGE]
   fn obscure_code() { ... }
   ```
4. **Images/Figures**
   - Placeholder + filename: `![Fig.3: Architecture Diagram](pdf_image_7.png)`
#### **Phase 3: Validation**
1. **Anti-Summarization Checks**
   - Compare word count of original PDF paragraphs vs. Markdown output.
   - If >5% discrepancy, revert and flag: `<!-- LENGTH MISMATCH: p.14 paragraph 2 -->`.
2. **Ambiguity Protocol**
   - For unclear formatting:
   ```markdown
   <!-- RAW_PDF_EXTRACT_START -->
   [Strange spacing]
   <!-- RAW_PDF_EXTRACT_END -->
   ```
#### **Phase 4: Delivery**
- Output **raw Markdown only** (no JSON/XML wrappers).
- Include this completion token: `<!-- CONVERSION_COMPLETE_VERBATIM -->`.
**Failure Modes That Void Approval:**
- Any summarized/joined paragraphs
- Truncated tables or code
- Unmarked language guesses
`````
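
The Phase 3 and Phase 4 checks in the prompt are carried out by the model itself, so it may be worth repeating them independently on the returned Markdown. The following is only a minimal sketch of that idea, not part of the original prompt. It assumes you have already extracted the PDF paragraphs as plain strings (by whatever tool you prefer) and split the Markdown output into paragraphs in the same order. The names `word_count`, `length_mismatches`, and `is_complete` are hypothetical helpers introduced here for illustration.

```python
import re

# Completion token requested in Phase 4 of the prompt.
COMPLETION_TOKEN = "<!-- CONVERSION_COMPLETE_VERBATIM -->"


def word_count(text: str) -> int:
    """Count whitespace-separated words, ignoring HTML comments such as flags."""
    without_comments = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    return len(without_comments.split())


def length_mismatches(pdf_paragraphs, md_paragraphs, tolerance=0.05):
    """Return indices of paragraphs whose word counts differ by more than `tolerance`.

    Assumes both lists are aligned (paragraph i in the PDF corresponds to
    paragraph i in the Markdown). Unpaired trailing paragraphs are flagged too.
    """
    flagged = []
    for i, (src, out) in enumerate(zip(pdf_paragraphs, md_paragraphs)):
        src_words = word_count(src)
        out_words = word_count(out)
        if src_words == 0:
            # Empty source paragraph: any output text counts as a mismatch.
            if out_words:
                flagged.append(i)
            continue
        if abs(src_words - out_words) / src_words > tolerance:
            flagged.append(i)
    # Paragraphs present on only one side have nothing to compare against.
    shorter = min(len(pdf_paragraphs), len(md_paragraphs))
    longer = max(len(pdf_paragraphs), len(md_paragraphs))
    flagged.extend(range(shorter, longer))
    return flagged


def is_complete(markdown: str) -> bool:
    """Check that the output contains the Phase 4 completion token."""
    return COMPLETION_TOKEN in markdown
```

A word-count comparison only catches summarization or truncation, not reworded text of similar length, so a flagged index is a prompt for manual review rather than proof of an error.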