@Und3rf10w
Last active September 26, 2025 05:30
Stress test for your markdown parser. If you have code that works for this, please share.
/**
 * This is my attempt at a solution. Send input text to ``parseInput``
 * - Und3rf10w
 */
export interface ParsedItem {
  type: 'code' | 'text';
  content: string;
  language?: string;
  metadata?: {
    fileName?: string;
    highlightedLines?: string;
  };
}
/**
 * A state-machine-based parser for extracting text and code blocks.
 * Handles nested code blocks and edge cases with fence lengths
 */
export const parseInput = (inputText: string): ParsedItem[] => {
  if (!inputText) return [];
  const result: ParsedItem[] = [];
  const lines = inputText.split('\n');
  let inCodeBlock = false;
  let currentBlockLines: string[] = [];
  let currentInfoString = '';
  let currentOpenFenceLength = 0;
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (inCodeBlock) {
      // We are INSIDE a code block, looking for a closing fence
      const trimmedLine = line.trim();
      const fenceMatch = trimmedLine.match(/^(`{3,})$/);
      if (fenceMatch) {
        const fenceLength = fenceMatch[0].length;
        // Special case: if this fence equals the opening length,
        // look ahead for a longer fence (skipping blank lines)
        if (fenceLength === currentOpenFenceLength) {
          let foundLongerFence = false;
          let lookAheadIndex = i + 1;
          let blankLinesCount = 0;
          const maxBlankLines = 3; // a reasonable limit for new lines between fences
          while (
            lookAheadIndex < lines.length &&
            blankLinesCount <= maxBlankLines
          ) {
            const lookAheadLine = lines[lookAheadIndex].trim();
            if (lookAheadLine === '') {
              blankLinesCount++;
              lookAheadIndex++;
              continue;
            }
            const lookAheadFenceMatch = lookAheadLine.match(/^(`{3,})$/);
            if (
              lookAheadFenceMatch &&
              lookAheadFenceMatch[0].length > fenceLength
            ) {
              foundLongerFence = true;
              break;
            }
            // If we hit non-blank, non-fence content, stop looking
            break;
          }
          if (foundLongerFence) {
            // The current fence is content, not a closer
            currentBlockLines.push(line);
            continue;
          } else {
            // No longer fence found, this IS the closing fence
            const codeContent = currentBlockLines.join('\n');
            const { language, metadata } = parseInfoString(currentInfoString);
            result.push({
              type: 'code',
              content: codeContent,
              language,
              metadata,
            });
            // Reset state
            inCodeBlock = false;
            currentBlockLines = [];
            currentInfoString = '';
            currentOpenFenceLength = 0;
            continue;
          }
        }
        // Handle fences that are longer than the opener
        if (fenceLength > currentOpenFenceLength) {
          // Exit the code block
          const codeContent = currentBlockLines.join('\n');
          const { language, metadata } = parseInfoString(currentInfoString);
          result.push({
            type: 'code',
            content: codeContent,
            language,
            metadata,
          });
          // Reset state
          inCodeBlock = false;
          currentBlockLines = [];
          currentInfoString = '';
          currentOpenFenceLength = 0;
        } else {
          // Fence is too short to close this block
          currentBlockLines.push(line);
        }
      } else {
        currentBlockLines.push(line);
      }
    } else {
      // We are OUTSIDE a code block, looking for an opening fence
      const trimmedLine = line.trim();
      const fenceMatch = trimmedLine.match(/^(`{3,})/);
      if (fenceMatch) {
        // Enter a new code block
        if (currentBlockLines.length > 0) {
          result.push({ type: 'text', content: currentBlockLines.join('\n') });
        }
        inCodeBlock = true;
        currentBlockLines = [];
        currentOpenFenceLength = fenceMatch[0].length;
        currentInfoString = trimmedLine
          .substring(currentOpenFenceLength)
          .trim();
      } else {
        currentBlockLines.push(line);
      }
    }
  }
  // Handle any remaining content
  if (currentBlockLines.length > 0) {
    const remainingContent = currentBlockLines.join('\n');
    if (inCodeBlock) {
      const { language, metadata } = parseInfoString(currentInfoString);
      result.push({
        type: 'code',
        content: remainingContent,
        language,
        metadata,
      });
    } else {
      result.push({ type: 'text', content: remainingContent });
    }
  }
  return result;
};
/**
 * Helper function to parse the info string of a code block.
 * Supports both quoted and unquoted filenames with spaces.
 * Only removes parentheses that look like legacy highlight syntax.
 */
function parseInfoString(infoString: string) {
  const raw = infoString || '';
  // Extract highlighted lines from braces (preferred syntax)
  const braceHighlightMatch = raw.match(/\{([^}]*)\}/);
  const highlightedLines = braceHighlightMatch
    ? braceHighlightMatch[1].trim() || undefined
    : undefined;
  // Remove all brace blocks from further parsing
  let stripped = raw.replace(/\{[^}]*\}/g, ' ').trim();
  // Only remove parentheses that look like legacy highlight syntax
  // (contain numbers, commas, and dashes only)
  // Really this was just for my specific usecase
  stripped = stripped
    .replace(/\(\s*\d+\s*(?:-\s*\d+)?(?:\s*,\s*\d+\s*(?:-\s*\d+)?)*\s*\)/g, ' ')
    .trim();
  // Determine language as the first token (if not a key)
  let language = 'plaintext';
  let remainder = stripped;
  const firstTokenMatch = remainder.match(/^([^\s=]+)/);
  if (firstTokenMatch) {
    const token = firstTokenMatch[1];
    const after = remainder.slice(firstTokenMatch[0].length);
    const isKeyLike = /^\s*=/.test(after);
    if (!isKeyLike) {
      language = token.trim() || 'plaintext';
      remainder = after.trim();
    }
  }
  let fileName: string | undefined;
  // Parse key=value pairs with unquoted value handling
  while (remainder) {
    remainder = remainder.trim();
    // Match key= pattern (case-insensitive keys)
    const keyEqMatch = remainder.match(/^([A-Za-z][A-Za-z0-9_-]*)\s*=\s*/i);
    if (!keyEqMatch) break;
    const keyNorm = keyEqMatch[1].toLowerCase();
    remainder = remainder.slice(keyEqMatch[0].length);
    let val = '';
    // Handle quoted values
    if (remainder.startsWith('"') || remainder.startsWith("'")) {
      const quote = remainder[0];
      let j = 1;
      while (j < remainder.length) {
        const ch = remainder[j];
        if (ch === '\\' && j + 1 < remainder.length) {
          val += remainder[j + 1];
          j += 2;
          continue;
        }
        if (ch === quote) break;
        val += ch;
        j++;
      }
      remainder = remainder.slice(j + 1);
    } else {
      // Handle unquoted values - capture everything until we see another key= pattern
      let j = 0;
      let lastNonWhitespace = -1;
      while (j < remainder.length) {
        const ch = remainder[j];
        // Track last non-whitespace position
        if (!/\s/.test(ch)) {
          lastNonWhitespace = j;
        }
        // Look ahead for another key=value pair
        if (j > 0 && /\s/.test(ch)) {
          const ahead = remainder.slice(j).trim();
          if (/^[A-Za-z][A-Za-z0-9_-]*\s*=/i.test(ahead)) {
            // Found another key, stop here
            j = lastNonWhitespace + 1;
            break;
          }
        }
        j++;
      }
      val = remainder.slice(0, j).trim();
      remainder = remainder.slice(j);
    }
    // Support both 'filename' and 'name' (case-insensitive)
    if (keyNorm === 'filename' || keyNorm === 'name') {
      fileName = val;
    }
  }
  console.debug('parseInfoString result:', {
    input: infoString,
    language,
    metadata: { fileName, highlightedLines },
  });
  return { language, metadata: { fileName, highlightedLines } };
}
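
For comparison with the look-ahead heuristic in `parseInput`, the closing rule for backtick fences in the CommonMark spec is simpler: the closing line contains only backticks, and at least as many as the opening fence. Here is a minimal sketch of that check. `isClosingFence` and `backticks` are hypothetical helpers that are not part of the gist, and the sketch ignores tilde fences and the spec's limit on indentation.

```typescript
// Hypothetical helpers (not part of the gist), shown for comparison only.

// Builds a run of n backticks without writing a literal fence in this sample.
function backticks(n: number): string {
  return new Array(n + 1).join('`');
}

// CommonMark-style closing rule for backtick fences: the line holds nothing but
// backticks, and at least as many as the opening fence. Tilde fences and the
// spec's indentation limit are deliberately ignored in this sketch.
function isClosingFence(line: string, openFenceLength: number): boolean {
  const match = line.trim().match(/^`{3,}$/);
  return match !== null && match[0].length >= openFenceLength;
}

console.log(isClosingFence(backticks(3), 3)); // true: same length closes
console.log(isClosingFence(backticks(3), 4)); // false: shorter than the opener
console.log(isClosingFence(backticks(5), 4)); // true: longer fences also close
console.log(isClosingFence(backticks(3) + 'python', 3)); // false: has an info string
```

Under this rule an equal-length fence always closes the block. The look-ahead in `parseInput` departs from that by treating an equal-length fence as content when a longer fence follows within a few blank lines.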

Example 1: Basic Text and Code

A simple mix of text and a code block.

Input String:

This is the introduction.

```js
console.log("Hello, World!");
```

And this is the conclusion.

**Output:**
```json
[
  {
    "type": "text",
    "content": "This is the introduction.\n"
  },
  {
    "type": "code",
    "content": "console.log(\"Hello, World!\");",
    "language": "js",
    "metadata": {
      "fileName": undefined,
      "highlightedLines": undefined
    }
  },
  {
    "type": "text",
    "content": "\nAnd this is the conclusion."
  }
]
```

#### Example 2: Code Block with Full Metadata

**Input String:**
````markdown
```typescript {1,5-7} fileName="server.ts"
import { createServer } from 'http';

const PORT = 3000;

createServer((req, res) => {
  res.end('Hello!');
}).listen(PORT);
```
````

Output:

[
  {
    "type": "code",
    "content": "import { createServer } from 'http';\n\nconst PORT = 3000;\n\ncreateServer((req, res) => {\n  res.end('Hello!');\n}).listen(PORT);",
    "language": "typescript",
    "metadata": {
      "fileName": "server.ts",
      "highlightedLines": "1,5-7"
    }
  }
]

Example 3: Unclosed Code Block (Edge Case)

The parser should correctly handle input that ends before a code block is closed.

Input String:

Here is a Python snippet:

```python
def greet(name):
  print(f"Hello, {name}")

Output:

[
  {
    "type": "text",
    "content": "Here is a Python snippet:\n"
  },
  {
    "type": "code",
    "content": "def greet(name):\n  print(f\"Hello, {name}\")",
    "language": "python",
    "metadata": {
      "fileName": undefined,
      "highlightedLines": undefined
    }
  }
]

Example 4: Empty Input

Input String:

''

Output:

[]

Empty line break:

Empty file:

Metadata tests

Example 3 should not have broken the parsing of the rest of this file, but almost always does. The following examples can be used to test handling of additional metadata for parsers that support it, e.g. line highlights ({1,14,23-42}) or filename specification (filename="test.py", name=test.py).
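
The parser above keeps the highlight spec as a raw string (`highlightedLines`). If you need the individual line numbers, a spec like `1,14,23-42` can be expanded in a few lines. This is only a minimal sketch: `expandHighlightedLines` is a hypothetical helper that is not part of the gist, and silently skipping malformed parts is an assumption, not required behavior.

```typescript
// Hypothetical helper (not part of the gist): expands a highlight spec such as
// "1,5-7" into a sorted list of unique line numbers.
function expandHighlightedLines(spec: string): number[] {
  const lines: number[] = [];
  for (const part of spec.split(',')) {
    // Accept a single number ("14") or an inclusive range ("23-42").
    const match = part.trim().match(/^(\d+)(?:\s*-\s*(\d+))?$/);
    if (!match) continue; // assumption: malformed parts are skipped, not reported
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Tolerate reversed ranges such as "7-5" by normalizing the bounds.
    for (let n = Math.min(start, end); n <= Math.max(start, end); n++) {
      if (lines.indexOf(n) === -1) lines.push(n);
    }
  }
  return lines.sort((a, b) => a - b);
}

console.log(expandHighlightedLines('1,5-7')); // [ 1, 5, 6, 7 ]
```

Returning plain numbers keeps the result easy to hand to a renderer that marks lines by index.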

Example 5: Highlight

print("hello")

Example 6: multiline highlight

# Testing
print("Hello world")

Example 7: nested test with break between them

# Testing

```python filename=test.py
print("test")

#### Example 8: Nested test without filename and no break

```markdown Name="nested_test_without_nested_filename.md"
```python
print("test")
```

Example 9: test with quote and multiline highlight

# Testing
print("Hello world")

Example 10: Test with parens at end

# This
is
a
test

Example 11: test with parens and multiline highlight

# This

is

a

test

Example 12: Test with multi-multi highlight and spaces in file name and parens and nested code block

# This

is

a

test

```python
print("goodbye")