Extract structured Markdown-formatted conversations from exported LLM chat HTML files (e.g., ChatGPT). Outputs a clean JSON file with user/assistant roles, turns, and content.
- Parses exported chat `.html` files
- Detects roles via visually hidden headings (`You said:`, `ChatGPT said:`)
- Extracts Markdown content per turn
- Outputs structured JSON with metadata and role-based turns
- Logs skipped/empty entries to stderr
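
The role detection, turn numbering, and skip logging listed above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the HTML has already been reduced to `(heading, markdown)` pairs (the parsing step with beautifulsoup4 and markdownify is omitted), and the names `ROLE_BY_HEADING` and `build_conversation` are hypothetical. It also assumes that skipped entries do not consume a turn number and that `turns` counts the entries kept.

```python
import json
import sys
from datetime import datetime, timezone

# Assumed mapping from the hidden heading text to output roles.
ROLE_BY_HEADING = {
    "You said:": "user",
    "ChatGPT said:": "assistant",
}


def build_conversation(pairs, source_file):
    """Turn (heading, markdown) pairs into the documented JSON structure."""
    conversation = []
    for index, (heading, content) in enumerate(pairs):
        role = ROLE_BY_HEADING.get(heading.strip())
        if role is None:
            # Unknown heading: report on stderr and move on.
            print(f"skipped entry {index}: unknown heading {heading!r}", file=sys.stderr)
            continue
        content = content.strip()
        if not content:
            # Empty turn: report on stderr and move on.
            print(f"skipped entry {index}: empty content", file=sys.stderr)
            continue
        # Assumption: turns are numbered consecutively over kept entries.
        conversation.append(
            {"role": role, "turn": len(conversation) + 1, "content": content}
        )
    return {
        "metadata": {
            "converted_at": datetime.now(timezone.utc).isoformat(),
            "source_file": source_file,
            "format": "llm_conversation_markdown_v2",
            "turns": len(conversation),
        },
        "conversation": conversation,
    }


if __name__ == "__main__":
    sample = [
        ("You said:", "Hello!"),
        ("ChatGPT said:", "Hi there! How can I help?"),
        ("ChatGPT said:", "   "),  # whitespace only, so it is skipped and logged
    ]
    print(json.dumps(build_conversation(sample, "chat.html"), indent=2))
```

Keeping the heading-to-role mapping in one dictionary would make it straightforward to support exports from other chat tools, provided their headings are known.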
- Python 3.8+
- markdownify
- beautifulsoup4
Install dependencies:
    pip install markdownify beautifulsoup4

Run the script:

    python extract_llm_conversation.py path/to/chat.html

Outputs `path/to/chat_conversation.json` with the format:
{
  "metadata": {
    "converted_at": "...",
    "source_file": "chat.html",
    "format": "llm_conversation_markdown_v2",
    "turns": 66
  },
  "conversation": [
    {
      "role": "user",
      "turn": 1,
      "content": "Hello!"
    },
    {
      "role": "assistant",
      "turn": 2,
      "content": "Hi there! How can I help?"
    }
    ...
  ]
}

CC-BY © Christian Prior-Mamulyan
Contact: [email protected]