This gist contains a Python tool for converting flat HTML documents, especially those with nested content indicated only by header levels (like `<h1>`, `<h2>`, etc.), into a structured JSON object that reflects the implicit hierarchy.
Originally inspired by this Stack Overflow question from @psychicesp, this script is useful for:
- Web scraping content-heavy pages (like menus, outlines, legal docs)
- Making the structure implied by heading levels explicit
- Preserving content hierarchy when working with markdown-to-HTML output
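
To make the idea concrete, here is a minimal sketch of how heading levels can be turned into a nested structure. It is not the gist's actual code: it assumes only the standard library's `html.parser`, and the names `HeadingTreeParser` and `html_to_tree`, as well as the `title`/`level`/`content`/`children` keys, are hypothetical choices made for illustration.

```python
import json
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingTreeParser(HTMLParser):
    """Hypothetical helper: nests text under the most recent heading."""

    def __init__(self):
        super().__init__()
        # A synthetic root at level 0 sits above every real heading.
        self.root = {"title": None, "level": 0, "content": [], "children": []}
        self.stack = [self.root]  # path from the root to the current section
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            level = int(tag[1])
            # Close every open section at the same or a deeper level.
            while self.stack[-1]["level"] >= level:
                self.stack.pop()
            node = {"title": "", "level": level, "content": [], "children": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self.in_heading = False
            self.stack[-1]["title"] = self.stack[-1]["title"].strip()

    def handle_data(self, data):
        if self.in_heading:
            # Heading text may arrive in several pieces (e.g. inline tags).
            self.stack[-1]["title"] += data
        elif data.strip():
            # Any other text belongs to the innermost open section.
            self.stack[-1]["content"].append(data.strip())


def html_to_tree(html):
    """Parse flat HTML and return the nested section tree as a dict."""
    parser = HeadingTreeParser()
    parser.feed(html)
    parser.close()
    return parser.root


if __name__ == "__main__":
    sample = (
        "<h1>Menu</h1><p>Open daily</p>"
        "<h2>Starters</h2><p>Soup</p>"
        "<h2>Mains</h2><p>Pasta</p>"
        "<h1>Contact</h1><p>Call us</p>"
    )
    print(json.dumps(html_to_tree(sample), indent=2))
```

The stack holds the chain of currently open sections, so each new heading only has to pop back to its nearest shallower ancestor before attaching itself. This sketch keeps only text content and ignores attributes and non-heading tags; a fuller tool would likely want to preserve more of the original markup.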