@hamelsmu
Last active November 26, 2024 06:15
html to markdown
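from html2text import HTML2Text
from textwrap import dedent
import re

def get_md(cts, extractor='h2t'):
    "Convert the HTML string `cts` to Markdown; `extractor` is currently unused."
    h2t = HTML2Text(bodywidth=5000)  # very wide body so lines are effectively not wrapped
    # Ignore link targets and images; mark <pre> blocks with [code]...[/code]
    h2t.ignore_links,h2t.mark_code,h2t.ignore_images = (True,)*3
    res = h2t.handle(cts)
    # Turn each [code]...[/code] span into a fenced block with its indentation removed
    def _f(m): return f'```\n{dedent(m.group(1))}\n```'
    return re.sub(r'\[code]\s*\n(.*?)\n\[/code]', _f, res or '', flags=re.DOTALL).strip()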

hamelsmu commented Nov 26, 2024

Usage

>>> html="<h1>Mark's World</h1><p>Hello World</p>"
>>> get_md(html)
"# Mark's World\n\nHello World"
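
The code-fence step can also be tried on its own, without html2text installed. This is only a sketch: `sample` is a hand-written guess at what html2text emits with `mark_code` enabled, not output captured from a real run, and `fence_code` and `FENCE` are illustrative names that are not part of the gist.

```python
import re
from textwrap import dedent

FENCE = "`" * 3  # three backticks, built here to keep this snippet easy to embed

# Hypothetical input: assumed shape of html2text output with mark_code=True,
# i.e. an indented body wrapped in [code] ... [/code] markers.
sample = "Intro\n\n[code]\n    def f():\n        return 1\n[/code]\n\nOutro"

def fence_code(text):
    "Replace each [code]...[/code] span with a fenced block, dedenting its body."
    def _f(m):
        return f"{FENCE}\n{dedent(m.group(1))}\n{FENCE}"
    # Same pattern as get_md: DOTALL lets .*? span the newlines inside the block
    return re.sub(r'\[code]\s*\n(.*?)\n\[/code]', _f, text, flags=re.DOTALL).strip()

fenced = fence_code(sample)
print(fenced)
```

The non-greedy `(.*?)` keeps two separate `[code]` spans from being merged into one fenced block.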
