@mbreese
Last active November 4, 2025 20:24
Convert a website to Markdown

Many modern websites are generated or loaded dynamically with JavaScript. If you want to download or scrape them, or use them with an LLM or MCP server, you need to load them in a full web browser. You can't just curl the page and use that HTML, because you'll miss most or all of the content that is rendered client-side.
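
To make the point concrete, here is a minimal sketch that uses only Python's standard library. The `RAW_HTML` string is a made-up stand-in for what curl might return from a JavaScript-rendered page (the `app` id and `bundle.js` name are hypothetical), and `visible_text` is an illustrative helper, not part of the script in this gist.

```python
from html.parser import HTMLParser

# Hypothetical example of the raw HTML a JavaScript-rendered page might serve.
# The real content would only appear after a browser runs bundle.js.
RAW_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script src="bundle.js"></script>
  </body>
</html>
"""


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-visible text found in a static HTML string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(repr(visible_text(RAW_HTML)))  # prints ''
```

The static HTML contains an empty container and a script tag, so there is no text to convert until a browser has executed the page's scripts.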

You can use the Dockerfile and script below to load a page in headless Chromium and convert it to Markdown.

FROM debian:13
RUN apt update && \
apt install -y chromium-driver python3-pip python3-venv && \
mkdir /app && \
cd /app && \
python3 -m venv venv && \
venv/bin/pip install selenium markdownify && \
useradd user
USER user
COPY web_fetch.py /app
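# ENTRYPOINT (rather than CMD) so the URL can be passed as an argument to `docker run`
ENTRYPOINT ["/app/venv/bin/python3", "/app/web_fetch.py"]

#!/app/venv/bin/python3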
import sys
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from markdownify import markdownify as md
# Path to your ChromeDriver executable
service = Service('/usr/bin/chromedriver')
options = Options()
options.add_argument("--no-sandbox")
options.add_argument('--disable-dev-shm-usage') # needed to allow non-root exec
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=service, options=options)
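# The URL is the first command-line argument; default to http:// if no scheme is given
if len(sys.argv) < 2:
    sys.exit('usage: web_fetch.py URL')
url = sys.argv[1]
if not url.startswith(('http://', 'https://')):
    url = f'http://{url}'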
driver.get(url)
# convert hrefs to absolute paths
driver.execute_script("""
    const links = document.body.querySelectorAll('a');
    links.forEach(link => {
        const abshref = link.href;
        link.setAttribute('href', abshref);
    });
""")
body_element = driver.find_element(By.TAG_NAME, "body")
print(md(body_element.get_attribute("innerHTML")))
driver.quit()
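
The link-rewriting step in the script relies on the browser: reading `link.href` returns the URL already resolved against the page's address. As a rough illustration of the same resolution rule outside the browser, the sketch below uses `urllib.parse.urljoin` from the standard library. The `absolute_links` helper and the example URLs are hypothetical and not part of the script above. Unlike a browser, this sketch does not honor a `<base>` element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:
            # urljoin leaves absolute URLs alone and resolves relative ones
            self.links.append(urljoin(self.base_url, href))


def absolute_links(html, base_url):
    """Return all link targets in html as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = (
    '<a href="/docs">Docs</a> '
    '<a href="img/logo.png">Logo</a> '
    '<a href="https://other.example/x">External</a>'
)
for link in absolute_links(page, "https://example.com/blog/post.html"):
    print(link)
# https://example.com/docs
# https://example.com/blog/img/logo.png
# https://other.example/x
```

Rewriting the links before conversion matters because the Markdown output no longer carries the page's address, so relative links would otherwise point nowhere.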