@eric-burel
Created November 20, 2025 09:06
// npm install turndown cheerio
import TurndownService from "turndown"
import { load } from "cheerio"

/**
 * Download a URL and return its content as markdown.
 * HTML responses are converted with Turndown; plain text and markdown
 * responses are returned unchanged.
 * @param url page to download
 * @param selector optional CSS selector; when set, only the inner HTML
 *   of the first matching element is converted
 */
async function downloadUrlAsMarkdown(url: string, selector?: string) {
  try {
    const res = await fetch(url)
    if (!res.ok) {
      throw new Error(`${url} ${res.status} ${res.statusText}`)
    }
    // keep only the media type, dropping parameters such as "; charset=utf-8"
    const mimeType = res.headers.get("content-type")?.split(";")[0]?.trim().toLowerCase()
    if (!mimeType) {
      throw new Error(`${url} unknown mime-type`)
    }
    if (!["text/html", "text/plain", "text/markdown"].includes(mimeType)) {
      throw new Error(`${url} not text or HTML: '${mimeType}'`)
    }
    let body = await res.text()
    let md = body
    // convert html to markdown if needed
    if (mimeType === "text/html") {
      if (selector) {
        const $ = load(body)
        const element = $(selector).html()
        if (!element) {
          throw new Error(`${url} selector ${selector} matched no content`)
        }
        body = element
      }
      // NOTE: parsing Vercel AI SDK and Mastra's docs
      // doesn't actually work super well (titles are lost)...
      // but it's good enough for our RAG
      const turndownService = new TurndownService()
      // drop elements that only add noise to the markdown output
      turndownService.remove(["script", "nav"])
      md = turndownService.turndown(body)
    }
    return md
  } catch (err) {
    // log, then rethrow so the caller can still handle the failure
    console.error(err)
    throw err
  }
}
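
The content-type check inside `downloadUrlAsMarkdown` can be pulled out into small pure helpers, which makes it testable without any network call. This is only a sketch: `SUPPORTED_MIME_TYPES`, `parseMimeType` and `isSupportedMimeType` are hypothetical names that do not appear in the gist, and the `trim()` call is an added assumption to tolerate stray whitespace in the header.

```typescript
// Hypothetical helpers (not part of the original gist) mirroring
// the content-type handling of downloadUrlAsMarkdown.
const SUPPORTED_MIME_TYPES = ["text/html", "text/plain", "text/markdown"] as const
type SupportedMimeType = (typeof SUPPORTED_MIME_TYPES)[number]

// Extract the bare media type from a Content-Type header value,
// e.g. "text/html; charset=utf-8" becomes "text/html".
// Returns undefined when the header is missing or empty.
function parseMimeType(contentType: string | null): string | undefined {
  const mimeType = contentType?.split(";")[0]?.trim().toLowerCase()
  return mimeType ? mimeType : undefined
}

// Type guard: true only for the media types the function accepts.
function isSupportedMimeType(mimeType: string): mimeType is SupportedMimeType {
  return (SUPPORTED_MIME_TYPES as readonly string[]).includes(mimeType)
}
```

Inside the function, the header value would come from `res.headers.get("content-type")`, which returns `null` when the header is absent, hence the `string | null` parameter type.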