@defx
Created October 12, 2022 12:53
Recursive site scraper using Puppeteer
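import puppeteer from "puppeteer-core"

// Deduplicate an array while preserving first-seen order.
const unique = (arr) => [...new Set(arr)]

// The leading semicolon stops this IIFE from being parsed as a call on the expression above.
;(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    channel: "chrome", // use the locally installed Chrome; puppeteer-core ships without a browser
    timeout: 60000,
  })

  try {
    const page = await browser.newPage()

    // Visit each queued URL once, pass its HTML to `callback`, then queue the links found on the page.
    // Resolves to the list of visited URLs.
    async function scrape(links, callback, visited = new Set()) {
      const [url, ...rest] = links
      if (!url) return [...visited]
      if (visited.has(url)) return scrape(rest, callback, visited)
      await page.goto(url, { waitUntil: "networkidle0" })
      const hrefs = await page.$$eval("a[href]", (anchors) =>
        anchors.map((el) => el.href)
      )
      const html = await page.content()
      await callback(url, html) // awaiting also supports async callbacks
      visited.add(url)
      return scrape(unique(rest.concat(hrefs)), callback, visited)
    }

    await scrape(["http://localhost:3000/"], (href, html) => {
      // ...
    })
  } finally {
    // Always shut the browser down so the Node process can exit.
    await browser.close()
  }
})()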
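
The scraper above queues every `href` it finds, so a single external link can lead it off the site, and `mailto:` links or `#fragment` variants of the same page end up in the queue too. One possible way to narrow the crawl is to filter the links before they are queued. The helper below is only a sketch: `sameOriginLinks` is a hypothetical name and not part of the original gist, and it assumes that pages differing only by fragment should count as the same page.

```javascript
// Hypothetical helper (not in the original gist): keep only links that share
// the origin of `baseUrl`, drop #fragments, and remove duplicates.
function sameOriginLinks(hrefs, baseUrl) {
  const origin = new URL(baseUrl).origin
  const result = new Set()
  for (const href of hrefs) {
    let parsed
    try {
      parsed = new URL(href, baseUrl) // also resolves relative links
    } catch {
      continue // skip anything that cannot be parsed as a URL
    }
    if (parsed.origin !== origin) continue // external sites, mailto:, tel:, ...
    parsed.hash = "" // treat /page and /page#section as the same page
    result.add(parsed.href)
  }
  return [...result]
}
```

If this fits your use case, it could replace `unique` in the recursive call, for example `scrape(sameOriginLinks(rest.concat(hrefs), url), callback, visited)`.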