Created August 26, 2017 09:02
Scrapes the main page of HackerNews and returns an array of objects using Puppeteer and Cheerio
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com');

  // Grab the rendered HTML and hand it to cheerio for parsing.
  const content = await page.content();
  const $ = cheerio.load(content);

  // Each story's title link sits immediately before its span.comhead.
  $('span.comhead').each(function (i, element) {
    const a = $(this).prev();
    // The row text starts with the rank ("1. ..."), so parseInt picks it up.
    const rank = a.parent().parent().text();
    const title = a.text();
    const url = a.attr('href');
    // The .subtext cell lives in the row after the title row.
    const subtext = a.parent().parent().next().children('.subtext').children();
    const points = subtext.eq(0).text();
    const username = subtext.eq(1).text();
    // eq(2) is the story's age, not the comment count; this assumes the
    // comments link is the last child of .subtext.
    const comments = subtext.last().text();
    const metadata = {
      rank: parseInt(rank, 10),
      title: title,
      url: url,
      points: parseInt(points, 10),
      username: username,
      comments: parseInt(comments, 10)
    };
    console.log(metadata);
  });

  // Wait for the browser to shut down before the function resolves.
  await browser.close();
}

run().catch(console.error);
Very nice! The tradeoff for speed is obviously the browser-like behaviour which won't trigger captcha. :)
Amazing! Thanks for writing and sharing!
The `comments` field is confused with the time the story was posted.
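The mix-up comes from picking the `.subtext` children by position. One way around it is to find the comment count by its text instead. Below is a minimal sketch, assuming the text of each `.subtext` child has already been collected into an array (for example with cheerio's `.map(...).get()`). `extractComments` is a hypothetical helper, not part of the gist, and the "N comments" and "discuss" labels are assumptions about the Hacker News markup.

```javascript
// Hypothetical helper: find the comment count among the texts of the
// .subtext children, regardless of where the comments link sits.
function extractComments(texts) {
  for (const raw of texts) {
    // Normalize non-breaking spaces, which may separate the number and label.
    const text = raw.replace(/\u00a0/g, ' ').trim();
    const match = /^(\d+)\s+comments?$/.exec(text);
    if (match) {
      return parseInt(match[1], 10);
    }
    if (text === 'discuss') {
      return 0; // assumed label for a story with no comments yet
    }
  }
  return null; // no comments link found
}
```

In the gist's loop this could be called as `extractComments(subtext.map((i, el) => $(el).text()).get())`.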
Cool! I think doing it with the request module or request-promise and, of course, cheerio is faster.
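For comparison, a browserless version might look roughly like the sketch below. It swaps in Node's built-in `fetch` (Node 18+) for the `request` module, which has since been deprecated, and it assumes the front page is served as static HTML that needs no JavaScript rendering. `scrapeTitles` is a hypothetical name, and the selector logic is copied from the gist.

```javascript
// Sketch: fetch the HTML directly (no headless browser) and parse it with cheerio.
// Assumes Node 18+ for the global fetch function.
async function scrapeTitles(url = 'https://news.ycombinator.com') {
  const cheerio = require('cheerio'); // same parser the gist uses
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  const $ = cheerio.load(await response.text());
  const stories = [];
  // Same selector strategy as the gist: the title link precedes span.comhead.
  $('span.comhead').each(function () {
    const a = $(this).prev();
    stories.push({ title: a.text(), url: a.attr('href') });
  });
  return stories;
}
```

Calling `scrapeTitles().then(console.log).catch(console.error);` would print the collected stories. Skipping the browser avoids launching Chromium, which is where the speed difference would come from.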