@ndrut
Last active December 20, 2015 08:18
Cheerio and Request, used to scrape posts from a vBulletin forum.
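// Dependencies. `url`, `headers`, and `couch` (a CouchDB client handle) are
// not defined in this snippet and are assumed to be set up elsewhere.
var request = require('request');
var cheerio = require('cheerio');
var moment = require('moment');

request({ url: url, headers: headers, jar: true }, function (err, resp, body) {
  if (err) throw err;
  var $ = cheerio.load(body);

  // Each post is rendered as a table whose id starts with "post".
  $('table[id^="post"]').each(function (i, elem) {
    var $post = $(elem);
    // The numeric post id is the trailing run of digits in the table id.
    var id = $post.attr('id').match(/[0-9]+$/)[0];
    var author = $post.find('div[id^="postmenu_"]').find('a.bigusername').text().trim();
    var subject = $post.find('td[id^="td_post_"] div.smallfont strong').text().trim();
    var postbody = $post.find('td[id^="td_post_"] div[id^="post_message_"]').text().trim();
    // The date is the last tab-separated chunk of the post header.
    var strdate = $post.find('div.normal').text().trim().split("\t").pop();

    // Replace relative day names with explicit MM-DD-YYYY dates.
    if (strdate.indexOf('Yesterday') > -1) {
      var yesterday = moment().subtract(1, 'days').format("MM-DD-YYYY");
      strdate = strdate.replace('Yesterday', yesterday);
    }
    if (strdate.indexOf('Today') > -1) {
      var today = moment().format("MM-DD-YYYY");
      strdate = strdate.replace('Today', today);
    }

    var post = {
      'id': id,
      'url': url,
      // The thread id appears in the URL as "t" followed by digits.
      'threadid': url.match(/t[0-9]+/)[0],
      'author': author,
      'subject': subject,
      'postbody': postbody,
      'strdate': strdate,
      // Assumes the forum shows dates as "MM-DD-YYYY, hh:mm A";
      // adjust the format string if the board is configured differently.
      'unix': moment(strdate, 'MM-DD-YYYY, hh:mm A').unix(),
      'date_added': moment().unix()
    };

    couch.save(id, post, function (err, res) {
      if (err) throw err;
      console.log(res);
    });
  });
});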
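
The "Today" / "Yesterday" handling above can also be pulled out into a small pure function, which makes it testable without hitting the forum. The following is only a sketch under stated assumptions: `pad`, `formatMDY`, and `normalizeRelativeDate` are hypothetical names that do not appear in the gist, and plain `Date` is used in place of moment so the block has no dependencies.

```javascript
// Hypothetical helpers, not part of the original gist.

// Left-pad a number to two digits, e.g. 7 -> "07".
function pad(n) {
  return (n < 10 ? '0' : '') + n;
}

// Format a Date as MM-DD-YYYY in local time, matching the format used above.
function formatMDY(d) {
  return pad(d.getMonth() + 1) + '-' + pad(d.getDate()) + '-' + d.getFullYear();
}

// Replace the words "Yesterday" / "Today" in a forum date string with
// explicit dates. `now` is injectable so the function can be tested.
function normalizeRelativeDate(strdate, now) {
  now = now || new Date();
  // Date normalizes day 0 to the last day of the previous month.
  var yesterday = new Date(now.getFullYear(), now.getMonth(), now.getDate() - 1);
  return strdate
    .replace('Yesterday', formatMDY(yesterday))
    .replace('Today', formatMDY(now));
}
```

Inside the `.each` callback this would be called as `strdate = normalizeRelativeDate(strdate);`. Strings that contain neither word pass through unchanged.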