@nickstenning
Created March 25, 2009 16:30
BBC headline scraper. Requires Hpricot.
require 'rubygems'
require 'hpricot'
require 'open-uri'

BBC_URL = 'http://news.bbc.co.uk'
# CSS selectors for the headline links on the BBC News front page.
SELECTORS = '.ticker_content_anchor , #other-top-stories .story, #third-story .story, #most-popular .story, #featured-site-top-stories---democracy-live .story, #more-from-bbc-news .story, #featured-site-top-stories---bbc-sport .story, #geo-uk-news-digest li .story, #also-in-the-news .story, #second-story .story, #top-story .story'

# On Ruby 3.0+, Kernel#open no longer fetches URLs; use URI.open(BBC_URL) there.
bbc = open(BBC_URL)
links = Hpricot(bbc).search(SELECTORS)

links.each do |link|
  title = link.inner_text.strip
  url = link['href']
  # Skip matches that have no href attribute or no visible text.
  next if url.nil? || url.empty? || title.empty?
  # Relative hrefs get the site root prepended; absolute ones are printed as-is.
  puts title + ": " + (url.index('http://') == 0 ? url : BBC_URL + url)
end
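
The `url.index('http://') == 0` check above only recognises absolute `http://` links, so an `https://` href would get the site root prepended by mistake. A minimal sketch of an alternative, assuming the standard library's `URI.join` is acceptable here; `absolute_url` and `BASE_URL` are hypothetical names (not part of the original script), and the example paths are made up for illustration:

```ruby
require 'uri'

# Same value as BBC_URL in the script; renamed so this sketch stands alone.
BASE_URL = 'http://news.bbc.co.uk'

# Hypothetical helper: resolve an href against the site root.
# URI.join leaves absolute URLs (http or https) untouched and
# resolves relative paths against the base.
def absolute_url(href, base = BASE_URL)
  URI.join(base, href).to_s
end

puts absolute_url('/example/story.stm')        # made-up relative path
puts absolute_url('https://example.com/page')  # absolute URL passes through
```

In the loop, the `puts` line could then become `puts title + ": " + absolute_url(url)`. Note that `URI.join` raises `URI::InvalidURIError` on malformed hrefs, which the original string concatenation would not.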