Created
January 8, 2014 05:04
quick Nokogiri script to pull headlines out of news articles
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# Takes the URL of a news article (e.g. on HuffPost or Yahoo News) and prints its headline.
url = ARGV[0]
abort "Usage: ruby headlines.rb URL" if url.nil?
# Parse as HTML rather than XML: real-world news pages are rarely well-formed XML.
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+.
resource = Nokogiri::HTML(URI.open(url))
# A news page should have exactly one h1, and that h1 should be the headline,
# but that isn't guaranteed, so print every h1 found.
resource.search("h1").each do |h1|
  puts h1.text.strip # text of the h1 only; any nested markup is dropped
end
# NOTE: This was written from memory and is untested; there may be bugs.
# USAGE:
# From a console in the directory containing headlines.rb:
# ruby headlines.rb http://news.yahoo.com/record-freeze-extends-eastern-united-states-least-nine-004335490--sector.html
# Quote URLs that contain ? or & so the shell does not interpret them:
# ruby headlines.rb "http://www.huffingtonpost.com/azeem-khan/heres-why-the-nyc-bitcoin_b_4551792.html?utm_hp_ref=technology&ir=Technology"