@alcabanillas-engh
Created November 14, 2008 20:13
require 'rubygems'
require 'simple-rss'
require 'open-uri'
require 'net/http' # Net::HTTP.get is called directly below
require 'hpricot'
require 'couchrest'
require 'xmlsimple'
BASE_URI = 'http://www.sec.gov'
# Connect to our document database (created if it does not exist yet)
db = CouchRest.database!("http://localhost:5984/sec-edgar")
# Fetch and parse the RSS feed
rss = SimpleRSS.parse open(BASE_URI + '/Archives/edgar/xbrlrss.xml')
# Iterate through each item in the feed
rss.items.each do |item|
  # Load the page linked from the feed item
  doc = Hpricot(open(item[:link]))
  # Select the links whose text names an XML document (the dot is escaped so it matches literally)
  xml_links = doc.search("a").select { |ele| ele.inner_text =~ /\.xml/ }
  # Parse each XML document and store it in the document database
  xml_links.each do |xml_link|
    xml_url = BASE_URI + xml_link.attributes["href"]
    xml_data = XmlSimple.xml_in(Net::HTTP.get(URI.parse(xml_url)))
    db.save(xml_data)
  end
end