Created
November 14, 2008 20:13
require 'rubygems'
require 'net/http'
require 'simple-rss'
require 'open-uri'
require 'hpricot'
require 'couchrest'
require 'xmlsimple'

BASE_URI = 'http://www.sec.gov'

# Connect to our document database
db = CouchRest.database!("http://localhost:5984/sec-edgar")

# Fetch the RSS feed of XBRL filings
rss = SimpleRSS.parse open(BASE_URI + '/Archives/edgar/xbrlrss.xml')

# Iterate through each item in the feed
rss.items.each do |item|
  # Load the filing index page that the feed item points to
  doc = Hpricot(open(item[:link]))

  # Select the links whose text names an XML document
  # (the dot is escaped so it matches a literal ".xml", not any character followed by "xml")
  xml_links = doc.search("a").select { |ele| ele.inner_text =~ /\.xml/ }

  # Parse each XML document and store it in the document database
  xml_links.each do |xml_link|
    # hrefs on the index page are site-relative, so prefix the host
    xml_url = BASE_URI + xml_link.attributes["href"]
    xml_data = XmlSimple.xml_in(Net::HTTP.get(URI.parse(xml_url)))
    db.save(xml_data)
  end
end