Created
September 8, 2014 18:24
#Web Content Scrapers:
##Recommendation:
- open-uri - http://ruby-doc.org/stdlib-2.1.0/libdoc/open-uri/rdoc/OpenURI.html - to fetch the page
to be used along with:
- Nokogiri - http://nokogiri.org/ - to parse the HTML
##Notes:
- In the Ruby Toolbox, the top two are Anemone and Pismo, but they are intended for getting metadata from websites, not necessarily the HTML content.
- Nokogiri can be a pain to install, but most of us should have already crossed that bridge with our other in-class projects.
- To help spot the CSS selector(s) that you want to grab, use http://selectorgadget.com/. There is a quick 1.5-minute video that explains exactly how to use it.
- The recommended best practice is to keep the controllers light and put the scraping task into a model.
##Other Resources:
- http://railscasts.com/episodes/190-screen-scraping-with-nokogiri - Great video!
- http://ruby.bastardsbook.com/chapters/html-parsing/
- This is not the best, but it is another example of the basic syntax you might use to scrape web content: https://teamtreehouse.com/forum/im-stuck-on-how-to-integrate-a-nokogiri-scrape-into-my-rails-application