Created
December 16, 2008 15:47
Scrapes top 100 greatest movie characters with Ruby and Nokogiri
# = Top 100 Greatest Movie Characters Scraper
# This program scrapes the Empire Online 100
# greatest movie characters web site (http://www.empireonline.com/100-greatest-movie-characters/)
# and generates a simple page that displays all the
# characters at once, so you don't have to go
# clicking through 100 pages.
#
# I built this because I'd rather learn something new
# in the time it would take me to view all 100 characters
# and still get to see who they are. I'm too lazy to click
# through all of them.
# Author:: Jason Amster (mailto:[email protected])
# Copyright:: Copyright (c) 2008 Jason Amster
# License:: Distributed under the same terms as Ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# This class scrapes the pages and collects
# the relevant information. It can then generate HTML
# from that data.
class Top100
  BASE_URL = "http://www.empireonline.com/100-greatest-movie-characters/default.asp?c="

  def initialize
    @top100 = []
    @html = ""
  end

  # Iterates 100 times, scraping each page to collect the position (redundant),
  # the name of the character, and the image URL
  def scrape
    (1..100).each do |num|
      # URI.open comes from open-uri (Ruby 2.5+); a bare open no longer fetches URLs on Ruby 3.0+
      doc = Nokogiri::HTML(URI.open(BASE_URL + num.to_s))
      elements = doc.xpath('//head/title').first.inner_html.split("|")[1].split(". ")
      pos = elements.delete_at(0)
      name = elements.join(". ") # For the few names that have a period in them... lazy hack
      @top100 << {
        :pos => pos,
        :name => name, # already a String; String#to_a no longer exists as of Ruby 1.9
        :image => "http://www.empireonline.com/images/features/100greatestcharacters/photos/#{num}.jpg"
      }
    end
    @top100
  end

  # Checks whether the *@top100* array has been populated. If so, it returns it. If not,
  # it runs the *scrape* method
  def top100
    @top100.empty? ? scrape : @top100
  end

  # Generates simple HTML to display the scraped data
  def generate
    @html = "" # start fresh so repeated calls do not append duplicate markup
    top100.each do |entry|
      @html << <<-EOS
        <div class="entry">
          <h1>#{entry[:pos]}. #{entry[:name]}</h1>
          <div class="image">
            <img src="#{entry[:image]}" />
          </div>
        </div>
      EOS
    end
    @html
  end
end

top100 = Top100.new
puts top100.generate
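
The title parsing inside scrape can be tried without touching the network. The sketch below is not part of the original gist: parse_title is a hypothetical helper, and the sample title string is an assumed format (some site text, a "|", then "rank. name"), since the real page titles are not shown here. It also swaps the delete_at/join hack for split with a limit of 2, which keeps names containing ". " intact in a single step.

```ruby
# Hypothetical helper (not in the original gist): splits a page title
# into the rank and the character name.
def parse_title(title)
  # Keep only the text after the first "|" and trim surrounding whitespace.
  entry = title.split("|")[1].to_s.strip
  # A limit of 2 splits on the first ". " only, so a name such as
  # "Dr. Strangelove" is not broken apart.
  pos, name = entry.split(". ", 2)
  { :pos => pos, :name => name }
end

# Assumed title format, for illustration only.
sample = "Empire Features | 37. Dr. Strangelove"
result = parse_title(sample)
puts "#{result[:pos]}: #{result[:name]}"
```

If the real titles follow a different layout, only the two split calls would need adjusting; the rest of the class works with the resulting :pos and :name values.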