@hollanddd
Created December 5, 2012 18:39
screen scraping practice with nokogiri
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# Scrape today's deal from wine.woot.com and print each field with a label.
def wine_woot(url)
  # URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  doc = Nokogiri::HTML(URI.open(url))
  data = doc.xpath('//*[@id="summary"]/div')
  puts doc.xpath('html/head/title').text + ' as site_name'
  puts url + data.xpath('hgroup/a/@href').to_s + ' as site_href'
  puts doc.xpath('//*[@id="todays-deal"]/a/img/@src').to_s + ' as img_src'
  puts data.at_css('h2.fn').text + ' as wine_name'
  puts data.at_css('.price').text + ' as price'
end
# Scrape the current offer from cinderellawine.com.
def cinderella_wine(url)
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.xpath('html/head/title').text.split('-').first.strip + ' as site_name'
  puts "#{url} as site_href"
  puts doc.xpath('//*[@id="bottle-shot"]/a/img/@src').to_s + ' as img_src'
  puts doc.xpath('//*[@id="title"]/h2/a').text + ' as wine_name'
  # The price is split across two elements: dollars and cents.
  dollars = doc.xpath('//*[@id="product-dollars"]').text.strip
  cents = doc.xpath('//*[@id="product-cents"]').text.strip
  puts "#{dollars}.#{cents} as price"
end
# Scrape the current offer from lastcallwines.com.
def last_call_wines(url)
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.xpath('html/head/title').text.split('|').last.strip + ' as site_name'
  puts "#{url} as site_href"
  puts "#{doc.xpath('//*[@id="divProductPic"]/img/@src')} as img_src"
  puts "#{doc.at_css('h3.ProductNameText2').text} as wine_name"
  # '>' is the standard CSS child combinator.
  puts "#{doc.at_css('td.SalePrice > div.price').text.strip} as price"
end
# Scrape the current offer from bacchusselections.com.
def bacchus_selections(url)
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.xpath('html/head/title').text.split('-').last.strip + ' as site_name'
  puts "#{url} as site_href"
  # Drop the query string from the image path before appending it to the site URL.
  puts "#{url}#{doc.xpath('//*[@id="content"]/div[1]/div[2]/a/img/@src').to_s.split('?').first} as img_src"
  puts "#{doc.xpath('//*[@id="content"]/div[1]/div[1]/h2/a').text} as wine_name"
  puts "#{doc.xpath('//*[@id="content"]/div[1]/div[3]/p[1]').text.strip} as price"
end
# Scrape the current offer from thewinespies.com.
def wine_spies(url)
  doc = Nokogiri::HTML(URI.open(url))
  puts doc.xpath('html/head/title').text.split('-').first.strip + ' as site_name'
  puts "#{url} as site_href"
  puts "#{url}#{doc.xpath('//*[@id="wine-thumb"]/@src').to_s.split('?').first} as img_src"
  # split.join(' ') collapses runs of whitespace inside the wine name.
  puts "#{doc.at_css('h2#wine-name').text.strip.split.join(' ')} as wine_name"
  puts "#{doc.at_css('td.our-price > div.inner').text.strip} as price"
end
urls = %w[
  http://www.wine.woot.com
  http://www.cinderellawine.com
  http://www.lastcallwines.com
  http://www.bacchusselections.com
  http://www.thewinespies.com
  http://www.cellarthief.com
]

# Dispatch each site to its scraper.
urls.each do |u|
  case u
  when 'http://www.wine.woot.com'
    wine_woot u
  when 'http://www.cinderellawine.com'
    cinderella_wine u
  when 'http://www.lastcallwines.com'
    last_call_wines u
  when 'http://www.bacchusselections.com'
    bacchus_selections u
  when 'http://www.thewinespies.com'
    wine_spies u
  when 'http://www.cellarthief.com'
    # cellar_thief is not defined in this gist; warn instead of raising NoMethodError.
    warn "no scraper defined for #{u}"
  end
end
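
The URL-to-scraper dispatch above could also be written as a lookup table. What follows is only a sketch and not part of the original gist: `SCRAPERS`, `scraper_for`, and `run_scrapers` are hypothetical names introduced for illustration, and the sketch assumes the scraper methods are defined in the same file. It does no network access itself.

```ruby
# Hypothetical lookup table (not in the original gist):
# site URL => name of the scraper method to call.
SCRAPERS = {
  'http://www.wine.woot.com'         => :wine_woot,
  'http://www.cinderellawine.com'    => :cinderella_wine,
  'http://www.lastcallwines.com'     => :last_call_wines,
  'http://www.bacchusselections.com' => :bacchus_selections,
  'http://www.thewinespies.com'      => :wine_spies,
  'http://www.cellarthief.com'       => :cellar_thief
}.freeze

# Returns the scraper method name for a URL, or nil when the URL is unknown
# or the method is not defined on the receiver (as with cellar_thief here).
def scraper_for(url, receiver = self)
  name = SCRAPERS[url]
  name if name && receiver.respond_to?(name, true)
end

# Calls the matching scraper for each URL and warns about any URL without one.
def run_scrapers(urls, receiver = self)
  urls.each do |u|
    name = scraper_for(u, receiver)
    if name
      receiver.send(name, u)
    else
      warn "no scraper defined for #{u}"
    end
  end
end
```

With this in place, `run_scrapers(SCRAPERS.keys)` would stand in for the `each`/`case` loop, and supporting a new site means adding one table entry plus its method.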