@skorfmann
Created November 28, 2010 13:31
spider.rb
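require 'rubygems'
require 'anemone'

count = 0

# Store crawled pages in a Tokyo Cabinet file rather than in memory
# (requires the tokyocabinet gem).
Anemone.crawl("http://www.immobilienscout24.de/immobiliensuche/wohnen-auf-zeit/nordrhein-westfalen.htm", :storage => Anemone::Storage.TokyoCabinet("/tmp/spidr.tch")) do |anemone|
  # Only follow links that contain "wohnen-auf-zeit" or end in a numeric ID.
  anemone.focus_crawl do |page|
    page.links.select do |link|
      link.request_uri =~ /(wohnen-auf-zeit)|(\/\d+$)/
    end
  end

  # Process pages whose URL ends in a numeric ID.
  anemone.on_pages_like(/\/\d+$/) do |page|
    count += 1
    puts "scanning site number #{count} #{page.url}"
    # page.doc is nil when the response is not HTML.
    puts page.doc.css('.is24-headline-type01f').text if page.doc
  end
end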
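
The link filter passed to focus_crawl can be tried out without running a crawl. The following is a minimal sketch using only Ruby's standard uri library. The helper name focus_links and the second and third sample URLs are assumptions made up for illustration; they are not part of Anemone or of the original script.

```ruby
require 'uri'

# Hypothetical helper (not part of Anemone): applies the same pattern as the
# focus_crawl block in spider.rb to an array of URI objects.
def focus_links(links)
  links.select { |link| link.request_uri =~ /(wohnen-auf-zeit)|(\/\d+$)/ }
end

# Sample URLs; the last two are invented for illustration only.
samples = [
  "http://www.immobilienscout24.de/immobiliensuche/wohnen-auf-zeit/nordrhein-westfalen.htm",
  "http://www.immobilienscout24.de/12345",
  "http://www.immobilienscout24.de/impressum.htm"
].map { |s| URI.parse(s) }

kept = focus_links(samples)
kept.each { |uri| puts uri }
```

Because request_uri includes the query string, a URL with a numeric ID followed by query parameters would not match the second alternative of the pattern.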