Created August 10, 2012 09:38
Burgestrand/3312972
Threaded scraping with Capybara, Webkit and Celluloid
source :rubygems

gem 'pry'
gem 'capybara'
gem 'capybara-webkit'
gem 'celluloid'

GEM
  remote: http://rubygems.org/
  specs:
    addressable (2.3.2)
    capybara (1.1.3)
      mime-types (>= 1.16)
      nokogiri (>= 1.3.3)
      rack (>= 1.0.0)
      rack-test (>= 0.5.4)
      selenium-webdriver (~> 2.0)
      xpath (~> 0.1.4)
    capybara-webkit (0.12.1)
      capybara (>= 1.0.0, < 1.2)
      json
    celluloid (0.12.3)
      facter (>= 1.6.12)
      timers (>= 1.0.0)
    childprocess (0.3.6)
      ffi (~> 1.0, >= 1.0.6)
    coderay (1.0.8)
    facter (1.6.13)
    ffi (1.1.5)
    json (1.7.5)
    libwebsocket (0.1.5)
      addressable
    method_source (0.8.1)
    mime-types (1.19)
    multi_json (1.3.6)
    nokogiri (1.5.5)
    pry (0.9.10)
      coderay (~> 1.0.5)
      method_source (~> 0.8)
      slop (~> 3.3.1)
    rack (1.4.1)
    rack-test (0.6.2)
      rack (>= 1.0)
    rubyzip (0.9.9)
    selenium-webdriver (2.25.0)
      childprocess (>= 0.2.5)
      libwebsocket (~> 0.1.3)
      multi_json (~> 1.0)
      rubyzip
    slop (3.3.3)
    timers (1.0.1)
    xpath (0.1.4)
      nokogiri (~> 1.3)

PLATFORMS
  ruby

DEPENDENCIES
  capybara
  capybara-webkit
  celluloid
  pry

desc "Start an interactive session with the search loaded."
task :console do
  exec 'bundle exec pry -r./search -I.'
end

require 'bundler/setup'
require 'pry'
require 'celluloid'
require 'capybara/dsl'
require 'capybara/webkit'
require 'cgi'
require 'uri'

Capybara.configure do |config|
  config.run_server = false
  config.default_driver = :webkit
end

class Search
  include Celluloid
  include Capybara::DSL

  class << self
    # Class-level macro: sets the base URL when given an argument,
    # returns the current one otherwise.
    def href(href = nil)
      @href = href if href
      @href
    end
  end

  def initialize(href = self.class.href)
    @base_href = URI(href.to_s)

    # Capybara requires all absolute URLs to start with http.
    unless @base_href.scheme =~ /^http/
      raise ArgumentError, "base_href must be of http(s) scheme"
    end

    # Overridden, to make sure we have one session per actor.
    @page = Capybara::Session.new(Capybara.default_driver)

    # Configuration things are nice.
    yield self if block_given?
  end

  protected

  attr_reader :base_href
  attr_reader :page

  public

  # Overridden to avoid Capybara going to the server app_host
  # when given relative URLs.
  def visit(url)
    url = URI(url)
    base_href.path = url.path
    base_href.query = url.query
    Celluloid.logger.info "Visiting #{base_href}"
    super(base_href.to_s)
  end

  def title
    find('head title').text
  end
end

class Google < Search
  href 'https://www.google.com/'

  def search(query)
    visit('/search?q=%s' % CGI.escape(query))

    all("h3.r a").map do |link|
      { title: link.text, url: link[:href].sub(%r|\A/url\?q=|, "") }
    end
  end
end
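
The URL handling in the Search class above can be tried without Capybara, WebKit or Celluloid installed. The following is a minimal, stdlib-only sketch of the class-level `href` macro and of the path/query rebasing that `visit` performs. The names `UrlBuilder`, `ExampleSite` and `url_for` are hypothetical and do not appear in the gist. One deliberate difference is that `url_for` works on a copy of the base URL, whereas the original `visit` mutates `base_href` in place.

```ruby
require 'uri'
require 'cgi'

# Hypothetical, Capybara-free stand-in for the gist's Search class.
class UrlBuilder
  class << self
    # Same shape as Search.href: setter when given an argument, getter
    # otherwise. The value is stored in a class-level instance variable,
    # so each class keeps its own value and subclasses do not inherit it.
    def href(href = nil)
      @href = href if href
      @href
    end
  end

  def initialize(href = self.class.href)
    @base_href = URI(href.to_s)

    # Same scheme check as in Search#initialize.
    unless @base_href.scheme =~ /^http/
      raise ArgumentError, "base_href must be of http(s) scheme"
    end
  end

  # Builds the absolute URL that Search#visit would pass on to Capybara.
  # Assumption: copying the base first is our choice, not the gist's.
  def url_for(relative)
    relative = URI(relative)
    target = @base_href.dup
    target.path = relative.path
    target.query = relative.query
    target.to_s
  end
end

# Hypothetical subclass, configured the same way as the gist's Google class.
class ExampleSite < UrlBuilder
  href 'https://www.google.com/'
end

site = ExampleSite.new
puts site.url_for('/search?q=%s' % CGI.escape('ruby celluloid'))
# prints "https://www.google.com/search?q=ruby+celluloid"

p UrlBuilder.href
# prints nil, because the base class never had an href set
```

Because the base class has no `href` of its own, calling `UrlBuilder.new` without an argument raises the `ArgumentError` from the scheme check, which is presumably also what happens with `Search.new` in the gist.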