Last active August 15, 2018 12:22
Web crawler helper class based on Poltergeist (PhantomJS). Using Capybara as a framework for building web crawlers is surprisingly convenient.
class ExampleCrawler < PoltergeistCrawler
  def crawl
    visit "https://news.ycombinator.com/"
    click_on "More"
    page.evaluate_script("window.location = '/'")
  end
end

ExampleCrawler.new.crawl
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'nokogiri'

class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        js_errors: false,
        inspector: false,
        # Discard PhantomJS output if you don't care about JS errors/console.logs.
        # The logger is written to, so the null device must be opened for writing.
        phantomjs_logger: File.open(File::NULL, 'w')
      })
    end
    # Capybara 2.5+ names this setting default_max_wait_time (default_wait_time is deprecated)
    Capybara.default_max_wait_time = 3
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # Handy for peeking at what the browser is doing right now
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end

  # find("path") and all("path") work fine for most cases. Sometimes I need
  # more control, like finding hidden fields, so parse the raw page source.
  def doc
    Nokogiri.parse(page.body)
  end
end
Worked with this Gemfile.lock:
GEM
  remote: https://rubygems.org/
  specs:
    addressable (2.4.0)
    capybara (2.7.1)
      addressable
      mime-types (>= 1.16)
      nokogiri (>= 1.3.3)
      rack (>= 1.0.0)
      rack-test (>= 0.5.4)
      xpath (~> 2.0)
    cliver (0.3.2)
    coderay (1.1.1)
    jsoner (0.0.4)
    method_source (0.8.2)
    mime-types (3.0)
      mime-types-data (~> 3.2015)
    mime-types-data (3.2016.0221)
    mini_portile2 (2.0.0)
    multi_json (1.12.0)
    nokogiri (1.6.7.2)
      mini_portile2 (~> 2.0.0.rc2)
    poltergeist (1.9.0)
      capybara (~> 2.1)
      cliver (~> 0.3.1)
      multi_json (~> 1.0)
      websocket-driver (>= 0.2.0)
    pry (0.10.3)
      coderay (~> 1.1.0)
      method_source (~> 0.8.1)
      slop (~> 3.4)
    rack (1.6.4)
    rack-test (0.6.3)
      rack (>= 1.0)
    slop (3.6.0)
    websocket-driver (0.6.3)
      websocket-extensions (>= 0.1.0)
    websocket-extensions (0.1.2)
    xpath (2.0.0)
      nokogiri (~> 1.3)

PLATFORMS
  ruby

DEPENDENCIES
  capybara
  jsoner
  poltergeist
  pry

BUNDLED WITH
   1.11.2
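For anyone reproducing this setup, a Gemfile along these lines should resolve to something close to the lockfile above. The four gem names come from its DEPENDENCIES section. The version pins are an assumption, copied from the resolved versions; the original Gemfile appears to have had no constraints.

```ruby
# Sketch of a Gemfile matching the lockfile's DEPENDENCIES section.
# The exact pins are assumed from the resolved versions, not from the gist.
source 'https://rubygems.org'

gem 'capybara', '2.7.1'
gem 'poltergeist', '1.9.0'
gem 'jsoner'
gem 'pry'
```

Poltergeist drives an external PhantomJS binary, which has to be installed separately and be on the `PATH`.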
thanks for this!
What's the Ruby version? What are the Capybara and Poltergeist versions? This looks amazing!