Web crawler helper class based on Poltergeist (PhantomJS). Using Capybara as a framework for building web crawlers is surprisingly convenient.
class Crawler < PoltergeistCrawler
  def crawl
    visit "https://news.ycombinator.com/"
    click_on "More"
    # page.evaluate_script("window.location = '/'")
    # Parse the rendered page and return the text of the first story link
    doc = Nokogiri::HTML.parse(page.body, nil, 'utf-8')
    doc.search(".itemlist .title a").first.text
  end
end

Crawler.new.crawl
gem 'capybara'
gem 'poltergeist'
require 'httparty'
require 'nokogiri'

def execute
  # Crawling pixnet page views requires simulating a mobile device
  url = 'http://kikichang.pixnet.net/blog/post/30178335'
  iphone_6_user_agent = "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_0likeMacOSX;en-us)AppleWebKit/532.9(KHTML,likeGecko)Version/4.0.5Mobile/8A293Safari/6531.22.7"
  page = HTTParty.get(url, headers: { "User-Agent" => iphone_6_user_agent })
  doc = Nokogiri::HTML.parse(page.body, nil, 'utf-8')
  selector = '#main > div.article-header > div.article-info.cf > span.favorite-show'
  page_view_count = doc.search(selector).text.to_i
end
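The `text.to_i` call above only reads leading digits, so a scraped count such as "1,234" would come back as 1. Whether pixnet actually formats the counter with separators or a label is an assumption here, not something the gist confirms. A minimal sketch of a more tolerant conversion, using a hypothetical `parse_count` helper that is not part of the original gist:

```ruby
# Hypothetical helper (not in the original gist): extract the first run of
# digits from scraped text, tolerating thousands separators and labels.
def parse_count(text)
  match = text.to_s.match(/\d[\d,]*/)
  return nil unless match # no digits found at all

  match[0].delete(",").to_i
end

puts "1,234".to_i                        # String#to_i stops at the comma
puts parse_count("1,234")                # separators removed before converting
puts parse_count("Views: 56,789 today")  # surrounding label text is ignored
p parse_count("")                        # nil signals "no count found"
```

Returning `nil` rather than 0 when nothing matches makes a changed selector or an empty page distinguishable from a post that really has zero views.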
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'nokogiri'

class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        js_errors: false,
        inspector: false,
        # if you don't care about JS errors/console.logs; opened for writing so the output can be discarded
        phantomjs_logger: File.open(File::NULL, 'w')
      })
    end
    Capybara.default_max_wait_time = 60
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
      # mobile user agent: "User-Agent" => "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_0likeMacOSX;en-us)AppleWebKit/532.9(KHTML,likeGecko)Version/4.0.5Mobile/8A293Safari/6531.22.7"
    }
  end

  # handy to peek into what the browser is doing right now
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end

  # find("path") and all("path") work OK for most cases. Sometimes I need more control, like finding hidden fields
  def doc
    Nokogiri.parse(page.body)
  end
end
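The headers hash in `initialize` switches between a desktop and a mobile user agent by commenting a line in or out. One possible alternative, sketched here with a hypothetical `crawler_headers` helper and constant names that are not part of the original gist, is to select the user agent with a keyword argument. Both user-agent strings are copied verbatim from the code above.

```ruby
# Hypothetical constants and helper (not in the original gist).
DESKTOP_UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
MOBILE_UA = "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_0likeMacOSX;en-us)AppleWebKit/532.9(KHTML,likeGecko)Version/4.0.5Mobile/8A293Safari/6531.22.7"

# Build the request headers, choosing the user agent by flag instead of
# editing a commented-out line.
def crawler_headers(mobile: false)
  {
    "DNT" => 1,
    "User-Agent" => mobile ? MOBILE_UA : DESKTOP_UA
  }
end

puts crawler_headers["User-Agent"]
puts crawler_headers(mobile: true)["User-Agent"]
```

Inside the class this could be wired in as `page.driver.headers = crawler_headers(mobile: true)`, assuming the helper is defined where `initialize` can reach it.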