@run26kimo
Forked from zealot128/crawler.rb
Last active March 15, 2016 03:07
Web crawler helper class based on Poltergeist (PhantomJS). Using Capybara as a framework for building web crawlers is surprisingly convenient.
class Crawler < PoltergeistCrawler
  def crawl
    visit "https://news.ycombinator.com/"
    click_on "More"
    #page.evaluate_script("window.location = '/'")
    doc = Nokogiri::HTML.parse(page.body, nil, 'utf-8')
    doc.search(".itemlist .title a").first.text
  end
end

Crawler.new.crawl
gem 'capybara'
gem 'poltergeist'
require 'httparty'
require 'nokogiri'

def execute
  # Crawling the pixnet page-view count requires simulating a mobile device
  url = 'http://kikichang.pixnet.net/blog/post/30178335'
  # Note: this user agent string identifies as iPhone OS 4.0
  iphone_user_agent = "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_0likeMacOSX;en-us)AppleWebKit/532.9(KHTML,likeGecko)Version/4.0.5Mobile/8A293Safari/6531.22.7"
  page = HTTParty.get(url, headers: { "User-Agent" => iphone_user_agent })
  doc = Nokogiri::HTML.parse(page.body, nil, 'utf-8')
  selector = '#main > div.article-header > div.article-info.cf > span.favorite-show'
  # Returns the page-view count as an Integer (0 if the text has no leading digits)
  doc.search(selector).text.to_i
end
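The `text.to_i` call above stops at the first non-digit character, so a count rendered as "1,234" would come back as 1, and a count preceded by a label would come back as 0. A minimal sketch of a more tolerant parser follows; `parse_count` is a hypothetical helper, not part of the original gist, and the sample strings are illustrative rather than real pixnet markup.

```ruby
# Hypothetical helper (an assumption, not in the original gist):
# extract the first integer from a text fragment, tolerating
# thousands separators and any leading label text.
def parse_count(text)
  match = text.to_s.match(/\d[\d,]*/)  # first run of digits and commas
  return nil unless match              # no digits at all: signal "not found"
  match[0].delete(',').to_i
end

parse_count('1,234')      # => 1234 (plain '1,234'.to_i would give 1)
parse_count('Views: 42')  # => 42   (plain 'Views: 42'.to_i would give 0)
parse_count('')           # => nil
```

Returning nil rather than 0 when no digits are found makes it possible to tell "selector matched nothing" apart from a genuine count of zero.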
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'nokogiri'
class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        js_errors: false,
        inspector: false,
        # Discard PhantomJS output if you don't care about JS errors/console.logs.
        # The logger is written to, so the null device must be opened for writing.
        phantomjs_logger: File.open(File::NULL, 'w')
      })
    end
    Capybara.default_max_wait_time = 60
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
      # mobile user agent: "User-Agent" => "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_0likeMacOSX;en-us)AppleWebKit/532.9(KHTML,likeGecko)Version/4.0.5Mobile/8A293Safari/6531.22.7"
    }
  end

  # Handy for peeking at what the browser is doing right now
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end

  # find("path") and all("path") work fine in most cases. Sometimes more control
  # is needed, such as finding hidden fields, so expose the page as a Nokogiri document.
  def doc
    # Parse explicitly as HTML; Nokogiri.parse may fall back to the XML parser
    Nokogiri::HTML(page.body)
  end
end
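The commented-out line in the headers hash suggests switching between a desktop and a mobile user agent by hand. One way to make that switch explicit is sketched below; `crawler_headers`, `DESKTOP_UA`, and `MOBILE_UA` are hypothetical names introduced for illustration, and the two user agent strings are copied unchanged from the class above.

```ruby
# Hypothetical constants and helper (assumptions, not in the original gist).
DESKTOP_UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
MOBILE_UA  = "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_0likeMacOSX;en-us)AppleWebKit/532.9(KHTML,likeGecko)Version/4.0.5Mobile/8A293Safari/6531.22.7"

# Build the request headers hash, choosing the user agent by flag.
def crawler_headers(mobile: false)
  {
    "DNT" => 1,
    "User-Agent" => mobile ? MOBILE_UA : DESKTOP_UA
  }
end

# Possible use inside PoltergeistCrawler#initialize:
#   page.driver.headers = crawler_headers(mobile: true)
```

Keeping both strings in named constants avoids editing a comment each time a site needs to be crawled as a mobile device.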