@jamescook
Last active December 25, 2015 22:39
require 'mechanize'
URL_FILE = "qc_urls"
OUTPUT_FILE = "qc_urls_out"
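# Reads a list of URLs and appends one summary line per URL to the output file.
class QuantcastCrawler
  def initialize(file_path, output_path)
    @file_path = file_path
    @output_path = output_path
  end

  def process
    File.open(@output_path, 'a') do |file|
      urls.each do |url|
        start = Time.now
        url_info = QuantcastURL.new(url)
        url_info.crawl
        file.puts url_info.summary
        # Progress log: timestamp, elapsed seconds, URL
        puts "#{Time.now} - #{Time.now - start} - #{url}"
      end
    end
  end

  private

  # One URL per line; strip trailing newlines and skip blank lines.
  def urls
    @urls ||= File.readlines(@file_path).map(&:strip).reject(&:empty?)
  end
end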
# Fetches the Quantcast page for one URL and extracts visitor counts and rank.
class QuantcastURL
  def initialize(url)
    # Lines read from a file carry a trailing newline; remove it before building the request URL.
    @url = url.strip
  end

  def crawl
    agent.get "https://www.quantcast.com/#{@url.downcase.gsub(/https?:\/\//, "")}"
    @tr = agent.page.search("#wunit-hierarchy-table tr")
  rescue Mechanize::ResponseCodeError
    @summary = "#{@url.gsub(/\s+/, "")}^ERROR^ERROR^ERROR^ERROR^ERROR"
  end

  # Caret-delimited: url^child US^child global^rank^parent US^parent global
  def summary
    @summary ||= "#{@url.gsub(/\s+/, "")}^#{find_child_us_visitors}^#{find_child_global_visitors}^#{find_rank}" +
      "^#{find_parent_us_visitors}^#{find_parent_global_visitors}"
  end

  private

  def clean(result)
    result.to_s.gsub(/\s+/, '').gsub(/US|Global/, '')
  end

  def agent
    @agent ||= Mechanize.new { |agent| agent.user_agent_alias = "Mac Safari" }
  end

  # When the table has three rows, row 1 is the parent and row 2 is the child;
  # otherwise row 1 is the site itself.
  def parent_row?
    @tr.count == 3
  end

  # Text of the given cell, or "" when the row or cell is missing.
  def cell(row, col)
    td = @tr[row] && @tr[row].search("td")[col]
    td ? td.text : ""
  end

  def find_parent_us_visitors
    parent_row? ? clean(cell(1, 0)) : ""
  end

  def find_parent_global_visitors
    parent_row? ? clean(cell(1, 1)) : ""
  end

  def find_child_us_visitors
    clean(cell(parent_row? ? 2 : 1, 0))
  end

  # Reads row 2 when a parent row exists, matching find_child_us_visitors
  # (the original read row 1 in both cases).
  def find_child_global_visitors
    clean(cell(parent_row? ? 2 : 1, 1))
  end

  def find_rank
    pre = agent.page.search("#siteStats/li/h4/a/span/strong")[0]
    pre ? clean(pre.text) : ""
  end
end
# Run it
QuantcastCrawler.new(URL_FILE, OUTPUT_FILE).process
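
The crawler appends one caret-delimited line per URL to the output file. A minimal sketch of reading such a line back, assuming the column order produced by `summary` (URL, child US, child global, rank, parent US, parent global). The `SummaryRow` struct and `parse_summary_line` helper are hypothetical names for illustration and are not part of the script above:

```ruby
# Hypothetical helper, not part of the original script.
# Column order is assumed to match QuantcastURL#summary.
SummaryRow = Struct.new(:url, :child_us, :child_global, :rank, :parent_us, :parent_global)

def parse_summary_line(line)
  # A limit of -1 keeps trailing empty fields, which occur when a site has no parent row.
  fields = line.chomp.split("^", -1)
  return nil unless fields.size == 6

  SummaryRow.new(*fields)
end
```

Lines written by the error path also have six fields, so they parse the same way with `"ERROR"` in every column after the URL.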
drwl commented Oct 20, 2013

Thanks for forking my script. I learned a couple of things and used that knowledge in the new revision.