@brodock
Created March 14, 2014 15:27
This snippet reads a list of article URLs from a text file (one URL per line), counts the words in each article, and sums the counts at the end.
require 'mechanize'
require 'readability'
require 'progress'
require 'nokogiri'
# One URL per line; strip whitespace and skip blank lines
list = File.readlines('./urls_to_count.txt').map(&:strip).reject(&:empty?)
agent = Mechanize.new
articles = []
list.with_progress("Loading articles") do |url|
  # Fetch the raw page and let Readability extract the main article HTML
  source = agent.get(url).content
  article = Readability::Document.new(source).content
  articles << article
  sleep(3) # pause between requests
end
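# Reduce each article's HTML to plain text
articles.collect! do |article|
  doc = Nokogiri::HTML(article)
  # Join the text nodes with spaces so words in adjacent elements stay separate
  doc.xpath("//text()").map(&:text).join(" ")
end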
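# Sum the per-article word counts; the initial 0 covers an empty URL list
wordcount = articles.map { |article| article.split.size }.inject(0) { |sum, count| sum + count }
puts "Total words: #{wordcount}"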
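
The counting step can be tried on its own, without any network access or gems. The following is a minimal sketch, not part of the original gist: the helper name `word_counts` and the sample strings are hypothetical stand-ins for the extracted article texts. It also keeps the per-article counts around, which may be handy when checking the total.

```ruby
# Hypothetical helper (not in the original script): count the
# whitespace-separated words in each text.
def word_counts(texts)
  texts.map { |text| text.split.size }
end

# Hypothetical sample data standing in for the extracted article texts.
sample_articles = [
  "Ruby is a dynamic language",
  "  Whitespace   is collapsed\nby split  ",
  ""
]

counts = word_counts(sample_articles)
# Start the sum at 0 so an empty list gives 0 instead of nil.
total = counts.inject(0) { |sum, count| sum + count }

counts.each_with_index { |count, i| puts "article #{i + 1}: #{count} words" }
puts "total: #{total} words"
```

`String#split` with no arguments splits on runs of whitespace and ignores leading and trailing whitespace, so extra spaces and newlines left over from the HTML do not inflate the counts.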