Created March 14, 2014 15:27
This snippet reads a list of article URLs from a text file (one URL per line), counts the words in each article, and sums the counts at the end.
require 'mechanize'
require 'readability'
require 'progress'
require 'nokogiri'

# One URL per line; skip blank lines
list = File.readlines('./urls_to_count.txt').map { |line| line.strip }.reject { |line| line.empty? }
agent = Mechanize.new
articles = []

list.with_progress("Loading articles") do |url|
  source = agent.get(url).content
  # Readability extracts the main article body as HTML
  article = Readability::Document.new(source).content
  articles << article
  sleep(3) # pause between requests to avoid hammering the servers
end

# Strip the remaining markup, keeping only the text
articles.collect! do |article|
  Nokogiri::HTML(article).text
end

# Count whitespace-separated words per article and sum them (0 if the list is empty)
wordcount = articles.map { |article| article.split.size }.inject(0) { |sum, count| sum + count }
puts "Total words: #{wordcount}"
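
The counting step can also be tried on its own, without any network access. The following is a minimal sketch using only the Ruby standard library. The helper names `word_counts` and `total_words` and the sample data are hypothetical and not part of the original snippet. It assumes that "words" means whitespace-separated tokens, as in the `split.size` call above, and it keeps a per-article breakdown alongside the total.

```ruby
# Hypothetical helper: maps each label (e.g. a URL) to the number of
# whitespace-separated words in its text.
def word_counts(texts)
  texts.each_with_object({}) do |(label, text), counts|
    counts[label] = text.split.size
  end
end

# Hypothetical helper: sums the per-article counts; returns 0 for an empty hash.
def total_words(counts)
  counts.values.inject(0) { |sum, count| sum + count }
end

# Made-up sample data standing in for already-extracted article text.
sample = {
  'http://example.com/a' => "Ruby makes scraping   pleasant.",
  'http://example.com/b' => "Two\nlines here"
}

counts = word_counts(sample)
counts.each { |label, count| puts "#{label}: #{count}" }
puts "Total words: #{total_words(counts)}"
```

Seeding `inject` with `0` means an empty URL list yields a total of 0 rather than `nil`.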