@philk
Created May 15, 2010 14:04
Web page word count parser
#!/usr/bin/env ruby
# Usage:
# ruby parser.rb "http://www.leapfile.com"
require "hpricot"
require "open-uri"
page = ARGV[0]
if page.nil?
  puts "Usage:"
  puts "ruby parser.rb \"http://www.leapfile.com\""
  exit
end
doc = Hpricot(open(page))
# Get all text elements
all_text = doc.search("*").grep(Hpricot::Text)
# Get rid of script, style, and doctype
text_only = all_text.select { |t| (t.parent.name != "script" && t.parent.name != "style" && t.parent.name != nil) }
# Convert elements to strings, strip whitespace (\n and \t), separate words, then flatten the array.
all_words = text_only.map { |t| t.to_s.strip.split(/[^a-zA-Z]/) }.flatten
# Get rid of blanks
words = all_words.select { |t| t != "" }
#puts words
puts words.size #=> Should return 502 on leapfile.com
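
The word-splitting step can be tried on its own, without Hpricot or a network fetch. The following is a minimal sketch using only core Ruby; `count_words` is a hypothetical helper name that does not appear in the original script, and it takes an already-extracted string rather than Hpricot text nodes.

```ruby
# Hypothetical helper (not part of the original script): counts words in a
# plain string using the same rule as the parser above.
def count_words(text)
  # Split on every non-letter character, as the script does. Adjacent
  # separators produce empty strings, so those are rejected before counting.
  text.strip.split(/[^a-zA-Z]/).reject(&:empty?).size
end

puts count_words("Hello, world! It's 2010.")
```

Because every non-letter is treated as a separator, digits are dropped and contractions such as "It's" are counted as two words ("It" and "s"). For this rule, `text.scan(/[a-zA-Z]+/).size` gives the same count without the intermediate empty strings.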