@BaseCase
Last active August 13, 2016 20:34
Quick and dirty word frequency count for (more or less) just the text part of an HTML file
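#!/usr/bin/env ruby

# Counts how often each word appears in the text portion of an HTML document.
class WordFrequencyCounter
  # Characters that end a word; "<" and ">" also switch tag skipping on and off.
  DELIMITERS = [
    " ",
    "\n",
    "<",
    ">",
    ".",
    "?",
    "!",
    ",",
    ";",
    ":",
    "—",
    "(",
    ")",
    "{",
    "}",
    "@",
  ].freeze

  def initialize(input_string)
    @input = input_string
    @counts = Hash.new(0)
  end

  # Returns a Hash mapping each lowercased word to its number of occurrences.
  def count
    tokenized.map(&:downcase).each do |word|
      @counts[word] += 1
    end
    @counts
  end

  private

  # Splits the input into words, skipping everything between "<" and ">".
  def tokenized
    inside_tag = false
    current_word = []
    words = []
    @input.each_char do |c|
      if DELIMITERS.include?(c)
        inside_tag = true if c == "<"
        inside_tag = false if c == ">"
        if !inside_tag && !current_word.empty?
          words << current_word.join
          current_word.clear
        end
      elsif !inside_tag
        current_word << c
      end
    end
    # Flush the last word when the input does not end with a delimiter.
    words << current_word.join unless current_word.empty?
    words
  end
end

if __FILE__ == $PROGRAM_NAME
  wordcounts = WordFrequencyCounter.new(ARGF.read).count
  wordcounts.each do |word, count|
    puts "#{count} #{word}"
  end
end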
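
For comparison, here is a minimal sketch of the same idea using regular expressions instead of the character-by-character loop. It is not the gist's method: the names `DELIMITER_PATTERN` and `word_counts` are made up for illustration, and edge cases (for example an unclosed `<`) may be handled differently from the script above.

```ruby
# Hypothetical regex-based variant; same delimiter set as the gist's DELIMITERS.
DELIMITER_PATTERN = /[ \n<>.?!,;:—(){}@]+/

# Returns a Hash of lowercased word => occurrence count for the given HTML string.
def word_counts(html)
  # Replace anything that looks like a complete tag with a space.
  text = html.gsub(/<[^>]*>/, " ")
  text.split(DELIMITER_PATTERN)
      .reject(&:empty?)
      .each_with_object(Hash.new(0)) { |word, counts| counts[word.downcase] += 1 }
end

p word_counts("<p>Hello, hello world!</p>")
```

For that sample input, "hello" is counted twice and "world" once. Like the original script, this sketch still counts the contents of `<style>` and `<script>` elements as text.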
BaseCase commented Aug 13, 2016

When I run something like this:

cat index.html | html_word_frequency.rb | sort -n

then I get output like this (truncated from actual):

1 #eee
1 1
1 1px
1 5em
1 affects
1 after
1 all
1 also
1 an
1 anything
1 brant
1 can’t
2 adorable
2 and
2 at
2 blog
2 by
2 casey
2 doesn’t
2 first
2 guide
2 how-to
2 in
5 about
5 is
6 this
6 you
7 i
8 of
9 talk
9 that
11 a
12 the
13 to
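
If piping through `sort -n` is not an option, the ordering can be approximated in Ruby. This is only a sketch: the `counts` hash holds hypothetical sample data standing in for the result of `WordFrequencyCounter#count`, and `sorted_lines` is a made-up helper name. Ties are broken alphabetically by word, which may not match `sort` exactly in every locale.

```ruby
# Hypothetical sample data, standing in for WordFrequencyCounter#count output.
counts = { "the" => 12, "talk" => 9, "a" => 11, "blog" => 2, "and" => 2 }

# Builds "count word" lines ordered by count ascending, then by word.
def sorted_lines(counts)
  counts.sort_by { |word, n| [n, word] }
        .map { |word, n| "#{n} #{word}" }
end

puts sorted_lines(counts)
```

Sorting on the `[n, word]` pair keeps the numeric comparison on the count itself, so 9 sorts before 11 rather than after it as a string comparison would.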
