This gist provides a JSON file listing the most frequently used words in Hillary Clinton's email corpus.
I used the Emails.csv file from the open-source corpus [1] (~7,000 emails) released by Kaggle [2].
I ran a crude Ruby tokenizer (code below) over the corpus and wrote the results out as a JSON blob.
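require "json"

# Crude tokenizer: splits on whitespace, skips chunks that contain no
# letters or digits, and trims leading/trailing punctuation.
class Tokenizer
  def self.parse(s)
    new.parse s
  end

  def parse(s)
    tokens = []
    s.split(/\s+/).each do |e|
      next if non_word(e)
      tokens << remove_punctuation(e).downcase.to_sym
    end
    tokens
  end

  private

  # True when the chunk is empty or has no alphanumeric character.
  def non_word(t)
    t.nil? || t.empty? || t.index(/[a-zA-Z\d]/).nil?
  end

  # Keeps the span from the first to the last alphanumeric character,
  # so inner punctuation (e.g. "it's") is preserved.
  def remove_punctuation(t)
    start = t.index(/[a-zA-Z\d]/)
    finish = t.rindex(/[a-zA-Z\d]/)
    t[start..finish]
  end
end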
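
# Count every token across the whole CSV file, line by line.
token_list = Hash.new 0
total_tokens = 0
File.open("./Emails.csv", "r") do |f|
  f.each_line do |line|
    puts line # echo each raw line to stdout while processing
    Tokenizer.parse(line).each do |t|
      token_list[t] += 1
      total_tokens += 1
    end
  end
end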

# Summarise the counts, most frequent token first, and write them out as JSON.
output = {
  :unique => token_list.size,
  :total => total_tokens,
  :tokens => token_list.sort_by { |_, v| -v }.to_h
}
File.open("./token_list.json", "w") do |f|
  f.write(output.to_json)
end
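
To illustrate what the tokenizer keeps and drops, here is a small self-contained sketch. The helper name `tokenize` and the sample sentence are made up for illustration; the splitting and trimming rules are meant to mirror the Tokenizer class above, not replace it.

```ruby
# Illustrative only: `tokenize` is a hypothetical stand-alone helper that
# mirrors the Tokenizer rules (split on whitespace, skip chunks with no
# letter or digit, trim leading/trailing punctuation, downcase).
WORD_CHAR = /[a-zA-Z\d]/

def tokenize(text)
  text.split(/\s+/).each_with_object([]) do |chunk, tokens|
    start = chunk.index(WORD_CHAR)
    next if start.nil? # chunk is empty or pure punctuation, e.g. "--"
    finish = chunk.rindex(WORD_CHAR)
    tokens << chunk[start..finish].downcase.to_sym
  end
end

# Made-up sample sentence, not taken from the corpus.
sample = %q{"Hello," she said -- it's 9:30 (EST)!}
p tokenize(sample)
```

Only punctuation at the edges of a chunk is trimmed, so `it's` and `9:30` survive as single tokens, while `--` is dropped entirely. Because the script feeds raw CSV lines to the tokenizer, fields joined by commas with no surrounding whitespace would likewise end up as one token.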
Link to the formatted JSON output: [3]
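
Once the JSON blob exists, pulling out the top words takes a few lines. The sketch below assumes the file has the same `unique`, `total`, and `tokens` keys that the script above writes. The helper name `top_tokens` and the inline sample data are made up for illustration and are not real counts from the corpus.

```ruby
require "json"

# Hypothetical helper: return the n most frequent tokens, each with its
# count and its percentage share of all tokens, from the JSON text.
def top_tokens(json_text, n = 10)
  data = JSON.parse(json_text)
  total = data["total"].to_f
  # Re-sort by count in case key order was lost along the way.
  ranked = data["tokens"].sort_by { |_, count| -count }
  ranked.first(n).map do |word, count|
    [word, count, (100.0 * count / total).round(2)]
  end
end

# Made-up sample in the same shape as token_list.json (not real counts).
sample = '{"unique":3,"total":10,"tokens":{"the":6,"to":3,"pm":1}}'

top_tokens(sample, 2).each do |word, count, pct|
  puts "#{word}: #{count} (#{pct}%)"
end

# With the real file: top_tokens(File.read("./token_list.json"))
```

Passing the JSON text in, rather than a file path, keeps the helper easy to try on a small string before pointing it at the full output.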
[1] - https://www.kaggle.com/kaggle/hillary-clinton-emails
[2] - https://www.kaggle.com
[3] - https://gist.githubusercontent.com/revett/1cf46e92b02c8797e3bc/raw/formatted_token_list.json
[4] - https://opensource.org/licenses/MIT
[5] - https://twitter.com/charlierevett