@rtanglao
Created September 6, 2011 04:18
generate2GramsAnd3GramsInRuby
# from http://markwatson.com/aiblog/2007/06/n-gram-analysis-using-ruby.html
require 'sanitize'
a="This is the SECOND time I have asked the same question over the last couple of weeks and not had any reply!!
I cannot delete messages from ALL Mail; they just keep coming back into the folder.
I have tried compacting and it makes no difference. Sometimes the messages do delete for a couple days but always come back.
If I do not get a reply soon then I will unistall and get rid of Thunderbird"
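# Hash.new(0) makes every missing key start at a count of zero.
bi_grams = Hash.new(0)
tri_grams = Hash.new(0)

# Strip HTML markup with Sanitize.clean (from the sanitize gem),
# lowercase the text, and split it into word tokens.
def words(text)
  Sanitize.clean(text).downcase.scan(/\w+/)
end

$words = words(a)
# One iteration per tri-gram. The loop therefore stops one short for
# bi-grams, so the final bi-gram ("of thunderbird") is never counted.
num = $words.length - 2
num.times do |i|
  bi = $words[i] + ' ' + $words[i+1]
  tri = bi + ' ' + $words[i+2]
  bi_grams[bi] += 1
  tri_grams[tri] += 1
end

# Sort by count, highest first, and print the top (num / 10) entries.
puts "bi-grams:"
bb = bi_grams.sort{|a,b| b[1] <=> a[1]}
(num / 10).times {|i| puts "#{bb[i][0]} : #{bb[i][1]}"}
puts "tri-grams:"
tt = tri_grams.sort{|a,b| b[1] <=> a[1]}
(num / 10).times {|i| puts "#{tt[i][0]} : #{tt[i][1]}"}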
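The script above builds bi-grams and tri-grams in one index loop sized for tri-grams, which leaves out the last bi-gram. As a possible alternative, here is a minimal sketch that counts n-grams of any size with `Enumerable#each_cons`. The helper name `ngram_counts` and the sample sentence are hypothetical, and the tokenizer is plain Ruby: it skips the `Sanitize.clean` step, so it assumes the input contains no HTML.

```ruby
# Hypothetical helper, not part of the original gist:
# count the n-grams of a given size in a piece of text.
def ngram_counts(text, n)
  counts = Hash.new(0)
  # Plain-Ruby tokenizer (assumption: the input has no HTML to strip).
  tokens = text.downcase.scan(/\w+/)
  # each_cons(n) yields every run of n consecutive tokens,
  # including the run that ends on the final token.
  tokens.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

sample = "I have tried compacting. I have asked before."
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
puts bigrams["i have"]   # prints 2
```

Calling the helper once per n-gram size keeps each count independent, so the bi-gram total is always one higher than the tri-gram total for the same text.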
bi-grams:
i have : 2
folder i : 1
this is : 1
question over : 1
same question : 1
back if : 1
not get : 1
tri-grams:
always come back : 1
i have asked : 1
all mail they : 1
do not get : 1
the same question : 1
the messages do : 1
makes no difference : 1
>> bb = bi_grams.sort{|a,b| b[1] <=> a[1]}
=> [["i have", 2], ["folder i", 1], ["this is", 1], ["question over", 1],
["same question", 1], ["back if", 1], ["not get", 1], ["get rid", 1],
["it makes", 1], ["but always", 1], ["a reply", 1], ["do not", 1],
["no difference", 1], ["couple of", 1], ["keep coming", 1],
["had any", 1], ["just keep", 1], ["weeks and", 1], ["coming back", 1],
["they just", 1], ["the messages", 1], ["not had", 1], ["reply i", 1],
["delete messages", 1], ["get a", 1], ["messages do", 1],
["delete for", 1], ["if i", 1], ["second time", 1], ["any reply", 1],
["have asked", 1], ["then i", 1], ["mail they", 1], ["time i", 1],
["from all", 1], ["cannot delete", 1], ["have tried", 1], ["i cannot", 1],
["over the", 1], ["unistall and", 1], ["the same", 1], ["makes no", 1],
["is the", 1], ["compacting and", 1], ["and it", 1], ["and get", 1],
["the last", 1], ["always come", 1], ["all mail", 1], ["soon then", 1],
["tried compacting", 1], ["couple days", 1], ["rid of", 1],
["days but", 1], ["come back", 1], ["do delete", 1], ["and not", 1],
["the folder", 1], ["back into", 1], ["last couple", 1],
["will unistall", 1], ["asked the", 1], ["difference sometimes", 1],
["sometimes the", 1], ["for a", 1], ["i will", 1], ["messages from", 1],
["a couple", 1], ["of weeks", 1], ["the second", 1], ["into the", 1],
["reply soon", 1], ["i do", 1]]
>> (num / 10).times {|i| puts "#{bb[i][0]} : #{bb[i][1]}"}
i have : 2
folder i : 1
this is : 1
question over : 1
same question : 1
back if : 1
not get : 1
=> 7
>> puts "tri-grams:"
tri-grams:
=> nil
>> tt = tri_grams.sort{|a,b| b[1] <=> a[1]}
=> [["always come back", 1], ["i have asked", 1], ["all mail they", 1],
["do not get", 1], ["the same question", 1], ["the messages do", 1],
["makes no difference", 1], ["they just keep", 1],
["difference sometimes the", 1], ["a couple days", 1], ["the second time", 1],
["get a reply", 1], ["into the folder", 1], ["not get a", 1], ["a reply soon", 1],
["is the second", 1], ["if i do", 1], ["couple days but", 1],
["cannot delete messages", 1], ["come back if", 1],
["have tried compacting", 1], ["back into the", 1],
["no difference sometimes", 1], ["compacting and it", 1],
["this is the", 1], ["then i will", 1], ["i have tried", 1],
["i will unistall", 1], ["do delete for", 1], ["couple of weeks", 1],
["and get rid", 1], ["sometimes the messages", 1], ["time i have", 1],
["same question over", 1], ["back if i", 1], ["the last couple", 1],
["second time i", 1], ["but always come", 1], ["just keep coming", 1],
["will unistall and", 1], ["of weeks and", 1], ["unistall and get", 1],
["not had any", 1], ["from all mail", 1], ["reply soon then", 1],
["had any reply", 1], ["have asked the", 1], ["and not had", 1],
["messages from all", 1], ["get rid of", 1], ["asked the same", 1],
["reply i cannot", 1], ["mail they just", 1], ["question over the", 1],
["delete messages from", 1], ["soon then i", 1], ["weeks and not", 1],
["keep coming back", 1], ["the folder i", 1], ["tried compacting and", 1],
["messages do delete", 1], ["delete for a", 1], ["days but always", 1],
["i cannot delete", 1], ["coming back into", 1], ["folder i have", 1],
["and it makes", 1], ["over the last", 1], ["rid of thunderbird", 1],
["it makes no", 1], ["last couple of", 1], ["i do not", 1],
["any reply i", 1], ["for a couple", 1]]
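In the sorted listings above, almost every entry has a count of 1, and the comparison block only looks at the count, so the order among ties is whatever the sort happens to produce. If a repeatable listing is wanted, one option is to add the n-gram itself as a secondary sort key. The sketch below assumes a hypothetical helper `top_ngrams` and a small hand-made `counts` hash; neither comes from the gist.

```ruby
# Hypothetical helper, not part of the original gist:
# return the n most frequent entries, breaking ties alphabetically.
def top_ngrams(counts, n)
  # Negating the count sorts high counts first; the gram string
  # orders entries that share the same count.
  counts.sort_by { |gram, count| [-count, gram] }.first(n)
end

counts = { "this is" => 1, "i have" => 2, "folder i" => 1 }
top_ngrams(counts, 2).each { |gram, count| puts "#{gram} : #{count}" }
# prints:
#   i have : 2
#   folder i : 1
```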