Created September 6, 2011 04:18
generate2GramsAnd3GramsInRuby
# from http://markwatson.com/aiblog/2007/06/n-gram-analysis-using-ruby.html
require 'sanitize'
a="This is the SECOND time I have asked the same question over the last couple of weeks and not had any reply!!
I cannot delete messages from ALL Mail; they just keep coming back into the folder.
I have tried compacting and it makes no difference. Sometimes the messages do delete for a couple days but always come back.
If I do not get a reply soon then I will unistall and get rid of Thunderbird"
bi_grams = Hash.new(0)
tri_grams = Hash.new(0)
# Strip HTML, lowercase, and split into word tokens.
# Newer Sanitize releases expose this as Sanitize.fragment; Sanitize.clean is the older name.
def words text
  Sanitize.clean(text).downcase.scan(/\w+/)
end
$words = words(a)
# num is the number of tri-grams. Because the same loop builds the bi-grams,
# the final bi-gram ("of thunderbird") is never counted.
num = $words.length - 2
num.times {|i|
  bi = $words[i] + ' ' + $words[i+1]
  tri = bi + ' ' + $words[i+2]
  bi_grams[bi] += 1
  tri_grams[tri] += 1
}
# Print the first num/10 entries of each table, highest count first.
puts "bi-grams:"
bb = bi_grams.sort{|a,b| b[1] <=> a[1]}
(num / 10).times {|i| puts "#{bb[i][0]} : #{bb[i][1]}"}
puts "tri-grams:"
tt = tri_grams.sort{|a,b| b[1] <=> a[1]}
(num / 10).times {|i| puts "#{tt[i][0]} : #{tt[i][1]}"}
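The loop above stops two words before the end so that the tri-gram index stays in range, which also means the last bi-gram of the text is never counted. A minimal sketch of one way around that, using `Enumerable#each_cons` to slide a window of any size over the tokens. The helper names `tokenize` and `ngram_counts` are hypothetical and not part of the gist, and the HTML sanitizing step is left out here so the sketch runs without gems.

```ruby
# Hypothetical helpers for illustration; these names are not in the original gist.
# HTML sanitizing is skipped (assumption: the input is already plain text).
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# Count every run of n consecutive tokens, including the run that ends on the last word.
def ngram_counts(tokens, n)
  counts = Hash.new(0)
  tokens.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

tokens = tokenize("I have tried compacting. I have asked before.")

puts "bi-grams:"
ngram_counts(tokens, 2).each { |gram, count| puts "#{gram} : #{count}" }
puts "tri-grams:"
ngram_counts(tokens, 3).each { |gram, count| puts "#{gram} : #{count}" }
```

With this shape the bi-gram and tri-gram tables are built independently, so each one covers the full text, and the same helper would also work for 4-grams and up.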
bi-grams:
i have : 2
folder i : 1
this is : 1
question over : 1
same question : 1
back if : 1
not get : 1
tri-grams:
always come back : 1
i have asked : 1
all mail they : 1
do not get : 1
the same question : 1
the messages do : 1
makes no difference : 1
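Nearly every entry in this output ties at a count of 1. A `sort` block that compares only the counts does not promise any particular order among ties, so which entries land in the printed top slice may differ between runs or Ruby versions. A small sketch, assuming alphabetical order is an acceptable tie-breaker, that makes the ranking repeatable. The method name `rank` and the sample hash are made up for illustration.

```ruby
# Hypothetical helper, not part of the gist: sort by descending count,
# then alphabetically so that ties always come out in the same order.
def rank(counts)
  counts.sort_by { |gram, count| [-count, gram] }
end

sample = { "this is" => 1, "i have" => 2, "folder i" => 1, "back if" => 1 }
rank(sample).each { |gram, count| puts "#{gram} : #{count}" }
```

Sorting on the pair `[-count, gram]` keeps the most frequent n-grams first while giving tied entries a fixed, predictable position.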
>> bb = bi_grams.sort{|a,b| b[1] <=> a[1]}
=> [["i have", 2], ["folder i", 1], ["this is", 1], ["question over", 1],
["same question", 1], ["back if", 1], ["not get", 1], ["get rid", 1],
["it makes", 1], ["but always", 1], ["a reply", 1], ["do not", 1],
["no difference", 1], ["couple of", 1], ["keep coming", 1],
["had any", 1], ["just keep", 1], ["weeks and", 1], ["coming back", 1],
["they just", 1], ["the messages", 1], ["not had", 1], ["reply i", 1],
["delete messages", 1], ["get a", 1], ["messages do", 1],
["delete for", 1], ["if i", 1], ["second time", 1], ["any reply", 1],
["have asked", 1], ["then i", 1], ["mail they", 1], ["time i", 1],
["from all", 1], ["cannot delete", 1], ["have tried", 1], ["i cannot", 1],
["over the", 1], ["unistall and", 1], ["the same", 1], ["makes no", 1],
["is the", 1], ["compacting and", 1], ["and it", 1], ["and get", 1],
["the last", 1], ["always come", 1], ["all mail", 1], ["soon then", 1],
["tried compacting", 1], ["couple days", 1], ["rid of", 1],
["days but", 1], ["come back", 1], ["do delete", 1], ["and not", 1],
["the folder", 1], ["back into", 1], ["last couple", 1],
["will unistall", 1], ["asked the", 1], ["difference sometimes", 1],
["sometimes the", 1], ["for a", 1], ["i will", 1], ["messages from", 1],
["a couple", 1], ["of weeks", 1], ["the second", 1], ["into the", 1],
["reply soon", 1], ["i do", 1]]
>> (num / 10).times {|i| puts "#{bb[i][0]} : #{bb[i][1]}"}
i have : 2
folder i : 1
this is : 1
question over : 1
same question : 1
back if : 1
not get : 1
=> 7
>> puts "tri-grams:"
tri-grams:
=> nil
>> tt = tri_grams.sort{|a,b| b[1] <=> a[1]}
=> [["always come back", 1], ["i have asked", 1], ["all mail they", 1],
["do not get", 1], ["the same question", 1], ["the messages do", 1],
["makes no difference", 1], ["they just keep", 1],
["difference sometimes the", 1], ["a couple days", 1], ["the second time", 1],
["get a reply", 1], ["into the folder", 1], ["not get a", 1], ["a reply soon", 1],
["is the second", 1], ["if i do", 1], ["couple days but", 1],
["cannot delete messages", 1], ["come back if", 1],
["have tried compacting", 1], ["back into the", 1],
["no difference sometimes", 1], ["compacting and it", 1],
["this is the", 1], ["then i will", 1], ["i have tried", 1],
["i will unistall", 1], ["do delete for", 1], ["couple of weeks", 1],
["and get rid", 1], ["sometimes the messages", 1], ["time i have", 1],
["same question over", 1], ["back if i", 1], ["the last couple", 1],
["second time i", 1], ["but always come", 1], ["just keep coming", 1],
["will unistall and", 1], ["of weeks and", 1], ["unistall and get", 1],
["not had any", 1], ["from all mail", 1], ["reply soon then", 1],
["had any reply", 1], ["have asked the", 1], ["and not had", 1],
["messages from all", 1], ["get rid of", 1], ["asked the same", 1],
["reply i cannot", 1], ["mail they just", 1], ["question over the", 1],
["delete messages from", 1], ["soon then i", 1], ["weeks and not", 1],
["keep coming back", 1], ["the folder i", 1], ["tried compacting and", 1],
["messages do delete", 1], ["delete for a", 1], ["days but always", 1],
["i cannot delete", 1], ["coming back into", 1], ["folder i have", 1],
["and it makes", 1], ["over the last", 1], ["rid of thunderbird", 1],
["it makes no", 1], ["last couple of", 1], ["i do not", 1],
["any reply i", 1], ["for a couple", 1]]