Created
July 24, 2010 16:41
# Implementation inspired by "Simple simhashing" by Ryan Moulton
# http://knol.google.com/k/simple-simhashing

# For simhashing we split the sentence into word n-grams and hash each one.
# The n-gram size can be changed, but 2 gave good results in my tests.
def simhashing sentence, ngram_size = 2
  terms = sentence.downcase.split " "
  hashes = []
  # One hash per window of ngram_size consecutive words
  (terms.size - ngram_size + 1).times do |i|
    hashes << terms[i, ngram_size].join(" ").hash
  end
  hashes
end

# Once we have the simhashing we can compare two sentences by comparing
# their sets of hashes: size of the intersection / size of the union.
def similarity sentence1, sentence2, ngram_size = 2
  hash1 = simhashing sentence1, ngram_size
  hash2 = simhashing sentence2, ngram_size
  union = hash1 | hash2
  # Both sentences shorter than ngram_size: avoid 0 / 0.0 (NaN)
  return 0.0 if union.empty?
  (hash1 & hash2).size / union.size.to_f
end

### Usage
puts similarity("This is a sentence", "this is another sentence") # prints 0.2
puts simhashing("This is a sentence") # one hash per line; values differ between Ruby processes
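
The comparison above can be traced by hand. The following is a minimal sketch, not part of the original gist: `ngrams` is a hypothetical helper that returns the word n-grams themselves rather than their hashes, so the intersection and union used by `similarity` can be inspected directly.

```ruby
# Hypothetical helper (an assumption, not in the original gist): same
# splitting as simhashing, but keeps the n-gram strings instead of hashing.
def ngrams(sentence, ngram_size = 2)
  sentence.downcase.split(" ").each_cons(ngram_size).map { |words| words.join(" ") }
end

a = ngrams("This is a sentence")
b = ngrams("this is another sentence")

shared   = a & b # n-grams present in both sentences
combined = a | b # distinct n-grams across both sentences
ratio    = shared.size / combined.size.to_f

p a
p b
p shared
puts ratio
```

The two sentences share only the bigram "this is" out of five distinct bigrams, which is where the 0.2 in the usage example comes from.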