# Implementation inspired by "Simple simhashing" by Ryan Moulton
# http://knol.google.com/k/simple-simhashing
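# To simhash a sentence we split it into word n-grams and hash each one.
# Other n-gram sizes may be worth trying, but 2 gave good results in my tests.
# Note: String#hash is seeded per Ruby process, so the values are only
# comparable within a single run.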
def simhashing(sentence, ngram_size = 2)
  terms = sentence.downcase.split
  # Slide a window of ngram_size words over the sentence and hash each window
  terms.each_cons(ngram_size).map { |gram| gram.join(" ").hash }
end
# Once we have the hashes of two sentences we can compare them with the
# Jaccard index: size of the intersection / size of the union
def similarity(sentence1, sentence2, ngram_size = 2)
  hash1 = simhashing(sentence1, ngram_size)
  hash2 = simhashing(sentence2, ngram_size)
  union = hash1 | hash2
  # Both sentences shorter than ngram_size: avoid 0 / 0.0 (NaN)
  return 0.0 if union.empty?
  (hash1 & hash2).size / union.size.to_f
end
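# The comparison above can be followed by hand if the n-grams are kept as
# readable strings instead of hashes. The sketch below is only an
# illustration: word_ngrams and jaccard are hypothetical helper names that
# are not part of the original code.

```ruby
# Hypothetical helper: same sliding window as simhashing, but the n-grams
# are returned as strings rather than hashed.
def word_ngrams(sentence, ngram_size = 2)
  sentence.downcase.split.each_cons(ngram_size).map { |gram| gram.join(" ") }
end

# Hypothetical helper: Jaccard index of two arrays treated as sets.
def jaccard(a, b)
  union = a | b
  return 0.0 if union.empty?
  (a & b).size / union.size.to_f
end

first  = word_ngrams("This is a sentence")
second = word_ngrams("this is another sentence")

p first                   # ["this is", "is a", "a sentence"]
p second                  # ["this is", "is another", "another sentence"]
p first & second          # ["this is"]
p jaccard(first, second)  # 0.2 (1 shared bigram out of 5 distinct ones)
```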
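### Usage
# The sentences share 1 of 5 distinct bigrams, so this prints 0.2
puts similarity("This is a sentence", "this is another sentence")
# Prints three integers, one per bigram; the values change between runs
puts simhashing("This is a sentence")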
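# Because String#hash differs between Ruby processes, hashes that need to be
# stored and compared later require a deterministic function. A minimal
# sketch, assuming a CRC32 checksum is good enough for the purpose;
# stable_simhashing is a hypothetical name, not part of the original code.

```ruby
require "zlib"

# Hypothetical variant of simhashing: hashes each word n-gram with
# Zlib.crc32, which returns the same integer for the same string in
# every Ruby process.
def stable_simhashing(sentence, ngram_size = 2)
  sentence.downcase.split.each_cons(ngram_size).map do |gram|
    Zlib.crc32(gram.join(" "))
  end
end

puts stable_simhashing("This is a sentence")
```

# CRC32 only yields 32-bit values, so collisions are more likely than with
# String#hash; a digest from the standard "digest" library is an alternative
# if that matters.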