Last active
August 31, 2018 09:09
-
-
Save crashtech/92acde79829dd696e17b36fa5ee76542 to your computer and use it in GitHub Desktop.
Similarity algorithms
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def cosine_distance(this, that) | |
return 1.0 if this === that | |
a_vector = Hash.new(0).tap{ |v| this.each{ |e| v[e] += 1 } } | |
b_vector = Hash.new(0).tap{ |v| that.each{ |e| v[e] += 1 } } | |
a_mag = 0.0 | |
product = a_vector.inject(0.0) do |total, (key, value)| | |
a_mag += value ** 2 | |
total += value * b_vector[key] | |
end | |
b_mag = b_vector.values.each_with_object(2).map(&:**).sum | |
product / (Math.sqrt(a_mag) * Math.sqrt(b_mag)) | |
end |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# http://www.informit.com/articles/article.aspx?p=683059&seqNum=36 | |
def levenshtein_distance(this, that, ins = 2, del = 2, sub = 1) | |
# ins, del, sub are weighted costs | |
return nil if this.nil? | |
return nil if that.nil? | |
dm = [] # distance matrix | |
# Initialize first row values | |
dm[0] = (0..this.length).collect {|i| i * ins } | |
fill = [0] * (this.length - 1) | |
# Initialize first column values | |
(1..that.length).each do |i| | |
dm[i] = [i * del, fill.flatten] | |
end | |
# populate matrix | |
(1..that.length).each do |i| | |
(1..this.length).each do |j| | |
# critical comparison | |
dm[i][j] = [ | |
dm[i - 1][j - 1] + (this[j - 1] == that[i - 1] ? 0 : sub), | |
dm[i][j - 1] + ins, | |
dm[i - 1][j] + del | |
].min | |
end | |
end | |
# The last value in matrix is the Levenshtein distance between the strings | |
dm[that.length][this.length] | |
end | |
# Get the worst possible value | |
def levenshtein_maximum(expected, options, ins = 2, del = 2, sub = 1) | |
diff = options.size - expected.size | |
adding = expected.size * ins | |
removing = (diff * del) + (expected.size * sub) | |
adding >= removing ? adding : removing | |
end | |
# Get how precise that is when comparing it with this given the list of options | |
def levenshtein_precision(this, that, options) | |
return 0.0 if that.length == 0 && this.length > 0 | |
maximum = levenshtein_maximum(this, options) | |
(maximum - levenshtein_distance(this, that)).fdiv(maximum) | |
end |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment