Last active December 27, 2015 04:09
Gist: andykingking/7264908
Rough implementation of the Sørensen index of two strings
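# Sørensen index of two strings, computed over their character bigrams:
# 2 * (shared bigrams) / (bigrams in a + bigrams in b).
# Note: Array#& de-duplicates, while the counts below do not, so strings
# that repeat a bigram score lower than a purely set-based version would.
def sørensen_index(string_a, string_b)
  matches_a = get_bigrams string_a.dup
  matches_b = get_bigrams string_b.dup
  similarities = matches_a & matches_b
  sum_bigrams = matches_a.count + matches_b.count
  2 * similarities.count / sum_bigrams.to_f
end

# Returns every overlapping two-character slice of str.
# Consumes str one character at a time, which is why callers pass a dup.
def get_bigrams(str)
  bigrams = []
  while str.length > 1 do
    bigrams << str[0..1]
    str[0] = ''
  end
  bigrams
end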
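A note on the function above: `Array#&` drops duplicates while `count` does not, so two identical strings that repeat a bigram (for example `'aaaa'`) score below 1.0. The following is a minimal sketch of a set-based variant, not the gist author's code. The names `bigram_set` and `dice_coefficient` are hypothetical, and returning 0.0 for two strings without any bigrams is an assumption made here.

```ruby
require 'set'

# Hypothetical helper: the distinct two-character slices of str.
def bigram_set(str)
  str.each_char.each_cons(2).map(&:join).to_set
end

# Set-based Sørensen index: 2 * |A ∩ B| / (|A| + |B|).
def dice_coefficient(string_a, string_b)
  a = bigram_set(string_a)
  b = bigram_set(string_b)
  # Assumption: two strings with no bigrams at all are scored 0.0
  # instead of dividing zero by zero.
  return 0.0 if a.empty? && b.empty?
  2.0 * (a & b).size / (a.size + b.size)
end

puts dice_coefficient('night', 'nacht') # shares only "ht": 2 * 1 / 8 = 0.25
puts dice_coefficient('aaaa', 'aaaa')   # one distinct bigram on each side: 1.0
```

Working on sets keeps the numerator and the denominator consistent, at the cost of ignoring how often a bigram repeats.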
class SorensenIndex
  def initialize(*strings)
    @bigram_sets = strings.map {|s| BigramSet.new s}
  end

  def similarities
    @bigram_sets.first & @bigram_sets.last
  end

  def total
    @bigram_sets.first.count + @bigram_sets.last.count
  end

  def calculate
    2 * similarities.count / total.to_f
  end
end

class BigramSet
  attr_reader :bigrams

  def initialize(string)
    get_bigrams string.dup
  end

  def &(alt_set)
    @bigrams & alt_set.bigrams
  end

  def count
    @bigrams.count
  end

  private

  def get_bigrams(str)
    @bigrams = []
    while str.length > 1 do
      @bigrams << str[0..1]
      str[0] = ''
    end
  end
end
# Reopens String so that Enumerable's each_cons can walk its characters.
class String
  include Enumerable
  alias_method :each, :each_char

  # Overlapping character pairs, e.g. "abc".bigrams => [["a", "b"], ["b", "c"]]
  def bigrams
    self.each_cons(2).to_a
  end
end

class SorensenIndex
  def initialize(*strings)
    @bigram_sets = strings.map {|s| s.bigrams}
  end

  def calculate
    2 * similarities.count / total.to_f
  end

  class << self
    def calculate(*strings)
      # Splat the array so initialize receives each string as its own
      # argument; passing it unsplatted would call #bigrams on an Array.
      SorensenIndex.new(*strings).calculate
    end
  end

  private

  def similarities
    @bigram_sets.first & @bigram_sets.last
  end

  def total
    @bigram_sets.first.count + @bigram_sets.last.count
  end
end
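The last version reopens `String` and mixes in `Enumerable`, which affects every string in the program. The sketch below shows one possible alternative that keeps the same class shape but leaves `String` untouched. It is an illustration, not the author's code; the class name `BigramIndex` is hypothetical and chosen so it does not clash with `SorensenIndex`.

```ruby
# Sketch only: same structure as SorensenIndex, without patching String.
class BigramIndex
  # Class-level shortcut. The splat passes each string to initialize
  # as a separate argument.
  def self.calculate(*strings)
    new(*strings).calculate
  end

  def initialize(*strings)
    # each_char returns an Enumerator, so each_cons works without
    # including Enumerable into String.
    @bigram_sets = strings.map { |s| s.each_char.each_cons(2).to_a }
  end

  def calculate
    2 * similarities.count / total.to_f
  end

  private

  # Array#& keeps only the distinct pairs present in both lists.
  def similarities
    @bigram_sets.first & @bigram_sets.last
  end

  def total
    @bigram_sets.first.count + @bigram_sets.last.count
  end
end

puts BigramIndex.calculate('night', 'nacht') # shares only ["h", "t"]: 0.25
```

As in the gist, only the first and last strings are compared, so passing more than two arguments silently ignores the ones in between.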