Last active
May 26, 2023 23:12
-
-
Save tomstuart/9e4fd5cd96527debf7a685d0b5399635 to your computer and use it in GitHub Desktop.
Approximate substring matching with Levenshtein distance
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
module Levenshtein | |
def self.substring_distance(needle, haystack) | |
distances = Array.new(haystack.length.succ, 0) | |
needle.each_char.with_index do |needle_char, needle_index| | |
next_distances = [needle_index.succ] | |
haystack.each_char.with_index do |haystack_char, haystack_index| | |
deletion, insertion, substitution = | |
distances[haystack_index.succ].succ, | |
next_distances[haystack_index].succ, | |
distances[haystack_index] + (needle_char == haystack_char ? 0 : 1) | |
next_distances.push([deletion, insertion, substitution].min) | |
end | |
distances = next_distances | |
end | |
distances.min | |
end | |
end |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require 'levenshtein' | |
RSpec.describe 'Levenshtein substring edit distance' do | |
[ | |
'Machu Picchu', 'Machu Picchu', 0, | |
'Machu Picchu', 'Where is Machu Picchu?', 0, | |
'Machu Picchu', 'Where is Machu Pichu?', 1, | |
'Machu Picchu', 'Where is Macchu Picchu?', 1, | |
'Stonehenge', 'How do I get to Stone Henge?', 2, | |
'', '', 0, | |
'', 'Machu Picchu', 0, | |
'u', 'Machu Picchu', 0, | |
'z', 'Machu Picchu', 1, | |
'Uluru', '', 5 | |
].each_slice(3) do |needle, haystack, distance| | |
specify "the shortest distance between “#{needle}” and part of “#{haystack}” is #{distance}" do | |
expect(Levenshtein.substring_distance(needle, haystack)).to eq distance | |
end | |
end | |
end |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Based on this information from https://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms: