Skip to content

Instantly share code, notes, and snippets.

@santhoshtr
Created January 31, 2012 15:00
Show Gist options
  • Save santhoshtr/1710925 to your computer and use it in GitHub Desktop.
Save santhoshtr/1710925 to your computer and use it in GitHub Desktop.
levenshtein in php, supports multibyte characters
<?php
function levenshtein_php($str1, $str2){
$length1 = mb_strlen( $str1, 'UTF-8');
$length2 = mb_strlen( $str2, 'UTF-8');
if( $length1 < $length2) return levenshtein_php($str2, $str1);
if( $length1 == 0 ) return $length2;
if( $str1 === $str2) return 0;
$prevRow = range( 0, $length2);
$currentRow = array();
for ( $i = 0; $i < $length1; $i++ ) {
$currentRow=array();
$currentRow[0] = $i + 1;
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
for ( $j = 0; $j < $length2; $j++ ) {
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
$insertions = $prevRow[$j+1] + 1;
$deletions = $currentRow[$j] + 1;
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
$currentRow[] = min($insertions, $deletions, $substitutions);
}
$prevRow = $currentRow;
}
return $prevRow[$length2];
}
echo levenshtein_php( 'കട', 'കടല' )."\n";
echo levenshtein_php( 'കട', 'കല' )."\n";
echo levenshtein_php( 'കട', 'കടി' )."\n";
echo levenshtein_php( 'abce', 'abcdf' )."\n";
@marceltud
Copy link

Hi,
Thank you very much for this function. I love it. I might have found one flaw or maybe I'm wrong in my assumptions. I'd be happy to hear from you.

When I try this function with
$string1 = 'încat de counștițe';
$string2= 'încântat';

the levenshtein difference is 13, which in my view is wrong.

The only 2 options that I see are the following changes to $string1:

1)
a) add the characters 'ânt' to reach $string1 = 'încântat de counștițe';
b) delete the characters ' de counștițe' to reach $string1 = 'încântat' ;

which would lead to a difference of 17 changes

2) 
a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
b) delete 't de cou' to reach $string1 = 'încânștițe';
c) delete 'ș' to reach $string1 = 'încântițe';
d) add characters 'at' to reach to reach $string1 = 'încântatițe';
e) remove characters 'ițe' to reach $string1 = 'încântat';

with an levensthein distance of 15

Am I wrong and there is another way that leads to a difference of 13? or can the above function be improved?

Many thanks

@lrotherfield-function
Copy link

@marceltud I think it is possible in 13 changes:

Change a to â -     "încât de counștițe" - 1 total changes
Delete 't de cou' - "încânștițe" - 9 total changes
Delete 'ș' -        "încântițe" - 10 total changes
Change i to a -     "încânștațe" - 11 total changes
Change ț to t -     "încântate" - 12 total changes
Delete 'e' -        "încântat" - 13 total changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment