Created
January 31, 2012 15:00
-
-
Save santhoshtr/1710925 to your computer and use it in GitHub Desktop.
levenshtein in php, supports multibyte characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<?php | |
function levenshtein_php($str1, $str2){ | |
$length1 = mb_strlen( $str1, 'UTF-8'); | |
$length2 = mb_strlen( $str2, 'UTF-8'); | |
if( $length1 < $length2) return levenshtein_php($str2, $str1); | |
if( $length1 == 0 ) return $length2; | |
if( $str1 === $str2) return 0; | |
$prevRow = range( 0, $length2); | |
$currentRow = array(); | |
for ( $i = 0; $i < $length1; $i++ ) { | |
$currentRow=array(); | |
$currentRow[0] = $i + 1; | |
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ; | |
for ( $j = 0; $j < $length2; $j++ ) { | |
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' ); | |
$insertions = $prevRow[$j+1] + 1; | |
$deletions = $currentRow[$j] + 1; | |
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0); | |
$currentRow[] = min($insertions, $deletions, $substitutions); | |
} | |
$prevRow = $currentRow; | |
} | |
return $prevRow[$length2]; | |
} | |
echo levenshtein_php( 'കട', 'കടല' )."\n"; | |
echo levenshtein_php( 'കട', 'കല' )."\n"; | |
echo levenshtein_php( 'കട', 'കടി' )."\n"; | |
echo levenshtein_php( 'abce', 'abcdf' )."\n"; |
@marceltud I think it is possible in 13 changes:
Change a to â - "încât de counștițe" - 1 total changes
Delete 't de cou' - "încânștițe" - 9 total changes
Delete 'ș' - "încântițe" - 10 total changes
Change i to a - "încânștațe" - 11 total changes
Change ț to t - "încântate" - 12 total changes
Delete 'e' - "încântat" - 13 total changes
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hi,
Thank you very much for this function. I love it. I might have found one flaw or maybe I'm wrong in my assumptions. I'd be happy to hear from you.
When I try this function with
$string1 = 'încat de counștițe';
$string2= 'încântat';
the levenshtein difference is 13, which in my view is wrong.
The only 2 options that I see are the following changes to $string1:
which would lead to a difference of 17 changes
with an levensthein distance of 15
Am I wrong and there is another way that leads to a difference of 13? or can the above function be improved?
Many thanks