Created
February 14, 2018 20:26
-
-
Save jianminchen/bcd315bb5fd4c461438f5e45f6c4e178 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Julia likes to study KMP algorithm again using lecture note. What she likes to do is to go over the lecture | |
notes word by word, and then focus on thinking process. | |
She started to review KMP using Chinese blog, but she found out that the blog is not clear, so she likes to | |
follow up with more detail lecture note. | |
She likes to write and also she likes to read good writing of the KMP algorithm. | |
https://www.ics.uci.edu/~eppstein/161/960227.html | |
Knuth-Morris-Pratt string matching | |
The problem: given a (short) pattern and a (long) text, both strings, determine whether the pattern appears | |
somewhere in the text. Last time we saw how to do this with finite automata. This time we'll go through the | |
Knuth-Morris-Pratt (KMP) algorithm, which can be thought of as an efficient way to build these automata. I | |
also have some working C++ source code which might help you understand the algorithm better. | |
First let's look at a naive solution. | |
suppose the text is in an array: char T[n] | |
and the pattern is in another array: char P[m]. | |
One simple method is just to try each possible position the pattern could appear in the text. | |
Naive string matching: | |
for (i=0; T[i] != '\0'; i++) | |
{ | |
for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; | |
if (P[j] == '\0') | |
found a match | |
} | |
There are two nested loops; the inner one takes O(m) iterations and the outer one takes O(n) | |
iterations so the total time is the product, O(mn). This is slow; we'd like to speed it up. | |
In practice this works pretty well -- not usually as bad as this O(mn) worst case analysis. This | |
is because the inner loop usually finds a mismatch quickly and move on to the next position without | |
going through all m steps. But this method still can take O(mn) for some inputs. In one bad | |
example, all characters in T[] are "a"s, and P[] is all "a"'s except for one "b" at the end. | |
Then it takes m comparisons each time to discover that you don't have a match, so mn overall. | |
Here's a more typical example. Each row represents an iteration of the outer loop, with each | |
character in the row representing the result of a comparison (X if the comparison was unequal). | |
Suppose we're looking for pattern "nano" in text "banananobano". | |
0 1 2 3 4 5 6 7 8 9 10 11 | |
T: b a n a n a n o b a n o | |
i=0: X | |
i=1: X | |
i=2: n a n X | |
i=3: X | |
i=4: n a n o | |
i=5: X | |
i=6: n X | |
i=7: X | |
i=8: X | |
i=9: n X | |
i=10: X | |
Some of these comparisons are wasted work! For instance, after iteration i = 2, we know from the | |
comparisons we've done that T[3] = "a", so there is no point comparing it to "n" in iteration i = 3. | |
And we also know that T[4] = "n", so there is no point making the same comparison in iteration i = 4. | |
Skipping outer iterations | |
The Knuth-Morris-Pratt idea is, in this sort of situation, after you've invested a lot of work making | |
comparisons in the inner loop of the code, you know a lot about what's in the text. | |
Specifically, if you've found a partial match of j characters starting at position i, you know what's | |
in positions T[i]...T[i+j-1]. You can use this knowledge to save work in two ways. First, you can skip | |
some iterations for which no match is possible. Try overlapping the partial match you've found with | |
the new match you want to find: | |
i=2: n a n | |
i=3: n a n o | |
Here the two placements of the pattern conflict with each other -- we know from the i = 2 iteration | |
that T[3] and T[4] are "a" and "n", so they can't be the "n" and "a" that the i = 3 iteration is | |
looking for. We can keep skipping positions until we find one that doesn't conflict: | |
i=2: n a n | |
i=4: n a n o | |
Here the two "n"'s coincide. Define the overlap of two strings x and y to be the longest word that's | |
a suffix of x and a prefix of y. Here the overlap of "nan" and "nano" is just "n". (We don't allow | |
the overlap to be all of x or y, so it's not "nan"). In general the value of i we want to skip to | |
is the one corresponding to the largest overlap with the current partial match: | |
String matching with skipped iterations: | |
i=0; | |
while (i<n) | |
{ | |
for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; | |
if (P[j] == '\0') found a match; | |
i = i + max(1, j-overlap(P[0..j-1],P[0..m])); | |
} | |
Skipping inner iterations | |
The other optimization that can be done is to skip some iterations in the inner loop. Let's | |
look at the same example, in which we skipped from i = 2 to i = 4: | |
i=2: n a n | |
i=4: n a n o | |
In this example, the "n" that overlaps has already been tested by the i = 2 iteration. There's | |
no need to test it again in the i = 4 iteration. In general, if we have a nontrivial overlap | |
with the last partial match, we can avoid testing a number of characters equal to the length | |
of the overlap. | |
This change produces (a version of) the KMP algorithm: | |
KMP, version 1: | |
i=0; | |
o=0; | |
while (i<n) | |
{ | |
for (j=o; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; | |
if (P[j] == '\0') found a match; | |
o = overlap(P[0..j-1],P[0..m]); | |
i = i + max(1, j-o); | |
} | |
The only remaining detail is how to compute the overlap function. This is a function only of | |
j, and not of the characters in T[], so we can compute it once in a preprocessing stage before | |
we get to this part of the algorithm. First let's see how fast this algorithm is. | |
KMP time analysis | |
We still have an outer loop and an inner loop, so it looks like the time might still be O(mn). | |
But we can count it a different way to see that it's actually always less than that. The idea | |
is that every time through the inner loop, we do one comparison T[i+j]==P[j]. We can count the | |
total time of the algorithm by counting how many comparisons we perform. | |
We split the comparisons into two groups: those that return true, and those that return false. | |
If a comparison returns true, we've determined the value of T[i+j]. Then in future iterations, | |
as long as there is a nontrivial overlap involving T[i+j], we'll skip past that overlap and | |
not make a comparison with that position again. So each position of T[] is only involved in | |
one true comparison, and there can be n such comparisons total. On the other hand, there is at | |
most one false comparison per iteration of the outer loop, so there can also only be n of those. | |
As a result we see that this part of the KMP algorithm makes at most 2n comparisons and takes | |
time O(n). | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment