Last active
January 4, 2016 19:49
-
-
Save Jangwa/8670289 to your computer and use it in GitHub Desktop.
KMP
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The Naive pattern searching algorithm doesn’t work well in cases where we see many matching characters followed by | |
a mismatching character. Following are some examples. | |
txt[] = "AAAAAAAAAAAAAAAAAB" | |
txt[] = "ABABABCABABABCABABABC" | |
lps[i] = the longest proper prefix of pat[0..i] Searching Algorithm:Preprocessing Algorithm: | |
if (pattern[0] == '\0') /* Construct the lookup table */ /* Perform the search */ free(T); | |
pat[] = "AAAAB" | |
pat[] = "ABABAC" (not a worst case, but a bad case for Naive) | |
which is also a suffix of pat[0..i]. | |
Unlike the Naive algo where we slide the pattern by one, we use a value from lps[] to decide the next sliding position. | |
Let us see how we do that. When we compare pat[j] with txt[i] and see a mismatch, we know that characters pat[0..j-1] match | |
with txt[i-j+1...i-1], and we also know that lps[j-1] characters of pat[0...j-1] are both proper prefix and suffix which | |
means we do not need to match these lps[j-1] characters with txt[i-j...i-1] because we know that these characters will | |
anyway match. See KMPSearch() in the below code for details. In the preprocessing part, we calculate values in lps[]. To do | |
that, we keep track of the length of the longest prefix suffix value (we use len variable for this purpose) for the previous | |
The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the | |
pattern) of the pattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever | |
we detect a mismatch (after some matches), we already know some of the characters in the text (since they matched the pattern | |
characters prior to the mismatch). We take advantage of this information to avoid matching the characters that we know will | |
anyway match. KMP algorithm does some preprocessing over the pattern pat[] and constructs an auxiliary array lps[] of size m | |
(same as size of pattern). For each sub-pattern pat[0…i] where i = 0 to m-1, it calculates the length of maximum matching | |
proper prefix which is also a suffix of the sub-pattern pat[0..i] and stores the value in lps[i]. | |
Examples: | |
For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5] | |
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0] | |
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4] | |
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3] | |
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] | |
------------------------------------------------------------------------------------------------------ | |
#define MAX_N 100010 | |
char T[MAX_N], P[MAX_N]; // T = text, P = pattern | |
int b[MAX_N], n, m; // b = back table, n = length of T, m = length of P | |
void kmpPreprocess() | |
{ | |
// call this before calling kmpSearch() | |
int i = 0, j = -1; b[0] = -1; // starting values | |
while (i < m) | |
{ // pre-process the pattern string P | |
while (j >= 0 && P[i] != P[j]) | |
j = b[j]; // if different, reset j using b | |
i++;j++; // if same, advance both pointers | |
b[i] = j; | |
} | |
} | |
void kmpSearch() | |
{ // this is similar as kmpPreprocess(), but on string T | |
int i = 0, j = 0; // starting values | |
while (i < n) | |
{ // search through string T | |
while (j >= 0 && T[i] != P[j]) | |
j = b[j]; // if different, reset j using b | |
i++; j++; // if same, advance both pointers | |
if (j == m) | |
{ // a match found when j == m | |
printf("P is found at index %d in T\n", i - j); | |
j = b[j]; // prepare j for the next possible match | |
} | |
} | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment