Skip to content

Instantly share code, notes, and snippets.

@Jangwa
Last active January 4, 2016 19:49
Show Gist options
  • Save Jangwa/8670289 to your computer and use it in GitHub Desktop.
Save Jangwa/8670289 to your computer and use it in GitHub Desktop.
KMP
The Naive pattern searching algorithm doesn’t work well in cases where we see many matching characters followed by
a mismatching character. Following are some examples.
txt[] = "AAAAAAAAAAAAAAAAAB"
txt[] = "ABABABCABABABCABABABC"
lps[i] = the longest proper prefix of pat[0..i] Searching Algorithm:Preprocessing Algorithm:
if (pattern[0] == '\0') /* Construct the lookup table */ /* Perform the search */ free(T);
pat[] = "AAAAB"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
which is also a suffix of pat[0..i].
Unlike the Naive algo where we slide the pattern by one, we use a value from lps[] to decide the next sliding position.
Let us see how we do that. When we compare pat[j] with txt[i] and see a mismatch, we know that characters pat[0..j-1] match
with txt[i-j+1...i-1], and we also know that lps[j-1] characters of pat[0...j-1] are both proper prefix and suffix which
means we do not need to match these lps[j-1] characters with txt[i-j...i-1] because we know that these characters will
anyway match. See KMPSearch() in the below code for details. In the preprocessing part, we calculate values in lps[]. To do
that, we keep track of the length of the longest prefix suffix value (we use len variable for this purpose) for the previous
The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the
pattern) of the pattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever
we detect a mismatch (after some matches), we already know some of the characters in the text (since they matched the pattern
characters prior to the mismatch). We take advantage of this information to avoid matching the characters that we know will
anyway match. KMP algorithm does some preprocessing over the pattern pat[] and constructs an auxiliary array lps[] of size m
(same as size of pattern). For each sub-pattern pat[0…i] where i = 0 to m-1, it calculates the length of maximum matching
proper prefix which is also a suffix of the sub-pattern pat[0..i] and stores the value in lps[i].
Examples:
For the pattern “AABAACAABAA”, lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
------------------------------------------------------------------------------------------------------
#define MAX_N 100010
char T[MAX_N], P[MAX_N]; // T = text, P = pattern
int b[MAX_N], n, m; // b = back table, n = length of T, m = length of P
void kmpPreprocess()
{
// call this before calling kmpSearch()
int i = 0, j = -1; b[0] = -1; // starting values
while (i < m)
{ // pre-process the pattern string P
while (j >= 0 && P[i] != P[j])
j = b[j]; // if different, reset j using b
i++;j++; // if same, advance both pointers
b[i] = j;
}
}
void kmpSearch()
{ // this is similar as kmpPreprocess(), but on string T
int i = 0, j = 0; // starting values
while (i < n)
{ // search through string T
while (j >= 0 && T[i] != P[j])
j = b[j]; // if different, reset j using b
i++; j++; // if same, advance both pointers
if (j == m)
{ // a match found when j == m
printf("P is found at index %d in T\n", i - j);
j = b[j]; // prepare j for the next possible match
}
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment