Created
May 24, 2013 19:40
-
-
Save Nicksil/5646003 to your computer and use it in GitHub Desktop.
Python - Knuth-Morris-Pratt (KMP) string-matching algorithm
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# File: KnuthMorrisPratt.py | |
# Author: Keith Schwarz ([email protected]) | |
# | |
# An implementation of the Knuth-Morris-Pratt (KMP) string-matching algorithm. | |
# This algorithm takes as input a pattern string P and target string T, then | |
# finds the first occurrence of the string T in the pattern P, doing so in time | |
# O(|P| + |T|). The logic behind the algorithm is not particularly complex, | |
# though getting it to run in linear time requires a few non-obvious tricks. | |
# | |
# To motivate KMP, consider the naive algorithm for trying to match a pattern | |
# string P against a target T. This would work by considering all possible | |
# start positions for the pattern P in the target T, then checking whether a | |
# match exists at each of those positions. For example, to match the string | |
# ABC against the target string ABABABACCABC, we'd get | |
# | |
# ABABABACCABC | |
# ABX (first two characters match, last does not) | |
# X (first character doesn't match) | |
# ABX (first two characters match, last does not) | |
# X (first character doesn't match) | |
# ABX (first two characters match, last does not) | |
# X (first character doesn't match) | |
# AX (first character matches, second doesn't) | |
# X (first character doesn't match) | |
# X (first character doesn't match) | |
# ABC (match found) | |
# | |
# This algorithm runs in O(mn) in the worst case, where m = |T| and n = |P|, | |
# because it has to do O(n) work to check whether the string matches O(m) times | |
# for each spot in the string. | |
# | |
# However, a lot of this is wasted work. For example, in the above example, | |
# consider what happens when we know that the string ABC does not match the | |
# first part of the string, ABA. At this point, it would be silly to actually | |
# try to match the string at the string starting with the B, since there's no | |
# possible way that the string could match there. Instead, it would make more | |
# sense to instead start over and try matching ABC at the next A. In fact, | |
# more generally, if we can use the information we have about what characters | |
# we already matched to determine where we should try to resume the search in | |
# the string, we can avoid revisiting characters multiple times when there's no | |
# hope that they could ever match. | |
# | |
# The idea we'll use is to look for "borders" of a string, which are substrings | |
# that are both a prefix and suffix of the string. For example, the string | |
# "aabcaa" has "aa" as a border, while the string "abc" just has the empty | |
# string as a border. Borders are useful in KMP because they encode | |
# information about where we might need to pick up the search when a particular | |
# match attempt fails. For example, suppose that we want to match ABABC | |
# against the string ABABABC. If we start off by trying to match the string, | |
# we'll find that they overlap like this: | |
# | |
# ABABABC | |
# ABABx | |
# | |
# That is, the first four characters match, but the fifth does not. At this | |
# point, rather than naively restarting the search at the second character (B), | |
# or even restarting it at the third position (A), we can instead note that we | |
# can treat the last two characters of the string we matched (AB) as the first | |
# two characters of the pattern string ABABC if we just treated it instead as | |
# though we had | |
# | |
# ABABABC | |
# ABABx | |
# ABABC | |
# | |
# If we can somehow remember the fact that we already matched the AB at the | |
# start of this string, we could just confirm that the three characters after | |
# it are ABC and be done. There's no need to confirm that the characters at | |
# the front match. | |
# | |
# In order to make this possible, we'll construct a special data structure | |
# called the "fail table." This table stores, for each possible prefix of the | |
# string to match, the length of the longest border of that prefix. That way, | |
# when we find a mismatch, we know where the next possible start location could | |
# be found. In particular, once we have a mismatch, if there's any border of | |
# the prefix of the pattern that we matched so far, then we can treat the end | |
# of that matching prefix as the start of a prefix of the word that occurs | |
# later in the target. | |
# | |
# The basic idea behind KMP is, given this table, to execute the following: | |
# | |
# - Guess that the string starts at the beginning of the target. | |
# - Match as much of the string as possible. | |
# - If the whole string matched, we're done. | |
# - Otherwise, a mismatch was found. Look up the largest border of the | |
# string that was matched so far in the failure table. Suppose it has | |
# length k. | |
# - Update our guess of the start position to be where that border occurs | |
# in the portion matched so far, then repeat this process. | |
# | |
# Notice that once we've matched a character against part of the pattern (or | |
# found that it can't possibly match), we never visit that character again. | |
# This is responsible for the fast runtime of the algorithm (though I'll give a | |
# more formal description later on). | |
# Function: failTable(pattern) | |
# Usage: failTable("This is a string!") | |
# ----------------------------------------------------------------------------- | |
# Given a string, constructs the KMP failure table for that string. The values | |
# in the table are defined as | |
# | |
# table[i] = |LongestProperBoundary(pattern[0:i)])| | |
# | |
# Where the longest proper boundary of a string is the longest proper substring | |
# of that string that is both a prefix and a suffix. For example, given the | |
# string "abcabc," the longest proper boundary is abc. Similarly, given the | |
# string "apple," the longest proper boundary is the empty string. | |
# | |
# As a sample output of this function, given the string "ababcac", the table | |
# would be | |
# | |
# a b a b c a c | |
# * 0 0 1 2 0 1 0 | |
# | |
# This means, for example, that the longest proper boundary of the prefix "aba" | |
# has length 1, while the longest proper boundary of the string as a whole is | |
# the empty string. Notice that the first entry is *, which we have chosen | |
# because there is no mathematically well-defined proper substring of the empty | |
# string. We can put anything we want there, and we'll go with None. | |
# | |
# To compute the values of this table, we use a dynamic programming algorithm | |
# to compute a slightly stronger version of the function. We define the | |
# function "Extended Longest Proper Boundary" (xLPB) as follows: | |
# | |
# xLPB(string, n, char) = The longest proper boundary of string[0:n] + char | |
# | |
# The idea behind this function is that we want to be able to recycle the | |
# values of the longest proper boundary function for smaller prefixes of the | |
# string in order to compute the longest proper boundary for longer prefixes. | |
# To make this easier, the xLPB function allows us to talk about what would | |
# happen if we extended the longest proper boundary of some prefix of the | |
# string by a single character. Notice that for any nonzero n, we have that | |
# | |
# LongestProperBoundary(string[0:n]) = xLPB(string, n - 1, string[n]) | |
# | |
# That is, we simply tear off the last character and use it as the final | |
# argument to xLPB. Given this xLPB function, we can compute its values | |
# recursively using the following logic. As a base case, xLPB(string, 0, char) | |
# is the longest proper boundary of string[0:0] + char = char. But this has | |
# only one proper boundary, the empty string, and so its value must be zero. | |
# | |
# Now suppose that for all n' < n we have the value of xLPB(string, n', char) | |
# for any character char. Suppose we want to go and compute | |
# xLPB(string, n, char). Let's think about what this would mean. Given that | |
# n is not zero, we can think of this problem as trying to find the longest | |
# proper boundary of this string: | |
# | |
# +------------+---+------------+------------+---+ | |
# | LPB | ? | ... | LPB | c | | |
# +------------+---+------------+------------+---+ | |
# | |
# ^ ^ ^ | |
# +----------------------+-------------------+ | | |
# | | | |
# String of length n New character | |
# | |
# The idea is that we have the original string of length n, followed by our new | |
# character char (which we'll abbreviate c). In this diagram, I've marked the | |
# LPB of the string of length n. Notice that right after the LPB at the prefix | |
# of the string, we have some character whose value is unknown (since n != 0 | |
# and the LPB can't be the whole string). If this value is equal to c, then | |
# the LPB of the whole string can be formed by simply extending the LPB of the | |
# first n characters. There can't be a longer proper boundary, since otherwise | |
# we could show that by taking that longer boundary and dropping off the | |
# character c, we'd end up with a longer proper boundary for the first n | |
# characters of the string, contradicting that we chose the longest proper | |
# boundary. | |
# | |
# By our above argument, remember that the length of the longest proper | |
# boundary of the first n characters of the string is given by | |
# | |
# xLPB(string, n - 1, string[n - 1]) | |
# | |
# Thus we have the first part of our recurrence, which is defined as | |
# | |
# xLPB(string, n, char) = | |
# if n = 0, then 0. | |
# let k = xLPB(string, n - 1, string[n - 1]) | |
# if string[k] == char, return k + 1 | |
# else, ??? | |
# | |
# Now, suppose that we find that the character after the LPB does not match. | |
# If this happens, we can then make the following observation. Below I've | |
# reprinted the above diagram: | |
# | |
# +------------+---+------------+------------+---+ | |
# | LPB | ? | ... | LPB | c | | |
# +------------+---+------------+------------+---+ | |
# | |
# ^ ^ ^ | |
# +----------------------+-------------------+ | | |
# | | | |
# String of length n New character | |
# | |
# Notice that any LPB of this new string must be a prefix of the LPB of the | |
# first n characters and a suffix of the LPB followed by the character c. | |
# Since by definition the LPB of the first n characters must be a prefix of | |
# those n characters, we have the following elegant conclusion to our | |
# recurrence: | |
# | |
# xLPB(string, n, char) = | |
# if n = 0, then 0. | |
# let k = xLPB(string, n - 1, string[n - 1]) | |
# if string[k] == char, return k + 1 | |
# else, xLPB(string, k, char) | |
# | |
# The reason for this is that xLPB(string, k, char) asks for the longest | |
# proper boundary of the LPB of the string formed from the first n characters | |
# of the string followed by the character c, which is exactly what we described | |
# above. | |
# | |
# As written, filling in the table of LPB values would take O(n^2) time, where | |
# n is the length of the string. However, using dynamic programming and an | |
# amortized analysis, we can show that this function can be made to run in | |
# O(n) time. In particular, suppose that for all n' < n, we know the value of | |
# LPB(string[0:n]). Then in the above formulation of xLPB, the first | |
# recursive call is known, and the only recursive call we may actually need to | |
# make is the second. | |
# | |
# However, this doesn't seem to say anything about the runtime of the second | |
# recursive call, which seems as though it might cause the evaluation of this | |
# function to run in time O(n). This is correct, but in an *amortized* sense | |
# the whole table can still be computed in O(n) time overall. To see this, | |
# let's define a potential function Phi(k) that associates a potential at each | |
# point of the computation of the table. In particular, define Phi(k) as | |
# | |
# Phi(0) = 0 | |
# Phi(k + 1) = result[k - 1] | |
# | |
# Here, result is the resulting table of LPB values. Because of this, we can | |
# remark that result[k] < k, since the longest proper border of a string can't | |
# be any longer than that string. | |
# | |
# Let's now show that this potential function gives an amortized O(1) cost for | |
# each table entry computation, and thus an O(n) overall runtime for the table- | |
# building algorithm. To see this, consider what happens when the logic to | |
# compute the next value runs. The runtime for this step is bounded by the | |
# number of recursive calls made to a subproblem. However, each subproblem is | |
# then of size given by the LPB of a slightly smaller problem. This subproblem | |
# must then have size at most the size of that smaller subproblem. In other | |
# words, we can say that each recursive call drops the maximum possible value | |
# of the LPB for the current prefix by at least one. Consequently, if k | |
# recursive calls are made, the LPB of the current prefix is at least k smaller | |
# than the LPB of the previous prefix, and so | |
# | |
# D Phi = -k | |
# | |
# And so the amortized cost of computing the next term is 1 + k - k = O(1). | |
def failTable(pattern): | |
# Create the resulting table, which for length zero is None. | |
result = [None] | |
# Iterate across the rest of the characters, filling in the values for the | |
# rest of the table. | |
for i in range(0, len(pattern)): | |
# Keep track of the size of the subproblem we're dealing with, which | |
# starts off using the first i characters of the string. | |
j = i | |
while True: | |
# If j hits zero, the recursion says that the resulting value is | |
# zero since we're looking for the LPB of a single-character | |
# string. | |
if j == 0: | |
result.append(0) | |
break | |
# Otherwise, if the character one step after the LPB matches the | |
# next character in the sequence, then we can extend the LPB by one | |
# character to get an LPB for the whole sequence. | |
if pattern[result[j]] == pattern[i]: | |
result.append(result[j] + 1) | |
break | |
# Finally, if neither of these hold, then we need to reduce the | |
# subproblem to the LPB of the LPB. | |
j = result[j] | |
return result | |
# Function: kmpMatch(needle, haystack) | |
# Usage: print kmpMatch("0101", "0011001011") # Prints 5 | |
# ----------------------------------------------------------------------------- | |
# Uses the KMP algorithm to find an occurrence of the specified needle string | |
# in the haystack string. To do this, we compute the failure table, which | |
# is done above. Next, we iterate across the string, keeping track of a | |
# candidate start point and length matched so far. Whenever a match occurs, we | |
# update the length of the match we've made. On a failure, we update these | |
# values by trying to preserve the maximum proper border of the string we were | |
# able to manage by that point. | |
def kmpMatch(needle, haystack): | |
# Compute the failure table for the needle we're looking up. | |
fail = failTable(needle) | |
# Keep track of the start index and next match position, both of which | |
# start at zero since our candidate match is at the beginning and is trying | |
# to match the first character. | |
index = 0 | |
match = 0 | |
# Loop until we fall off the string or match. | |
while index + match < len(haystack): | |
print index, match | |
# If the current character matches the expected character, then bump up | |
# the match index. | |
if haystack[index + match] == needle[match]: | |
match = match + 1 | |
# If we completely matched everything, we're done. | |
if match == len(needle): | |
return index | |
# Otherwise, we need to look at the fail table to determine what to do | |
# next. | |
else: | |
# If we couldn't match the first character, then just advance the | |
# start index. We need to try again. | |
if match == 0: | |
index = index + 1 | |
# Otherwise, see how much we need to skip forward before we have | |
# another feasible match. | |
else: | |
index = index + match - fail[match] | |
match = fail[match] | |
# If we made it here, then no match was found. | |
return None |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment