Created
March 12, 2012 21:48
-
-
Save icheishvili/2024914 to your computer and use it in GitHub Desktop.
Faster automata-style string-matching
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env python | |
def match(needle, haystack): | |
"""Perform finite-automata style string matching.""" | |
matches = [] | |
nposition = 0 | |
for index, char in enumerate(haystack): | |
if needle[nposition] == char: | |
nposition += 1 | |
if nposition == len(needle): | |
matches.append((index - len(needle) + 1, index + 1)) | |
nposition = 0 | |
elif needle[0] == char: | |
nposition = 1 | |
else: | |
nposition = 0 | |
return matches | |
def main(): | |
"""Run the example program.""" | |
haystack = ''' | |
In computer science, the Boyer-Moore string search algorithm is a | |
particularly efficient string searching algorithm, and it has been | |
the standard benchmark for the practical string search literature. | |
It was developed by Bob Boyer and J Strother Moore in 1977. The | |
algorithm preprocesses the target string (key) that is being | |
searched for, but not the string being searched in (unlike some | |
algorithms that preprocess the string to be searched and can then | |
amortize the expense of the preprocessing by searching repeatedly). | |
The execution time of the Boyer-Moore algorithm, while still linear | |
in the size of the string being searched, can have a significantly | |
lower constant factor than many other search algorithms: it doesn't | |
need to check every character of the string to be searched, but | |
rather skips over some of them. Generally the algorithm gets faster | |
as the key being searched for becomes longer. Its efficiency derives | |
from the fact that with each unsuccessful attempt to find a match | |
between the search string and the text it is searching, it uses the | |
information gained from that attempt to rule out as many positions | |
of the text as possible where the string cannot match. | |
''' | |
needle = 'Boyer' | |
matches = match(needle, haystack) | |
for match_start, match_end in matches: | |
print haystack[match_start:match_end] | |
if __name__ == '__main__': | |
main() |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment