Created
October 14, 2010 17:46
-
-
Save ameerkat/626643 to your computer and use it in GitHub Desktop.
Boyer Moore string search implementation in Python
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Boyer Moore String Search implementation in Python | |
# Ameer Ayoub <[email protected]> | |
# Generate the Bad Character Skip List | |
def generateBadCharShift(term): | |
skipList = {} | |
for i in range(0, len(term)-1): | |
skipList[term[i]] = len(term)-i-1 | |
return skipList | |
# Generate the Good Suffix Skip List | |
def findSuffixPosition(badchar, suffix, full_term): | |
for offset in range(1, len(full_term)+1)[::-1]: | |
flag = True | |
for suffix_index in range(0, len(suffix)): | |
term_index = offset-len(suffix)-1+suffix_index | |
if term_index < 0 or suffix[suffix_index] == full_term[term_index]: | |
pass | |
else: | |
flag = False | |
term_index = offset-len(suffix)-1 | |
if flag and (term_index <= 0 or full_term[term_index-1] != badchar): | |
return len(full_term)-offset+1 | |
def generateSuffixShift(key): | |
skipList = {} | |
buffer = "" | |
for i in range(0, len(key)): | |
skipList[len(buffer)] = findSuffixPosition(key[len(key)-1-i], buffer, key) | |
buffer = key[len(key)-1-i] + buffer | |
return skipList | |
# Actual Search Algorithm | |
def BMSearch(haystack, needle): | |
goodSuffix = generateSuffixShift(needle) | |
badChar = generateBadCharShift(needle) | |
i = 0 | |
while i < len(haystack)-len(needle)+1: | |
j = len(needle) | |
while j > 0 and needle[j-1] == haystack[i+j-1]: | |
j -= 1 | |
if j > 0: | |
badCharShift = badChar.get(haystack[i+j-1], len(needle)) | |
goodSuffixShift = goodSuffix[len(needle)-j] | |
if badCharShift > goodSuffixShift: | |
i += badCharShift | |
else: | |
i += goodSuffixShift | |
else: | |
return i | |
return -1 | |
if __name__ == "__main__": | |
block = "This is a simple example" | |
print "This is an example search on the string \"", block, "\"." | |
print "ple :", BMSearch(block, "ple ") | |
print "example :", BMSearch(block, "example") | |
print "simple :", BMSearch(block, "simple") | |
print " imple :", BMSearch(block, " imple") |
@Abusagit thank you so much for the reply, the links are really good, Thank you for providing us with your implementation also and hopefully this will help others too :)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hello, I have recently started to deploy my efforts in the pattern recognition problem field and wondered if I could implement not only original Boyer-Moore-Horspool algorith, but its tubo-verion. There were some unsuccessful tries because of good suffix heuristic too. Hell yeah, it`s much more complicated to understand.
Thanks to these resources:
Turbo-BMH
Great explanation
Explanation too
Despite all work has been done before me and can be seen on those articles, it was quite hard to understand how it works even with step-by-step revision of code and schemes though.
I used bad character heuristic without internal function in my code, good character heuristic was implemented by 2 consequent functions which calculate length of suffix when pattern is partially recognised, and then - (really wanted to blame myself in time consuming, it made me spend an evening in learning it) - turbo-shift heuristic. I didn`t find anything related to this topic written on Python, nevertheless, it was easy to synthetize a code by watching websites I mentioned.
Thus, I got it in my gist
So, It passed tests and I`m glad for now, if you are interested and it can help you or someone else, it will be very nice