-
-
Save ameerkat/626643 to your computer and use it in GitHub Desktop.
# Boyer Moore String Search implementation in Python | |
# Ameer Ayoub <[email protected]> | |
# Generate the Bad Character Skip List | |
def generateBadCharShift(term): | |
skipList = {} | |
for i in range(0, len(term)-1): | |
skipList[term[i]] = len(term)-i-1 | |
return skipList | |
# Generate the Good Suffix Skip List | |
def findSuffixPosition(badchar, suffix, full_term): | |
for offset in range(1, len(full_term)+1)[::-1]: | |
flag = True | |
for suffix_index in range(0, len(suffix)): | |
term_index = offset-len(suffix)-1+suffix_index | |
if term_index < 0 or suffix[suffix_index] == full_term[term_index]: | |
pass | |
else: | |
flag = False | |
term_index = offset-len(suffix)-1 | |
if flag and (term_index <= 0 or full_term[term_index-1] != badchar): | |
return len(full_term)-offset+1 | |
def generateSuffixShift(key): | |
skipList = {} | |
buffer = "" | |
for i in range(0, len(key)): | |
skipList[len(buffer)] = findSuffixPosition(key[len(key)-1-i], buffer, key) | |
buffer = key[len(key)-1-i] + buffer | |
return skipList | |
# Actual Search Algorithm | |
def BMSearch(haystack, needle): | |
goodSuffix = generateSuffixShift(needle) | |
badChar = generateBadCharShift(needle) | |
i = 0 | |
while i < len(haystack)-len(needle)+1: | |
j = len(needle) | |
while j > 0 and needle[j-1] == haystack[i+j-1]: | |
j -= 1 | |
if j > 0: | |
badCharShift = badChar.get(haystack[i+j-1], len(needle)) | |
goodSuffixShift = goodSuffix[len(needle)-j] | |
if badCharShift > goodSuffixShift: | |
i += badCharShift | |
else: | |
i += goodSuffixShift | |
else: | |
return i | |
return -1 | |
if __name__ == "__main__": | |
block = "This is a simple example" | |
print "This is an example search on the string \"", block, "\"." | |
print "ple :", BMSearch(block, "ple ") | |
print "example :", BMSearch(block, "example") | |
print "simple :", BMSearch(block, "simple") | |
print " imple :", BMSearch(block, " imple") |
!!Update: This code is not accurate, needs modification!!
Hello thank you for this,
As the previous person pointed out there was a slight issue with the algorithm
I fixed the BMSearch function, this is what I did:
Actual Search Algorithm
def BMSearch(source, pattern):
goodSuffix = generateSuffixShift(pattern)
badChar = generateBadCharShift(pattern)
i = len(pattern) - 1
while i < len(source)-1:
j = 0
while j < len(pattern) and pattern[len(pattern) - j - 1] == source[i - j - 1]:
j += 1
if j == len(pattern):
#pattern has been found here
#can be replaced by a counter
return i - len(pattern)
else:
t = badChar.get(source[i - j - 1])
if t is None:
t = 6
k = goodSuffix.get(j)
if k is None:
k = 0
d1 = t - j
i += max(d1, k)
return -1
@dhritix1999 your search algorithm does not seem to be correct. It is outputting -1 on "example".
@dhritix1999 your search algorithm does not seem to be correct. It is outputting -1 on "example".
Hello @kartikeypro, I had forgotten to update this thread but I will clarify it here. Thank you for pointing it out, after testing it out several times I noticed that my code is also not accurate. The problem is that the previous parts of the code have issues also, the original authors
BadCharacterShift() is correct, rest all are wrong. It can be fixed I suppose but I spent a lot of time fixing it and even rewriting the whole algorithm from scratch but the good character shift is the hardest to get right. I gave up because I was spending too much time on it. (This was an optional part of my project)
If you would like help on it please reach out to me, I could provide some help or my last draft of it
PS. updated my original thread.
Hello, I have recently started to deploy my efforts in the pattern recognition problem field and wondered if I could implement not only original Boyer-Moore-Horspool algorith, but its tubo-verion. There were some unsuccessful tries because of good suffix heuristic too. Hell yeah, it`s much more complicated to understand.
Thanks to these resources:
Turbo-BMH
Great explanation
Explanation too
Despite all work has been done before me and can be seen on those articles, it was quite hard to understand how it works even with step-by-step revision of code and schemes though.
I used bad character heuristic without internal function in my code, good character heuristic was implemented by 2 consequent functions which calculate length of suffix when pattern is partially recognised, and then - (really wanted to blame myself in time consuming, it made me spend an evening in learning it) - turbo-shift heuristic. I didn`t find anything related to this topic written on Python, nevertheless, it was easy to synthetize a code by watching websites I mentioned.
Thus, I got it in my gist
So, It passed tests and I`m glad for now, if you are interested and it can help you or someone else, it will be very nice
@Abusagit thank you so much for the reply, the links are really good, Thank you for providing us with your implementation also and hopefully this will help others too :)
find a problem with you implementation... and I don't know what is wrong.
trying these as the haystack:
671111091113298117115999711432117110973297103117106973210111032117110321129710697114
and these as the needle:
11110911132
and it works... but remove the last 2 numbers from that needle, and it doesn’t ( as in 111109111)
I haven’t found another case in which something similar happens.