Skip to content

Instantly share code, notes, and snippets.

@ameerkat
Created October 14, 2010 17:46
Show Gist options
  • Save ameerkat/626643 to your computer and use it in GitHub Desktop.
Save ameerkat/626643 to your computer and use it in GitHub Desktop.
Boyer Moore string search implementation in Python
# Boyer Moore String Search implementation in Python
# Ameer Ayoub <[email protected]>
# Generate the Bad Character Skip List
def generateBadCharShift(term):
skipList = {}
for i in range(0, len(term)-1):
skipList[term[i]] = len(term)-i-1
return skipList
# Generate the Good Suffix Skip List
def findSuffixPosition(badchar, suffix, full_term):
for offset in range(1, len(full_term)+1)[::-1]:
flag = True
for suffix_index in range(0, len(suffix)):
term_index = offset-len(suffix)-1+suffix_index
if term_index < 0 or suffix[suffix_index] == full_term[term_index]:
pass
else:
flag = False
term_index = offset-len(suffix)-1
if flag and (term_index <= 0 or full_term[term_index-1] != badchar):
return len(full_term)-offset+1
def generateSuffixShift(key):
skipList = {}
buffer = ""
for i in range(0, len(key)):
skipList[len(buffer)] = findSuffixPosition(key[len(key)-1-i], buffer, key)
buffer = key[len(key)-1-i] + buffer
return skipList
# Actual Search Algorithm
def BMSearch(haystack, needle):
goodSuffix = generateSuffixShift(needle)
badChar = generateBadCharShift(needle)
i = 0
while i < len(haystack)-len(needle)+1:
j = len(needle)
while j > 0 and needle[j-1] == haystack[i+j-1]:
j -= 1
if j > 0:
badCharShift = badChar.get(haystack[i+j-1], len(needle))
goodSuffixShift = goodSuffix[len(needle)-j]
if badCharShift > goodSuffixShift:
i += badCharShift
else:
i += goodSuffixShift
else:
return i
return -1
if __name__ == "__main__":
block = "This is a simple example"
print "This is an example search on the string \"", block, "\"."
print "ple :", BMSearch(block, "ple ")
print "example :", BMSearch(block, "example")
print "simple :", BMSearch(block, "simple")
print " imple :", BMSearch(block, " imple")
@joruzani
Copy link

find a problem with you implementation... and I don't know what is wrong.

trying these as the haystack:
671111091113298117115999711432117110973297103117106973210111032117110321129710697114

and these as the needle:

11110911132

and it works... but remove the last 2 numbers from that needle, and it doesn’t ( as in 111109111)

I haven’t found another case in which something similar happens.

@dhritix1999
Copy link

dhritix1999 commented Nov 16, 2020

!!Update: This code is not accurate, needs modification!!

Hello thank you for this,
As the previous person pointed out there was a slight issue with the algorithm
I fixed the BMSearch function, this is what I did:

Actual Search Algorithm

def BMSearch(source, pattern):
    goodSuffix = generateSuffixShift(pattern)
    badChar = generateBadCharShift(pattern)
    i = len(pattern) - 1
    while i < len(source)-1:
        j = 0
        while j < len(pattern) and pattern[len(pattern) - j - 1] == source[i - j - 1]:
            j += 1
        if j == len(pattern):
            #pattern has been found here
            #can be replaced by a counter
            return i - len(pattern)
        else:
            t = badChar.get(source[i - j - 1])
            if t is None:
                t = 6
            k = goodSuffix.get(j)
            if k is None:
                k = 0

            d1 = t - j

            i += max(d1, k)
    return -1

@kartikeypro
Copy link

@dhritix1999 your search algorithm does not seem to be correct. It is outputting -1 on "example".

@dhritix1999
Copy link

@dhritix1999 your search algorithm does not seem to be correct. It is outputting -1 on "example".

Hello @kartikeypro, I had forgotten to update this thread but I will clarify it here. Thank you for pointing it out, after testing it out several times I noticed that my code is also not accurate. The problem is that the previous parts of the code have issues also, the original authors
BadCharacterShift() is correct, rest all are wrong. It can be fixed I suppose but I spent a lot of time fixing it and even rewriting the whole algorithm from scratch but the good character shift is the hardest to get right. I gave up because I was spending too much time on it. (This was an optional part of my project)

If you would like help on it please reach out to me, I could provide some help or my last draft of it

PS. updated my original thread.

@Abusagit
Copy link

Abusagit commented Mar 12, 2021

Hello, I have recently started to deploy my efforts in the pattern recognition problem field and wondered if I could implement not only original Boyer-Moore-Horspool algorith, but its tubo-verion. There were some unsuccessful tries because of good suffix heuristic too. Hell yeah, it`s much more complicated to understand.

Thanks to these resources:
Turbo-BMH
Great explanation
Explanation too

Despite all work has been done before me and can be seen on those articles, it was quite hard to understand how it works even with step-by-step revision of code and schemes though.

I used bad character heuristic without internal function in my code, good character heuristic was implemented by 2 consequent functions which calculate length of suffix when pattern is partially recognised, and then - (really wanted to blame myself in time consuming, it made me spend an evening in learning it) - turbo-shift heuristic. I didn`t find anything related to this topic written on Python, nevertheless, it was easy to synthetize a code by watching websites I mentioned.

Thus, I got it in my gist

So, It passed tests and I`m glad for now, if you are interested and it can help you or someone else, it will be very nice

@dhritix1999
Copy link

@Abusagit thank you so much for the reply, the links are really good, Thank you for providing us with your implementation also and hopefully this will help others too :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment