The Doubling Algorithm for Suffix Array Construction

The Doubling Algorithm for Suffix Array Construction

I'll use x = "banana", m = 6 as an example. The suffixes are:

0:  banana
1:  anana
2:  nana
etc.

First, we'll work out R1, which is the rank of each suffix according to it's first letter only. Because we're just comparing single letters, we can do it in Theta(m) with a bucket sort. There will be some duplicates, and that's fine. We only have three different letters, so we give all of the A's a rank of 0, all of the B's a rank of 1, and all of the N's a rank of 2. We end up with this:

Now, we'll work out R2, which is the rank of the suffixes according to their first two letters. Instead of comparing actual characters, we can just use the ranks we figured out in the last round, using the "formula" below:

R2k[i] = rank of the pair (Rk[i], Rk[i + k])

For building R2 from R1, this means:

R2[i] = rank of the pair (R1[i], R1[i + 1])

What this states is that the R2 rank of suffix 0 is the same as the rank of the pair (R1(0), R1(1)). We're going to bucket sort according to the rank of R1(0) (i.e. "banana"), then bucket sort again within each set of matching R1(0) values according to the rank of R1(1) (i.e. "anana") to break ties.

We'll end up with this:

    R1  R2
0:  1   1,0 -> 2
1:  0   0,2 -> 1
2:  2   2,0 -> 3
3:  0   0,2 -> 1
4:  2   2,0 -> 3
5:  0   0,* -> 0

Note that the last position has * to indicate that there is no "next rank" to sort with. That's fine, because * "beats" everything, meaning the single "a" at prefix 5 will sort before the "ana" at prefix 3, even though their pairs both start with 0.

We can continue for a third round, moving up to R4 this time and using the R2 ranks (this is why it's called the "doubling" method). We'll use the same method as before:

R4[i] = rank of the pair (R2[i], R2[i + 2])

That gives us this:

    R1  R2         R4
0:  1   1,0 -> 2   2,3 -> 3
1:  0   0,2 -> 1   1,1 -> 2
2:  2   2,0 -> 3   3,3 -> 5
3:  0   0,2 -> 1   1,0 -> 1
4:  2   2,0 -> 3   3,* -> 4
5:  0   0,* -> 0   0,* -> 0

Now we have no duplicates, so we can stop. Reading right to left, we can construct the suffix (rank 0 is given to position 5, rank 1 is given to position 3, etc).

The final output:

SA = 5  (a)
     3  (ana)
     1  (anana)
     0  (banana)
     4  (na)
     2  (nana)

In this case, sorting according to the first 4 characters was enough to create a unique rank for each suffix. In the worst case, we'd need to keep going for O(log m) rounds to sort according to all m letters. The bucket sort at each iteration runs in Theta(m), so the overall running time is O(m * log m).

Hi @PhoebeWangintw, thanks for your comments.

R2k[i] is used because each rank is created by doubling the number, rather than squaring it. Elements of R2 are combined to make R4 (where doubling and squaring both work, because 2x2 = 2^2 = 4), but if you have to go further then elements of R4 are used to make R8, elements of R8 are used to make R16, etc.
Well spotted on the second point, you're absolutely right - I dug out my original paper version of these notes and that number was originally a 3 before typing them up, not a 2. I will fix that. Thankfully it doesn't change the result of the calculation in this example because the only other pair that would end up in the same first bucket is 3,*.

markormesher/doubling-algorithm.md

PhoebeWangintw commented Jan 15, 2020 •

edited

Loading

Uh oh!

markormesher commented Feb 22, 2020

Uh oh!

ywsoliman commented Jan 6, 2022

Uh oh!

markormesher commented Jan 6, 2022

Uh oh!

markormesher/doubling-algorithm.md

PhoebeWangintw commented Jan 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markormesher commented Feb 22, 2020

Uh oh!

ywsoliman commented Jan 6, 2022

Uh oh!

markormesher commented Jan 6, 2022

Uh oh!

PhoebeWangintw commented Jan 15, 2020 •

edited

Loading