The Doubling Algorithm for Suffix Array Construction
I'll use x = "banana", m = 6
as an example. The suffixes are:
0: banana
1: anana
2: nana
etc.
First, we'll work out R1
, which is the rank of each suffix according to it's first letter only. Because we're just comparing single letters, we can do it in Theta(m)
with a bucket sort. There will be some duplicates, and that's fine. We only have three different letters, so we give all of the A's a rank of 0, all of the B's a rank of 1, and all of the N's a rank of 2. We end up with this:
R1
0: 1
1: 0
2: 2
3: 0
4: 2
5: 0
Now, we'll work out R2
, which is the rank of the suffixes according to their first two letters. Instead of comparing actual characters, we can just use the ranks we figured out in the last round, using the "formula" below:
R2k[i] = rank of the pair (Rk[i], Rk[i + k])
For building R2
from R1
, this means:
R2[i] = rank of the pair (R1[i], R1[i + 1])
What this states is that the R2
rank of suffix 0 is the same as the rank of the pair (R1(0), R1(1))
. We're going to bucket sort according to the rank of R1(0)
(i.e. "banana"), then bucket sort again within each set of matching R1(0)
values according to the rank of R1(1)
(i.e. "anana") to break ties.
We'll end up with this:
R1 R2
0: 1 1,0 -> 2
1: 0 0,2 -> 1
2: 2 2,0 -> 3
3: 0 0,2 -> 1
4: 2 2,0 -> 3
5: 0 0,* -> 0
Note that the last position has *
to indicate that there is no "next rank" to sort with. That's fine, because *
"beats" everything, meaning the single "a" at prefix 5 will sort before the "ana" at prefix 3, even though their pairs both start with 0.
We can continue for a third round, moving up to R4
this time and using the R2
ranks (this is why it's called the "doubling" method). We'll use the same method as before:
R4[i] = rank of the pair (R2[i], R2[i + 2])
That gives us this:
R1 R2 R4
0: 1 1,0 -> 2 2,3 -> 3
1: 0 0,2 -> 1 1,1 -> 2
2: 2 2,0 -> 3 3,3 -> 5
3: 0 0,2 -> 1 1,0 -> 1
4: 2 2,0 -> 3 3,* -> 4
5: 0 0,* -> 0 0,* -> 0
Now we have no duplicates, so we can stop. Reading right to left, we can construct the suffix (rank 0 is given to position 5, rank 1 is given to position 3, etc).
The final output:
SA = 5 (a)
3 (ana)
1 (anana)
0 (banana)
4 (na)
2 (nana)
In this case, sorting according to the first 4 characters was enough to create a unique rank for each suffix. In the worst case, we'd need to keep going for O(log m)
rounds to sort according to all m
letters. The bucket sort at each iteration runs in Theta(m)
, so the overall running time is O(m * log m)
.
Hi, thanks for the nice tutorial!
However, I found two possible mistakes.
The first one lies in the formula of calculating the rank of the suffix
Is it:
This will make R2 and R4 more clear.
eg. When k=2 (R4=R2^2):
Another possible mistake is:
The number in [] should be 3.
Still thanks again for the tutorial! :)