I'll use x = "processing"
as an example. Throughout the example *
represents the null character, which sorts before any other character (i.e. * < a < b < c < ...
).
We start by creating three groups of suffixes - S0
, S1
and S2
- so that each suffix in the group Sk
starts at the index 3q + k
for some value of q
. In plain English:
S0
suffixes start at positions 0, 3, 6, etc.S1
suffixes start at positions 1, 4, 7, etc.S2
suffixes start at positions 2, 5, 8, etc.
The groups for "processing"
are:
S0 = 0: processing
3: cessing
6: sing
9: g
S1 = 1: rocessing
4: essing
7: ing
S2 = 2: ocessing
5: ssing
8: ng
We then construct S12
, which is made up of the 3-grams of suffixes from S1
and S2
. This is shown below, with the starting indexes above each 3-gram (this will be helpful later).
1 4 7 2 5 8
S12 = roc | ess | ing || oce | ssi | ng*
Because the elements of S12
are all 3 characters long, we can sort them in linear time using a bucket/radix sort and rename each 3-letter block with it's rank, as follows:
1 4 7 2 5 8
S12 = roc | ess | ing || oce | ssi | ng*
4 0 1 3 5 2
The renamed S12
is now just "401352"
, which we can use as the ranks for suffixes in S1
and S2
, meaning we have sorted those parts of the whole list. It contains no duplicates so no recursive step is needed; if there were duplicates here, we would apply the entire algorithm again to that string until we can uniquely sort the suffixes in S1
and S2
. Because the problem shrinks with each recursion, the algorithm remains linear (see recursion statement at the end).
Now that S1
and S2
are sorted, we can use them to sort the elements of S0
by replacing each suffix with a pair made up of it's first letter and the rank of it's remainder (denoted R(...)
below). The ranks of the remainder can be read from the result above (which is why it is useful to write the start indexes of the 3-grams in S12
). By re-mapping suffixes to pairs, we can sort S0
in linear time. This is done as follows:
Rank
0: processing -> (p, R(rocessing)) -> (p, R(1)) -> (p, 4) -> 2
3: cessing -> (c, R(essing)) -> (c, R(4)) -> (c, 0) -> 0
6: sing -> (s, R(ing)) -> (s, R(7)) -> (s, 1) -> 3
9: g -> (g, R(*)) -> (g, R(10)) -> (g, *) -> 1
We've now sorted S1
and S2
into one list and S0
into another list, both in linear time. If we can merge them in linear time, we have a linear algorithm. We can do this by replacing chunks of suffixes with the ranks of those chunks to create pairs and triples, just as we did to sort S0
.
In the table below, the top row gives the starting indexes (i
) of the sorted suffixes in S0
first, then the starting indexes of the sorted suffixes in S1
and S2
(both created by just reading from earlier results). The next two rows show the pairs (1c
) and triples (2c
) that will be used to compare elements in constant time.
Pairs are formed by reading the first character of a suffix and the rank of the rest of it; triples are formed by reading the first two characters and the rank of whatever is left. Different items have a pair, triple or both based on the following properties:
- Suffixes from
S0
can form a pair by reading one character and a rank fromS1
and a triple by reading two characters and a rank fromS2
. - Suffixes from
S1
can form a pair by reading one character and a rank fromS2
. - Suffixes from
S2
can form a triple by reading two characters and a rank fromS1
.
We always want to read ranks from S1
or S2
so that comparisons are consistent.
| i | 3 | 9 | 0 | 6 || 4 | 7 | 8 | 2 | 1 | 5 |
|-----|-----------------------||-----------------------------------|
| 1c | c0 g* p4 s1 || e5 i2 r3 |
| 2c | ce5 g** pr3 si2 || ng* oc0 ss1 |
Now that each suffix can be represented as a pair or triple we can compare them in constant time, so the lists can be merged easily:
- Comparing 3 vs. 4:
c0
"beats"e5
, soSA[0] = 3
. - Comparing 9 vs. 4:
e5
"beats"g*
, soSA[1] = 4
. - Comparing 9 vs. 7:
g*
"beats"i2
, soSA[2] = 9
. - etc.
After the lists are fully merged, we get the following result:
SA = 3 (cessing)
4 (essing)
9 (g)
7 (ing)
8 (ng)
2 (ocessing)
0 (processing)
1 (rocessing)
6 (sing)
5 (ssing)
The running time of the algorithm is in the form T(n) = T(2n/3) + O(n)
which resolves to O(n)
.