Last active
August 29, 2015 14:04
-
-
Save rchikhi/20fa292238082130669a to your computer and use it in GitHub Desktop.
Illustration of Algorithm 1 in RopeBWT2 article
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This document is a partial presentation of the RopeBWT2 pre-print | |
http://arxiv.org/abs/1406.0426 | |
It is the transcript of a presentation made within the Medvedev group at Penn State | |
in July 2014. It focuses on illustrating some notions from the methods, and | |
illustrating and proving Algorithm 1. While this document does not cover the main | |
contribution of the RopeBWT2 paper, I hope it can be helpful towards understanding | |
the theoretical foundations that led to Algorithms 2 and 3. | |
BWT | |
--- | |
First, recall how the Burrows Wheeler Transform of a word s, noted BWT(s), is computed. | |
As an example, let s=abra$, where $ is a special sentinel character. It is set to be | |
lexicographically smaller than all other letters. | |
Consider the suffixes of s, indexed by position of the first letter: | |
0 abra$ | |
1 bra$ | |
2 ra$ | |
3 a$ | |
4 $ | |
Now sort them by lexicographical order (effectively permuting the indices): | |
4 $ | |
3 a$ | |
0 abra$ | |
1 bra$ | |
2 ra$ | |
Note that the suffix array of "abra$" is the resulting permutation of the indices, | |
i.e. [4, 3, 0, 1, 2]. | |
Now transform each suffix by making it circular (i.e. the last letter of s | |
is immediately followed by the first letter of s): | |
4 $abra | |
3 a$abr | |
0 abra$ | |
1 bra$a | |
2 ra$ab | |
The BWT is obtained by reading the last character of each suffix in this sorted order. | |
BWT(abra$) = ar$ab | |
It is possible to do the inverse operation of reconstructing the original string from | |
its BWT, i.e. UNBWT(ar$ab) = abra$. | |
See the Wikipedia page of the BWT | |
(http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform#Explanation) | |
for the inverse algorithm. | |
Segmentation of the BWT | |
----------------------- | |
Let B be the BWT of a string s. B can be segmented as follows: for a character c that | |
appears in s, let B_c be a sub-sequence of the BWT, obtained by taking the last letters | |
of all suffixes that start with c. By construction, those letters are consecutive in the | |
BWT, because all suffixes starting with c are ordered consecutively. Thus B_c is a | |
sub-string of the BWT. The BWT is the concatenation of B_c's for all characters c, in order. | |
In our previous example B = BWT(abra$), | |
B = B_$ B_a B_b B_r | |
where | |
B_$ = a | |
B_a = r$ | |
B_b = a | |
B_r = b | |
One does not need to sort the suffixes to obtain this segmentation. It suffices to | |
observe that the length of B_c is the number of occurrences of c in s. | |
Ranking and inserting | |
--------------------- | |
Along with the segmentation of the BWT, Algorithm 1 uses two more operations: | |
rank(c,k,B) returns the number of occurrences of the character c in the string B[1..k-1]. | |
insert(c,k,B) inserts the character c at position k in B | |
Algorithm 1 - Illustration | |
-------------------------- | |
Algorithm 1 is InsertIO1(B,P) in the paper, which takes as input: | |
B (the BWT of a string T#, where # is a sentinel character) | |
P (any string) | |
It outputs BWT(T#P$), where $ is a character lexicographically larger than #, | |
but lexicographically smaller than all other letters. | |
An order of sentinels is introduced in the Methods section of the paper | |
($_0 < $_1 < ... < $_(m-1)), but not in the presentation of Algorithm 1. | |
Yet, it is important. | |
We will later see that Algorithm 1 would be incorrect if we considered that # = $. | |
Let's run Algorithm 1 on an example: InsertIO1(B = "ar#ab", P = "da") | |
Recall the segmentation of B, taking into account the new characters in da$, is: | |
B_# B_$ B_a B_b B_d B_r, | |
where | |
B_# = "a" | |
B_$ = "" | |
B_a = "r#" | |
B_b = "a" | |
B_d = "" | |
B_r = "b". | |
The steps of the Algorithm are: | |
c <- "$" | |
k <- 0 = number of occurrences of $ in B | |
// first iteration of the for loop | |
i = 1, P[i] = "a" | |
insert("a", 0, B_$), now B_$ = "a" | |
k <- rank("a", 0, B_$) + [number of occurrences of "a" in B_# ] | |
<- 0 + 1 = 1 | |
c <- "a" | |
// second iteration of the for loop | |
i = 0, P[i] = "d" | |
insert("d", 1, B_a), now B_a = "rd#" | |
k <- rank("d", 1, B_a) + [number of occurrences of "d" in B_#, B_$, B_a and B_b] | |
<- 0 + 0 = 0 | |
c <- "d" | |
// third iteration of the for loop | |
i = -1, P[i] = "$" | |
insert("$", 0, B_d), now B_d = "$" | |
// we can skip the assignments of k and c as this is the last iteration | |
Finally, B = aard#a$b is the result of the algorithm. | |
One can verify that indeed, BWT(abra#da$) = aard#a$b (when doing so, recall that # < $). | |
Algorithm 1 needs ordered sentinels | |
------------------------------------ | |
A quick observation is that T must not end with the sentinel $ (the one corresponding | |
to the string that will be appended), else Algorithm 1 would be incorrect. | |
To see this, we will run Algorithm 1 on the string abra$da$ and see that it does | |
not update the BWT correctly. | |
First, observe that BWT(abra$da$) = aadr$a$b | |
(Here are the details: | |
0 abra$da$ 7 $abra$da | |
1 bra$da$a 4 $da$abra | |
2 ra$da$ab 6 a$abra$d | |
3 a$da$abr -> 3 a$da$abr | |
4 $da$abra 0 abra$da$ | |
5 da$abra$ 1 bra$da$a | |
6 a$abra$d 5 da$abra$ | |
7 $abra$da 2 ra$da$ab) | |
The first two iterations of Algorithm 1 are: | |
c <- "$" | |
k <- 1 = number of occurrences of $ in B | |
// first iteration of the for loop | |
i = 1, P[i] = "a" | |
insert("a", 1, B_$), now B_$ = "aa" | |
k <- rank("a", 1, B_$) + [ no letter is smaller than $] | |
<- 1 + 0 = 1 | |
c <- "a" | |
// second iteration of the for loop | |
i = 0, P[i] = "d" | |
insert("d", 1, B_a) | |
B_a is now "rd$" whereas it should be "dr$". | |
Proof of Algorithm 1 | |
-------------------- | |
We need to prove that the insert() operations are done at the right index, in the | |
right string B_c. | |
A similar property was proven in [Bauer 2013]. We restate it in this Lemma: | |
Lemma 1: | |
Let s be a string and i an integer in [0,|s|-1]. | |
Let n[i] be the number of suffixes smaller than s[i..] starting with s[i]. | |
Let B = BWT(s), segmented into strings B_c for each character c. | |
Then, n[i] is also the number of occurrences of s[i] in the string b, where b is the concatenation | |
of all B_c's with c < s[i+1] and B_c[0..n[i+1]-1] with c = s[i+1]. | |
Proof: | |
Let s[i..] = xyz, where x and y are characters, and z is a string. | |
Then, s[i+1..] = yz. (y = s[i+1]) | |
We are interested in the number of suffixes smaller than xyz that start with the character x. | |
This is equivalent to finding the number of (circularized) suffixes ending with x that are smaller | |
than yz. In other words, this is the number of suffixes smaller than yz that contribute to a | |
letter x in the BWT. | |
Suffixes that start with a letter strictly smaller than y are exactly those that correspond | |
to letters in B_c with c < y. Thus, those which end with x contribute to the letter x in B_c, | |
c < y. Thus, the number of letters x in B_c, c < y, is exactly the number of suffixes that | |
start with a letter strictly smaller than y and end with x. | |
Now, we need to count how many suffixes start with y, end with the letter x, and are smaller | |
than yz. These suffixes contribute to B_y. There are exactly n[i+1] suffixes that start with y and | |
are smaller than yz, thus n[i+1] letters in B_y before the letter corresponding to yz. | |
Thus, the number of suffixes that start with y, end with x, and are smaller than yz is the | |
number of letters x in B_y[1..n[i+1]-1]. | |
Putting it all together, the number of suffixes smaller than yz that end with x is the number | |
of x's in B_c, c < y, plus the number of x in B_y[1..n[i+1]-1]. QED. | |
Observe that the position k where P[i] is inserted in Algorithm 1 is exactly n[i+1], and the | |
segment B_c is for c = P[i+1]. The update rule for k follows Lemma 1. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment