Last active
March 14, 2021 20:55
-
-
Save rchikhi/ac3d7e7b71e5547cef72a80c8e358283 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Wheeler graphs | |
Gagie, Manzini, Siren | |
Theoretical Computer Science, 2017 | |
https://www.sciencedirect.com/science/article/pii/S0304397517305285 | |
Notes of a whiteboard presentation to the Bonsai team in Lille. | |
These notes largely follow the paper. | |
Rayan Chikhi, 2019 | |
Some history (not in the paper): | |
- 1927 David Wheeler born | |
- 1951 1st PhD in compsci (Wheeler) | |
- 1953 subroutines invented, replacing GOTOs (Wheeler, Wilkes) | |
- 1963 Michael Burrows born | |
- 1983 (B)WT discovered by Wheeler | |
- 1988 Burrows PhD thesis (in Cambridge) | |
- 1994 BWT article | |
- 1995 AltaVista (Burrows wrote the indexer, Monier the crawler) | |
- 1996 bzip2 (Seward, same author as valgrind) | |
- 2000 FM-index | |
- 2005 XBWT | |
- 2012 BOSS | |
- 2017 Wheeler graphs | |
Due to seemingly unrelated BWT variants on tree, graphs, alignments, | |
(some not even invertible nor compressible,) paper starts with the | |
blind man / elephant analogy: | |
one touching a leg thinking it's a tree, | |
the trunk thinking it's a snake, | |
the tail thinking it's a rope. | |
Setting: | |
- totally ordered alphabet | |
- directed (multi-)graph, | |
each edge labeled by a single character | |
Wheeler graph: | |
- there exists an ordering (not necessarily total) | |
of the nodes such that: | |
- nodes with in-degree=0 precede all others | |
- given two edges: | |
u --a --> v | |
u' --a'--> v' | |
a < a' => v < v' | |
(else, a=a') u < u' => v <= v' | |
Example: | |
[ ] -a-> [ ] -b-> [ ] | |
\____________b___/^ | |
is a Wheeler graph, with possible node order | |
1 2 3 | |
But if we changed e.g. the first b into a c, it wouldn't remain a Wheeler graph. | |
Consequence: | |
- all edges entering a given node have the same label | |
(otherwise, u,u' be the in-neighbors of v=v', a<a' implies v!=v', contrad) | |
Succinct encoding of Wheeler graphs: | |
- binary vectors I and O of in/out-degrees (resp.) unary encoded using 0's separated by 1's, | |
- string L of outgoing edges labels, given in the order of the nodes, | |
- C array (number of edges with label <=c). | |
For the graph above: | |
1 -a-> 2 -b-> 3 | |
\________b__/^ | |
O = 001011 | |
L = ab b | |
I = 101001 | |
*In the paper, the authors directly present Path Coherence, but instead I've started with the motivation example* | |
Motivating example: | |
- Without using suffix array or BWT, | |
build a data structure supporting efficient substring membership queries within a string s | |
- using a deterministic finite automaton (DFA) -> the suffix trie of s | |
- using a non-deterministic finite automaton (NFA): | |
- linear path consisting of all characters of s, | |
- as well as all links from root (=epsilon-transitions) | |
- all states are accepting | |
Recall the definitions from Wikipedia: | |
- DFA and NFA = (states, alphabet, transition function, start state, accept states) | |
- transition function in DFA: (state,letter) -> state | |
- in NFA: (state,letter) -> set of states | |
- DFA accepts s = exists sequences of states start,r_1,..,r_n\in accept, such that r_{i+1} = transition(r_i,s_i) | |
- NFA, same but r_{i+1} \in transition(..) | |
Rest of motivating example: | |
- DFA is a Wheeler graph | |
- NFA is not, we need to: | |
- eliminate epsilon-transitions, | |
because all incoming nodes don't have same label in original NFA (but in Wheeler graphs, they do), | |
- make all states initial | |
- order nodes according to reversed lexi order of prefixes leading to that node | |
Example with s=ABRACA (not in the Wheeler graph paper - my own): | |
NFA (with nodes labelled by the prefix just for visual clarity): | |
epsilon | |
-A-> A | |
-B-> AB | |
-R-> ABR | |
-A-> ABRA | |
-C-> ABRAC | |
-A-> ABRACA | |
reverse lexicographical prefixes: | |
epsilon 0 | |
A 1 | |
ABRACA 2 | |
ABRA 3 | |
AB 4 | |
ABRAC 5 | |
ABR 6 | |
Wheeler graph: the NFA above but with node order given by the above prefixes, and discarding node labels | |
0123456 | |
L = AB$CRAA | |
(+ the C, O and I arrays) | |
Relationship between the succinct Wheeler graph representation and BWT: | |
- since ABRACA has no out-neighbors, add a fictional one labeled by $ | |
- compare with BWT of reverse(ABRACA$)=$ACARBA: | |
$ACARBA | |
A$ACARB | |
ACARBA$ | |
ARBA$AC | |
BA$ACAR | |
CARBA$A | |
RBA$ACA | |
BWT($ACARBA) = AB$CRAA = L | |
Path coherence: | |
- if there exists a total order on the nodes, and | |
- now fix a consecutive range [i,j] and a string s | |
- consider nodes reachable from nodes [i,j] in |s| steps, following labels from s | |
- path coherence means those nodes also form a consecutive range [i',j'] | |
Lemma: | |
- Wheeler graph => path coherent | |
Proof: | |
- consider [i,j] a consecutive range and some character a | |
- let [i',j'] be the smallest range that contain all nodes reacheable from [i,j] in one step following an edge labelled by a | |
- note that (by choice of that range) i' and j' have positive in-degree | |
- by def of Wheeler graphs (nodes with in-degree 0 precede all others), all [i',j'] have positive in-degree | |
- assume there exists i'<v<j' having an incoming edge a'!=a | |
- i' < v => a < a' | |
- v < j => a' < a (Wheeler def + modus tollens) | |
- contradiction | |
- so all nodes in [i', j'] are labelled by a | |
- furthermore, now that we know that all labels are equals, | |
using last part of Wheeler def + modus tollens again, | |
all nodes with destination i'<=x<=j' must originate from [i,j] | |
- so to recap, by construction, nodes reachable from [i,j] are in the range [i',j'] | |
- but as shown above, all nodes in [i',j'] originate from [i,j] | |
- it follows that nodes reachable from [i,j] with label a are those in the consecutive range [i',j'] | |
- rest is induction on string length | |
Data structure result: | |
- n nodes, e edges, alphabet A | |
- 2(e+n) + e log(|A|) + |A|log(e) + o(n+elog(|A|)) bits. | |
- forward and backward *edge* traversal in O(log(|A|)) [-- for backward I don't yet see how, the authors don't go into details.] | |
[For backward I can see how to determine the incoming edges label in O(|A|): | |
- test for all letters whether C[c] < select(I,j) - j <=C[c+1] | |
(following the inequality, given in the paper, for testing if x_j has all incoming edges labeled c) | |
- the c that works gives the incoming edge label | |
- but this doesn't give each incoming node index] | |
Then there is another result given on smaller space encoding, based on entropy of labels, | |
but I had some problems understanding that paragraph due to the redefinition of $L_i$/$L_k$ notations. | |
Theorem (6): | |
- given a FA (DFA or NFA in fact), no epsilon-transitions, 1 initial state or all of them initial | |
- if it's a Wheeler graph, then it can be stored compactly in 2(e+n)+... space | |
- and can determine string acceptance in O(|s|log(|A|)) time | |
Proof: | |
- If it's a Wheeler graph, it's path coherent (Lemma proved above) | |
- Consider the initial states: they form a consecutive range | |
- Thus (by def of path coherence) nodes reachable from initial states form a consecutive range | |
- By induction, one can go form the interval of each prefix of s to the interval of the next prefix in O(log|A|) | |
Other Wheeler graphs: | |
- collection of cyclic strings, circular queries | |
- collection of strings through a trie | |
- a trie is a Wheeler graph | |
- XBWT (same as in tries) | |
- dBGs (BOSS without the dummy $..$ nodes) | |
- FM-index of alignments (many authors incl. Lecroq) | |
- GCSA | |
- PBWT | |
- wavelet trees | |
*This is all the second half of the Wheeler graph paper, but I skipped those other graphs in my presentation* | |
I added an example for tries (note: not suffix tries), but didn't have time to present it in the hour. | |
Consider a set of strings: | |
ABC | |
BAC | |
ABA | |
ACA | |
Corresponding trie: | |
. | |
/ \ | |
A B | |
/ \ | | |
B C A | |
/ \ | | | |
A C A C | |
- it's a Wheeler graph | |
- for substring queries, all states need to be initial | |
- also all states are accepting | |
order of nodes in Wheeler graph: | |
epsilon 0 | |
A 1 | |
BA 2 | |
ABA 3 | |
ACA 4 | |
B 5 | |
AB 6 | |
AC 7 | |
BAC 8 | |
ABC 9 | |
0 1 2 345 6 7 89 | |
O = 0010010111010010111 | |
L = AB BC C A AC A | |
Search for substring 'CA' in that graph: | |
- all states are initial so the range when traversing 'C' is [7,9] | |
- it can be obtained from the C array | |
- then, need to traverse graph by following 'A' from that range | |
- transition(7,'A') = 4 | |
- transition(8,'A') = none | |
- transition(9,'A') = none | |
- The image range is thus {4}, telling us 'CA' is there | |
[- it remains to understand why we can get away with not querying the whole range, given those 'none'. I'm likely missing something; maybe using the O array] | |
Another example, query of 'AC': | |
- range for 'A': [1,4] | |
- transition(1,'C') = 7 | |
- transition(4,'C') = none | |
- transition(3,'C') = none | |
- transition(2,'C') = 8 | |
- result range: [7,8] | |
(Just out of curiosity: if we had put $'s at the leaves of the trie, the encoding would be: | |
L = ABBCC$$AACA$$ | |
This isn't the BWT of ABA$ABC$ACA$BAC; one can see this because already they're of different lengths.) | |
--- | |
Questions from the audience: | |
- Where do Bowtie/BWA fit in the history timeline? [2009] | |
- Other articles used Wheeler graphs? [yes: tunneling, read indexing, and last week: Wheeler language] |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment