Created
December 27, 2019 22:10
-
-
Save TomConlin/6cd976151d36dd3e2a9fb34842b9c66e to your computer and use it in GitHub Desktop.
fasta binary formats exploration with Julia (v0.2?)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
2019 - almost a decade ago now I helped mentor | |
a CS student writing a bioinformatics exercise in C++. | |
To help avoid doing his work I did not use C++. | |
Instead I experimented in Julia; probably v0.2 or v0.3. | |
The task was to compare ginormas metagenomics studies quickly. | |
################################################# | |
read a large file fast | |
open file handle | |
determine file length | |
(assume file is contiguous on disk) | |
memory map? reads larger blocks | |
what is disk read cache size? | |
what are intermediate cache sizes? | |
what is the time to fetch a cache full? | |
what is the time to process a cache full? | |
"C" 100 00 1 1 | |
"G" 100 01 1 1 | |
"A" 100 00 0 1 | |
"T" 101 01 0 0 | |
"N" 100 11 1 0 | |
for c in "AaCcTtGgNn" println(c,' ',c&3) end | |
A 1 | |
a 1 | |
C 3 | |
c 3 | |
G 3 | |
g 3 | |
T 0 | |
t 0 | |
N 2 | |
n 2 | |
for c in "AaCcTtGgNn" println (c, ' ', c&7) end | |
A 1 | |
a 1 | |
C 3 | |
c 3 | |
T 4 | |
t 4 | |
G 7 | |
g 7 | |
N 6 | |
n 6 | |
0,1,2,3 | |
[A,C,T,G] | |
Translating ASCII DNA into a two bit representation | |
using bitwise AND 3 on the nt-base works great | |
the fact this is case insensitive is due to the | |
brilliant ASCII designers of yore. | |
Having to deal with unknown 'N' bases complicates | |
processing and for now introduces a branch condition. | |
Bitwise AND with 7 followed by a bitwise shift allows | |
filtering for N's | |
c&=7 | |
6==c ? c=rand(0:3) : c=c<<1 | |
Another option is a 3/4 bit representation | |
three bit seems awkward for processing and leaves us with three "slots" undefined | |
hmmm. | |
any representation less than a word (including byte) | |
would benefit from a "padding" flag | |
indicating the representation is to be ignored | |
so the ALU could process full words. | |
Four bit (nibble) is half a byte and would be easier to process and leaves 11 "slots" | |
which may be enough handle nucleotide ambiguity codes ... | |
or indels or carry a coarse quality score. | |
For some tasks, e.g. find GC content | |
the representation is irrelevant as we do not store and reuse the sequence. | |
in which case the 2-bit transformation is sufficient | |
or the 3-bit transformation if 'N' is present | |
With N present there are three options | |
N is always GC (left as an exercise) | |
N is never GC GC is nt&1*(nt>>1)&1 | |
N is sometimes GC CG is nt&=7;6==nt?rand(0:1):(nt>>1)&1 | |
##################################################################### | |
A 4bit (nibble) level encoding compatible with two-bit encoding | |
which allows gaps and ambiguity but requires lookup tables. | |
0000 A | |
0001 C | |
0010 T | |
0011 G | |
0100 !A 'B' | |
0101 !C 'D' | |
0110 !T 'V' | |
0111 !G 'H' | |
1000 !TA | |
1001 !TC | |
1010 !CA | |
1011 !TG | |
1100 !AG | |
1101 !CG | |
1110 <gap> | |
1111 N | |
0,1,2,3 | |
[A,C,T,G] | |
############################################################### | |
translating the 4 base nucleotides from the two-bit (dense) representation | |
to a four-bit (sparse) representation (flag based) | |
mathematically b4 = 2^b2 | |
but that is apt to be more expensive than necessary | |
b4=1<<(b2+1) is cheaper | |
1,2,4,8 | |
[A,C,T,G] | |
A nibble level encoding for comparison/alignment with ambiguity | |
which does not need lookup for comparison | |
AND-ing two encodings results in an alignment | |
with the less ambiguous of two retained when matched. | |
a & b -> 0 no match | |
-> !0 match | |
nyb nt(s) ambiguity | |
0 0000 <gap> - | |
1 0001 A A* | |
2 0010 C C* | |
3 0011 AC M | |
4 0100 T T* | |
5 0101 TA W | |
6 0110 TC Y | |
7 0111 TCA H | |
8 1000 G G* | |
9 1001 GA R | |
10 1010 GC S | |
11 1011 GCA V | |
12 1100 GT K | |
13 1101 GTA D | |
14 1110 GTC B | |
15 1111 GTCA N | |
gaps can be used as padding to fill to the end of a machine word | |
given sequence from the restricted alphabet "AaCcTtUuGgNn" | |
(not the full ambiguity code) | |
c&=7 | |
6==c ? rand(0:3) : 1<<(c>>1) | |
will encode the four primary bases without lookup | |
translating N to random base at the cost of a branch. | |
or | |
c&=7; | |
6==c ? CONST : 1<<(c>>1) | |
CONST = 15 will let N match everything. | |
CONST = 0 will make N match nothing. | |
or 1<<(c&7)>>1 | |
will make N always be a 'G' which will be the cheapest | |
for the restricted alphabet, ANDing two 64bit words | |
(16 nucleotide bases encoded as nibbles within a 64bit word) | |
the only case where more than one bit in a nibble is set when matched is: | |
when an N is compared with N _and_ N is set to match anything | |
then the nibble will have all 4 bits set. | |
short of that pathological case, | |
the number of bases which matched is the 'Hamming Weight' | |
or 'population count' of number of bits set in the resulting word. | |
hence in julia | |
16 == popcount( seqA[word_n] & seqB[word_n] ) | |
means all 16 bases in both sequences matched | |
note: popcount() is a native machine instruction in many | |
modern architectures which performs "sideways addition. | |
so a popcount() of 15 ~> 93.75% Similar and 14 -> 87.5% Similar | |
######################################################################### | |
k-mers | |
N's and ambiguity have no place | |
they can neither be nor match 'nothing'. | |
they must be constant i.e 'G' or random. | |
tabulating what a sliding window of a size represents, | |
2-bit representation 32-mer | |
3-bit representation 21-mer | |
4-bit representation 16-mer | |
existence is reportedly all that is being recorded | |
K-mer Index is: | |
a sparse structure? | |
a bitvector? | |
for 2 bit dense encoding | |
4^32 = 1.844674407×10¹⁹ flags take /8 | |
2.305843009×10¹⁸ bytes or /1024 | |
2.251799814×10¹⁵ KB or /1024 | |
2.199023256×10¹² MB or /1024 | |
2,147,483,648 GB or /1024 | |
2,097,152 TB or /1024 | |
2,048 PetB or /1024 | |
2 ExoB | |
will not be doing a bit vector for 32-mers then | |
4 bit flag encoding | |
1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 greatest | |
- 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 least | |
____________________________________________________ | |
0111011101110111011101110111011101110111011101110111011101110111 | |
2,101,679,826,000,000 ... but there far fewer possible sequences than that | |
4^16 = 4,294,967,296 flags take / 8 | |
536,870,912 bytes or / 1024 | |
524,288 KB or / 1024\q | |
512M | |
that could almost live in a L3 cache or two and be indexed directly | |
with a half word | |
@time index=BitArray(2^32-1) | |
# ~17ms | |
# break apart the fasta records and feed the sequence to the mer-indexer | |
function process_fasta(filepath::ASCIIString, fx::function, merdex::BitArray{1} ) | |
mer::Int32 | |
row::ASCIIString="" | |
rows::Array{ASCIIString} = open(readlines, filepath) | |
for row in rows | |
if('>' == first(row)) | |
continue | |
else | |
fx(row,merdex) | |
end | |
end | |
end | |
# finds hexamers in a sequence & records their existence in an index | |
function hexamer(seq::ASCIIString, merdex::BitArray{1}) | |
mer=0 | |
# prime | |
for i 0:15 mer|=((seq[i+1]&7)>>1)<<2i end | |
merdex[mer]=true | |
for i 17:length(row)-17 | |
mer=mer<<2 | |
mer|=(seq[i]&7)>>1 | |
merdex[mer]=true | |
end | |
end # ~hexamer |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment