TomConlin · December 27, 2019 22:10
diff --git a/FastFastaRead.readme b/FastFastaRead.readme
 2019 - almost a decade ago now I helped mentor
 a CS student writing a bioinformatics exercise in C++.

 To help avoid doing his work I did not use C++.
 Instead I experimented in Julia; probably v0.2 or v0.3.

 The task was to compare ginormas metagenomics studies quickly.



 #################################################
 read a large file fast


 open file handle

 determine file length

 (assume file is contiguous on disk)

 memory map? reads larger blocks

 what is disk read cache size?

 what are intermediate cache sizes?

 what is the time to fetch a cache full?
 what is the time to process a cache full?


 "C" 100 00 1 1
 "G" 100 01 1 1

 "A" 100 00 0 1
 "T" 101 01 0 0
 "N" 100 11 1 0


 for c in "AaCcTtGgNn"  println(c,' ',c&3) end
 A 1
 a 1
 C 3
 c 3
 G 3
 g 3
 T 0
 t 0
 N 2
 n 2


 for c in "AaCcTtGgNn" println (c, ' ', c&7) end
 A 1
 a 1
 C 3
 c 3
 T 4
 t 4
 G 7
 g 7
 N 6
 n 6


 0,1,2,3
 [A,C,T,G]


 Translating ASCII DNA into a two bit representation
 using bitwise  AND 3 on the nt-base works great
 the fact this is case insensitive is due to the
 brilliant ASCII designers of yore.
 Having to deal with unknown 'N' bases complicates
 processing and for now introduces a branch condition.

 Bitwise AND with 7 followed by a bitwise shift allows
 filtering for N's

 c&=7
 6==c ? c=rand(0:3) : c=c<<1


 Another option is a 3/4 bit representation
 three bit seems awkward for processing and leaves us with three "slots" undefined
 hmmm.
 any representation less than a word (including byte)
 would benefit from a "padding" flag
 indicating the representation is to be ignored
 so the ALU could process full words.


 Four bit (nibble) is half a byte and would be easier to process and leaves 11 "slots"
 which may be enough handle nucleotide ambiguity codes ...
 or indels or carry a coarse quality score.


 For some tasks, e.g. find GC content
 the representation is irrelevant as we do not store and reuse the sequence.
 in which case the 2-bit transformation is sufficient
 or the 3-bit transformation if 'N' is present

 With N present there are three options
 	N is always GC 		   (left as an exercise)
 	N is never GC			GC is nt&1*(nt>>1)&1
 	N is sometimes GC		CG is nt&=7;6==nt?rand(0:1):(nt>>1)&1


 #####################################################################

 A 4bit (nibble) level encoding compatible with two-bit encoding
 which allows gaps and ambiguity but requires lookup tables.


 0000 A
 0001 C
 0010 T
 0011 G

 	0100 !A  'B'
 	0101 !C	 'D'
 	0110 !T  'V'
 	0111 !G  'H'

 		1000 !TA
 		1001 !TC
 		1010 !CA
 		1011 !TG

 			1100 !AG
 			1101 !CG

 			1110 <gap>

 			1111 N


 0,1,2,3
 [A,C,T,G]
 ###############################################################

 translating the 4 base nucleotides from the two-bit (dense) representation
 to a four-bit (sparse) representation (flag based)
 mathematically b4 = 2^b2
 but that is apt to be more expensive than necessary

 b4=1<<(b2+1) is cheaper

 1,2,4,8
 [A,C,T,G]


 A nibble level encoding for comparison/alignment with ambiguity
 which does not need lookup for comparison

 AND-ing two encodings results in an alignment
 with the less ambiguous of two retained when matched.


 a & b -> 0  no match
      -> !0 match

 	nyb		nt(s)		ambiguity

 0	0000	<gap>	 	-
 1	0001	A		 	A*
 2	0010	C		 	C*
 3	0011	AC			M
 4	0100	T			T*
 5	0101	TA			W
 6	0110	TC			Y
 7	0111	TCA			H
 8	1000	G			G*
 9	1001	GA			R
 10	1010	GC			S
 11	1011	GCA			V
 12	1100	GT			K
 13	1101	GTA			D
 14	1110	GTC			B
 15	1111	GTCA		N


 gaps can be used as padding to fill to the end of a machine word



 given sequence from the restricted alphabet "AaCcTtUuGgNn"
 (not the full ambiguity code)




 c&=7
 6==c ? rand(0:3) : 1<<(c>>1)

 will encode the four primary bases without lookup
 translating N to random base at the cost of a branch.


 or
 c&=7;
 6==c ? CONST : 1<<(c>>1)

 CONST = 15 will let N match everything.
 CONST =  0 will make N match nothing.


 or 1<<(c&7)>>1

 will make N  always be a 'G'  which will be the cheapest


 for the restricted alphabet, ANDing two 64bit words
 (16 nucleotide bases encoded as nibbles within a 64bit word)
 the only case where more than one bit in a nibble is set when matched is:
 when an N is compared with N _and_ N is set to match anything
 then the nibble will have all 4 bits set.
 short of that pathological case,

 the number of bases which matched is the 'Hamming Weight'
 or 'population count' of number of bits set in the resulting word.

 hence in julia

 16 == popcount( seqA[word_n] & seqB[word_n] )

 means all 16 bases in both sequences matched
 note: popcount() is a native machine instruction in many
 modern architectures which performs "sideways addition.

 so a popcount() of  15 ~> 93.75% Similar and 14 -> 87.5% Similar

 #########################################################################


 k-mers

 N's and ambiguity have no place
 they can neither be nor match 'nothing'.
 they must be constant i.e 'G' or random.

 tabulating what a sliding window of a size represents,

 2-bit representation 32-mer
 3-bit representation 21-mer
 4-bit representation 16-mer

 existence is reportedly all that is being recorded

 K-mer Index is:
 	a sparse structure?
 	a bitvector?

 for 2 bit dense encoding

 4^32 =  1.844674407×10¹⁹ flags take /8
 		2.305843009×10¹⁸ bytes or /1024
 		2.251799814×10¹⁵ KB    or /1024
 	    2.199023256×10¹² MB    or /1024
 	    2,147,483,648 GB       or /1024
 	    2,097,152 TB           or /1024
 	    2,048 PetB             or /1024
 	    2 ExoB
 will not be doing a bit vector for 32-mers then



 4 bit flag encoding

  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000  greatest
 - 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001  least
  ____________________________________________________
  0111011101110111011101110111011101110111011101110111011101110111


 2,101,679,826,000,000  ... but there far fewer possible sequences than that

 4^16 = 4,294,967,296 flags take / 8
 		536,870,912 bytes  or    / 1024
 		524,288 KB or            / 1024\q
 		512M

 that could almost live in a L3 cache or two and be indexed directly
 with a half word


 @time index=BitArray(2^32-1)
 # ~17ms

 # break apart the fasta records and feed the sequence to the mer-indexer
 function process_fasta(filepath::ASCIIString, fx::function, merdex::BitArray{1} )
 	mer::Int32
 	row::ASCIIString=""
    rows::Array{ASCIIString} = open(readlines, filepath)
 	for row in rows
 		if('>' == first(row))
 			continue
 		else
 			fx(row,merdex)
 		end
 	end
 end

 # finds hexamers in a sequence & records their existence in an index
 function hexamer(seq::ASCIIString, merdex::BitArray{1})
 	mer=0
 	# prime
 	for i 0:15 mer|=((seq[i+1]&7)>>1)<<2i end
 	merdex[mer]=true
 	for i 17:length(row)-17
 		mer=mer<<2
 		mer|=(seq[i]&7)>>1
 		merdex[mer]=true
 	end
 end # ~hexamer
	2019 - almost a decade ago now I helped mentor
	a CS student writing a bioinformatics exercise in C++.

	To help avoid doing his work I did not use C++.
	Instead I experimented in Julia; probably v0.2 or v0.3.

	The task was to compare ginormas metagenomics studies quickly.



	#################################################
	read a large file fast


	open file handle

	determine file length

	(assume file is contiguous on disk)

	memory map? reads larger blocks

	what is disk read cache size?

	what are intermediate cache sizes?

	what is the time to fetch a cache full?
	what is the time to process a cache full?


	"C" 100 00 1 1
	"G" 100 01 1 1

	"A" 100 00 0 1
	"T" 101 01 0 0
	"N" 100 11 1 0


	for c in "AaCcTtGgNn" println(c,' ',c&3) end
	A 1
	a 1
	C 3
	c 3
	G 3
	g 3
	T 0
	t 0
	N 2
	n 2


	for c in "AaCcTtGgNn" println (c, ' ', c&7) end
	A 1
	a 1
	C 3
	c 3
	T 4
	t 4
	G 7
	g 7
	N 6
	n 6


	0,1,2,3
	[A,C,T,G]


	Translating ASCII DNA into a two bit representation
	using bitwise AND 3 on the nt-base works great
	the fact this is case insensitive is due to the
	brilliant ASCII designers of yore.
	Having to deal with unknown 'N' bases complicates
	processing and for now introduces a branch condition.

	Bitwise AND with 7 followed by a bitwise shift allows
	filtering for N's

	c&=7
	6==c ? c=rand(0:3) : c=c<<1


	Another option is a 3/4 bit representation
	three bit seems awkward for processing and leaves us with three "slots" undefined
	hmmm.
	any representation less than a word (including byte)
	would benefit from a "padding" flag
	indicating the representation is to be ignored
	so the ALU could process full words.


	Four bit (nibble) is half a byte and would be easier to process and leaves 11 "slots"
	which may be enough handle nucleotide ambiguity codes ...
	or indels or carry a coarse quality score.


	For some tasks, e.g. find GC content
	the representation is irrelevant as we do not store and reuse the sequence.
	in which case the 2-bit transformation is sufficient
	or the 3-bit transformation if 'N' is present

	With N present there are three options
	N is always GC (left as an exercise)
	N is never GC GC is nt&1*(nt>>1)&1
	N is sometimes GC CG is nt&=7;6==nt?rand(0:1):(nt>>1)&1


	#####################################################################

	A 4bit (nibble) level encoding compatible with two-bit encoding
	which allows gaps and ambiguity but requires lookup tables.


	0000 A
	0001 C
	0010 T
	0011 G

	0100 !A 'B'
	0101 !C 'D'
	0110 !T 'V'
	0111 !G 'H'

	1000 !TA
	1001 !TC
	1010 !CA
	1011 !TG

	1100 !AG
	1101 !CG

	1110 <gap>

	1111 N


	0,1,2,3
	[A,C,T,G]
	###############################################################

	translating the 4 base nucleotides from the two-bit (dense) representation
	to a four-bit (sparse) representation (flag based)
	mathematically b4 = 2^b2
	but that is apt to be more expensive than necessary

	b4=1<<(b2+1) is cheaper

	1,2,4,8
	[A,C,T,G]


	A nibble level encoding for comparison/alignment with ambiguity
	which does not need lookup for comparison

	AND-ing two encodings results in an alignment
	with the less ambiguous of two retained when matched.


	a & b -> 0 no match
	-> !0 match

	nyb nt(s) ambiguity

	0 0000 <gap> -
	1 0001 A A*
	2 0010 C C*
	3 0011 AC M
	4 0100 T T*
	5 0101 TA W
	6 0110 TC Y
	7 0111 TCA H
	8 1000 G G*
	9 1001 GA R
	10 1010 GC S
	11 1011 GCA V
	12 1100 GT K
	13 1101 GTA D
	14 1110 GTC B
	15 1111 GTCA N


	gaps can be used as padding to fill to the end of a machine word



	given sequence from the restricted alphabet "AaCcTtUuGgNn"
	(not the full ambiguity code)




	c&=7
	6==c ? rand(0:3) : 1<<(c>>1)

	will encode the four primary bases without lookup
	translating N to random base at the cost of a branch.


	or
	c&=7;
	6==c ? CONST : 1<<(c>>1)

	CONST = 15 will let N match everything.
	CONST = 0 will make N match nothing.


	or 1<<(c&7)>>1

	will make N always be a 'G' which will be the cheapest


	for the restricted alphabet, ANDing two 64bit words
	(16 nucleotide bases encoded as nibbles within a 64bit word)
	the only case where more than one bit in a nibble is set when matched is:
	when an N is compared with N _and_ N is set to match anything
	then the nibble will have all 4 bits set.
	short of that pathological case,

	the number of bases which matched is the 'Hamming Weight'
	or 'population count' of number of bits set in the resulting word.

	hence in julia

	16 == popcount( seqA[word_n] & seqB[word_n] )

	means all 16 bases in both sequences matched
	note: popcount() is a native machine instruction in many
	modern architectures which performs "sideways addition.

	so a popcount() of 15 ~> 93.75% Similar and 14 -> 87.5% Similar

	#########################################################################


	k-mers

	N's and ambiguity have no place
	they can neither be nor match 'nothing'.
	they must be constant i.e 'G' or random.

	tabulating what a sliding window of a size represents,

	2-bit representation 32-mer
	3-bit representation 21-mer
	4-bit representation 16-mer

	existence is reportedly all that is being recorded

	K-mer Index is:
	a sparse structure?
	a bitvector?

	for 2 bit dense encoding

	4^32 = 1.844674407×10¹⁹ flags take /8
	2.305843009×10¹⁸ bytes or /1024
	2.251799814×10¹⁵ KB or /1024
	2.199023256×10¹² MB or /1024
	2,147,483,648 GB or /1024
	2,097,152 TB or /1024
	2,048 PetB or /1024
	2 ExoB
	will not be doing a bit vector for 32-mers then



	4 bit flag encoding

	1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 greatest
	- 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 least
	____________________________________________________
	0111011101110111011101110111011101110111011101110111011101110111


	2,101,679,826,000,000 ... but there far fewer possible sequences than that

	4^16 = 4,294,967,296 flags take / 8
	536,870,912 bytes or / 1024
	524,288 KB or / 1024\q
	512M

	that could almost live in a L3 cache or two and be indexed directly
	with a half word


	@time index=BitArray(2^32-1)
	# ~17ms

	# break apart the fasta records and feed the sequence to the mer-indexer
	function process_fasta(filepath::ASCIIString, fx::function, merdex::BitArray{1} )
	mer::Int32
	row::ASCIIString=""
	rows::Array{ASCIIString} = open(readlines, filepath)
	for row in rows
	if('>' == first(row))
	continue
	else
	fx(row,merdex)
	end
	end
	end

	# finds hexamers in a sequence & records their existence in an index
	function hexamer(seq::ASCIIString, merdex::BitArray{1})
	mer=0
	# prime
	for i 0:15 mer\|=((seq[i+1]&7)>>1)<<2i end
	merdex[mer]=true
	for i 17:length(row)-17
	mer=mer<<2
	mer\|=(seq[i]&7)>>1
	merdex[mer]=true
	end
	end # ~hexamer