IUPAC nucleic acid notation

##Alphabet α##

Let's consider the most general case of IUPAC encoding. Our alphabet α is:

α = { T, K, H, Y, G, C, W, V, A, D, S, M, B, R, N }

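| symbol | reverse complement | options | full name |
|--------|--------------------|---------|-----------|
| T | A | T | Thymine |
| K | M | G, T | Keto |
| H | D | A, C, T | Not G |
| Y | R | T, C | Pyrimidine |
| G | C | G | Guanine |
| C | G | C | Cytosine |
| W | W | A, T | Weak bonds |
| V | B | G, C, A | Not T |
| A | T | A | Adenine |
| D | H | G, A, T | Not C |
| S | S | G, C | Strong bonds |
| M | K | A, C | Amino |
| B | V | G, T, C | Not A |
| R | Y | G, A | Purine |
| N | N | A, G, C, T | Any |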
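
The table can be written down directly as two lookup tables. The following is a minimal sketch; the names `IUPAC_OPTIONS`, `IUPAC_COMPLEMENT`, `reverse_complement` and `matches` are illustrative and do not come from the original text.

```python
# Concrete bases each IUPAC symbol may stand for (the "options" column).
IUPAC_OPTIONS = {
    "T": "T", "K": "GT", "H": "ACT", "Y": "TC", "G": "G",
    "C": "C", "W": "AT", "V": "GCA", "A": "A", "D": "GAT",
    "S": "GC", "M": "AC", "B": "GTC", "R": "GA", "N": "AGCT",
}

# The "reverse complement" column; for a single symbol this is its complement.
IUPAC_COMPLEMENT = {
    "T": "A", "K": "M", "H": "D", "Y": "R", "G": "C",
    "C": "G", "W": "W", "V": "B", "A": "T", "D": "H",
    "S": "S", "M": "K", "B": "V", "R": "Y", "N": "N",
}


def reverse_complement(word: str) -> str:
    """Complement every symbol of a word over α and reverse the order."""
    return "".join(IUPAC_COMPLEMENT[symbol] for symbol in reversed(word))


def matches(symbol: str, base: str) -> bool:
    """True if the concrete base is one of the options of the symbol."""
    return base in IUPAC_OPTIONS[symbol]
```

For example, `reverse_complement("ATKN")` complements each symbol and reverses the word, and `matches("K", "G")` holds because G is one of the options of K.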

##Fragment ε##

A fragment ε is defined as the triplet ε = { n, o, l } where:

n is the nibble number
o is the offset in nibble n to the start of the fragment
l is the length of the fragment in nucleotides
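
As an illustration, a fragment can be modelled as a small record and used to slice a nucleotide sequence out of a read. This is only a sketch: the names `Fragment` and `extract` are hypothetical, and it assumes that nibble numbers and offsets are zero-based, which the text does not specify.

```python
from typing import NamedTuple, Sequence


class Fragment(NamedTuple):
    n: int  # nibble number (assumed zero-based)
    o: int  # offset in nibble n to the start of the fragment (assumed zero-based)
    l: int  # length of the fragment in nucleotides


def extract(nibbles: Sequence[str], fragment: Fragment) -> str:
    """Return the nucleotides a fragment covers in a list of nibble sequences."""
    sequence = nibbles[fragment.n]
    return sequence[fragment.o:fragment.o + fragment.l]
```

With nibbles `["ACGTACGT", "TTGCA"]`, the fragment `Fragment(n=1, o=1, l=3)` selects three nucleotides from the second nibble starting at its second base.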

##Barcode set β##

Each b in the barcode set β is a pair { s, t } where:

s is a word over the alphabet α of some length l
t is an ordered set of fragments whose total concatenated length is l
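
The constraint tying s to t can be checked mechanically. The sketch below is one possible encoding under assumed names (`Fragment`, `Barcode`, `is_valid`); none of them appear in the original text.

```python
from typing import NamedTuple, Tuple

ALPHABET = frozenset("TKHYGCWVADSMBRN")  # the alphabet α


class Fragment(NamedTuple):
    n: int  # nibble number
    o: int  # offset in nibble n
    l: int  # length in nucleotides


class Barcode(NamedTuple):
    s: str                   # word over the alphabet α
    t: Tuple[Fragment, ...]  # ordered fragments


def is_valid(barcode: Barcode) -> bool:
    """Check that s is a word over α and that the fragment lengths sum to len(s)."""
    over_alphabet = all(symbol in ALPHABET for symbol in barcode.s)
    total_length = sum(fragment.l for fragment in barcode.t)
    return over_alphabet and total_length == len(barcode.s)
```

A barcode with s = `"ACGTNN"` and fragments of lengths 4 and 2 passes the check, while one whose fragment lengths do not add up to the length of s fails it.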

##Read and Nibble##

Each read r in R is a set of nibbles. Each nibble is a nucleotide sequence with corresponding Phred quality scores.

##Quality scores##

Let's assume quality scores are encoded in the Illumina 1.8 Phred+33 format, so each value is a single ASCII character. To get the Phred score we take the ordinal of the character and then subtract 33 from it.

Phred is -10 * log base 10 of p, where p is the probability of an error.

To get p we take 10 ^ -(Phred / 10).

For instance:

Ordinal('+') = 43, Phred = 43 - 33 = 10, p = 10 ^ -1 = 0.1
Ordinal('5') = 53, Phred = 53 - 33 = 20, p = 10 ^ -2 = 0.01
Ordinal(';') = 59, Phred = 59 - 33 = 26, p = 10 ^ -2.6 = 0.00251188643151
Ordinal('A') = 65, Phred = 65 - 33 = 32, p = 10 ^ -3.2 = 0.00063095734448
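
The same arithmetic can be sketched in a few lines of Python. The function names `phred_score` and `error_probability` are illustrative, and the default offset of 33 reflects the Phred+33 assumption made above.

```python
def phred_score(character: str, offset: int = 33) -> int:
    """Decode one ASCII quality character into its Phred score."""
    return ord(character) - offset


def error_probability(character: str, offset: int = 33) -> float:
    """Convert one ASCII quality character into the error probability p."""
    return 10 ** (-phred_score(character, offset) / 10)


if __name__ == "__main__":
    # Print ordinal, Phred score and error probability for a few characters.
    for character in "+5;A":
        print(character, ord(character), phred_score(character), error_probability(character))
```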

For each read r in R we can calculate the observed word for each barcode by concatenating the nucleotides covered by that barcode's fragments. This gives the vector Br, a list of card(β) words over α, and the vector Qr, a list of the corresponding quality scores, or error probabilities, for each base in each element of Br.
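
A rough sketch of how Br and Qr might be computed for a single read follows. It assumes zero-based nibble numbers and offsets, Phred+33 quality strings, and a read represented as a list of nibbles; the names `Nibble`, `Fragment`, `observe` and `observe_all` are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Fragment(NamedTuple):
    n: int  # nibble number (assumed zero-based)
    o: int  # offset in nibble n (assumed zero-based)
    l: int  # length in nucleotides


class Nibble(NamedTuple):
    sequence: str  # nucleotide sequence
    quality: str   # Phred+33 encoded quality string of the same length


def observe(read: Sequence[Nibble], fragments: Sequence[Fragment]) -> Tuple[str, List[float]]:
    """Concatenate the bases and per-base error probabilities covered by the fragments."""
    word = ""
    errors: List[float] = []
    for fragment in fragments:
        nibble = read[fragment.n]
        start, end = fragment.o, fragment.o + fragment.l
        word += nibble.sequence[start:end]
        errors.extend(10 ** (-(ord(c) - 33) / 10) for c in nibble.quality[start:end])
    return word, errors


def observe_all(read: Sequence[Nibble], barcodes: Sequence[Sequence[Fragment]]):
    """Return (Br, Qr): one observed word and one error list per barcode."""
    pairs = [observe(read, fragments) for fragments in barcodes]
    return [word for word, _ in pairs], [errors for _, errors in pairs]
```

Here each barcode is passed only as its fragment list t, since the expected word s is not needed to extract the observed word from the read.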

##Score##

For each read and each barcode we calculate the score of the barcode for that read.
