##Alphabet α##
Let's consider the most general case of IUPAC encoding. Our alphabet α is:
α = { T, K, H, Y, G, C, W, V, A, D, S, M, B, R, N }
symbol | reverse complement | options | full name |
---|---|---|---|
T | A | T | Thymine |
K | M | G, T | Keto |
H | D | A, C, T | Not G |
Y | R | T, C | Pyrimidine |
G | C | G | Guanine |
C | G | C | Cytosine |
W | W | A, T | Weak bonds |
V | B | G, C, A | Not T |
A | T | A | Adenine |
D | H | G, A, T | Not C |
S | S | G, C | Strong bonds |
M | K | A, C | Amino |
B | V | G, T, C | Not A |
R | Y | G, A | Purine |
N | N | A, G, C, T | Any |
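
The table above can be written down as two lookup tables. The following is a minimal sketch, not the actual implementation: the names `IUPAC_OPTIONS`, `IUPAC_COMPLEMENT`, `reverse_complement` and `is_compatible` are hypothetical and chosen only for illustration.

```python
# IUPAC symbol -> concrete nucleotides it can stand for ("options" column).
IUPAC_OPTIONS = {
    "T": "T", "K": "GT", "H": "ACT", "Y": "TC", "G": "G",
    "C": "C", "W": "AT", "V": "GCA", "A": "A", "D": "GAT",
    "S": "GC", "M": "AC", "B": "GTC", "R": "GA", "N": "AGCT",
}

# IUPAC symbol -> complementary symbol ("reverse complement" column).
IUPAC_COMPLEMENT = {
    "T": "A", "K": "M", "H": "D", "Y": "R", "G": "C",
    "C": "G", "W": "W", "V": "B", "A": "T", "D": "H",
    "S": "S", "M": "K", "B": "V", "R": "Y", "N": "N",
}


def reverse_complement(word: str) -> str:
    """Reverse a word over α and complement every symbol."""
    return "".join(IUPAC_COMPLEMENT[symbol] for symbol in reversed(word))


def is_compatible(symbol: str, base: str) -> bool:
    """True if the concrete nucleotide `base` is one of the options of `symbol`."""
    return base in IUPAC_OPTIONS[symbol]
```

With these tables, `reverse_complement("ATKN")` reverses the word to `NKTA` and then complements each symbol, and `is_compatible("K", "G")` checks whether G is one of the options of K.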
##Fragment ε##
A fragment ε is defined as the triplet ε = { n, o, l } where:
- n is the nibble number
- o is the offset in nibble n to the start of the fragment
- l is the length of the fragment in nucleotides
##Barcode set β##
Each b in the barcode set β is a pair { s, t } where:
- s is a word over the alphabet α of some length l
- t is an ordered set of fragments whose total concatenated length is l
Each read r in R is a set of nibbles. Each nibble is a nucleotide sequence with corresponding Phred quality scores.
##Quality scores##
Let's assume the quality scores are encoded in Illumina 1.8 Phred+33, so each value is stored as an ASCII character. To get the Phred score we first take the ordinal of the character and then subtract 33 from it.
Phred is -10 * log base 10 of p, where p is the probability of an error.
To get p we take 10 ^ -(Phred / 10).
For instance:
Ordinal('+') = 43, Phred = 43 - 33 = 10, p = 10 ^ -1 = 0.1
Ordinal('5') = 53, Phred = 53 - 33 = 20, p = 10 ^ -2 = 0.01
Ordinal(';') = 59, Phred = 59 - 33 = 26, p = 10 ^ -2.6 = 0.00251188643151
Ordinal('A') = 65, Phred = 65 - 33 = 32, p = 10 ^ -3.2 = 0.00063095734448
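
The same arithmetic can be sketched in a few lines of Python. The function names `phred_from_char` and `error_probability` are hypothetical, and the default offset of 33 assumes the Phred+33 encoding described above.

```python
def phred_from_char(char: str, offset: int = 33) -> int:
    """Decode one ASCII quality character into a Phred score (Phred+33 assumed)."""
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Convert a Phred score into the probability of an error: p = 10 ^ -(Phred / 10)."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    # Reproduce the worked examples: character, ordinal, Phred score, p.
    for char in "+5;A":
        phred = phred_from_char(char)
        print(char, ord(char), phred, error_probability(phred))
```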
For each read r in R we can calculate the observed word for each barcode. This gives the vector Br, a list of card(β) words over α, and the vector Qr, a list of the corresponding quality scores (or probabilities of error), one for each base of each element of Br.
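
As an illustration, Br and Qr could be built as in the sketch below. Everything here is an assumption made for the example: the type names `Fragment`, `Barcode` and `Nibble`, the function names `observe` and `observe_all`, and the choice of 0-based nibble numbers and offsets, none of which is fixed by the definitions above.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    n: int  # nibble number (assumed 0-based)
    o: int  # offset in nibble n to the start of the fragment (assumed 0-based)
    l: int  # length of the fragment in nucleotides


class Barcode(NamedTuple):
    s: str    # expected word over α
    t: tuple  # ordered fragments whose total length is len(s)


class Nibble(NamedTuple):
    sequence: str  # nucleotide sequence
    quality: str   # Phred+33 quality string, same length as sequence


def observe(read, barcode):
    """Concatenate the barcode's fragments from the read.

    Returns the observed word and one error probability per base.
    """
    word, errors = [], []
    for fragment in barcode.t:
        nibble = read[fragment.n]
        start, end = fragment.o, fragment.o + fragment.l
        word.append(nibble.sequence[start:end])
        errors.extend(10 ** (-(ord(c) - 33) / 10) for c in nibble.quality[start:end])
    return "".join(word), errors


def observe_all(read, barcodes):
    """Build Br (observed words) and Qr (error probabilities), one entry per barcode."""
    pairs = [observe(read, barcode) for barcode in barcodes]
    return [w for w, _ in pairs], [q for _, q in pairs]


if __name__ == "__main__":
    read = [Nibble("ACGTACGT", "AAAA5555"), Nibble("TTGCA", "+++++")]
    barcodes = [
        Barcode("ACGT", (Fragment(0, 0, 4),)),
        Barcode("CGTG", (Fragment(0, 5, 2), Fragment(1, 1, 2))),
    ]
    Br, Qr = observe_all(read, barcodes)
    print(Br)
    print(Qr)
```

The second barcode in the usage example is split across two nibbles, which is the case the ordered fragment list t is meant to cover.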
##Score##
For each read and each barcode we calculate the score of the barcode for that read.