Skip to content

Instantly share code, notes, and snippets.

@moonwatcher
Last active August 29, 2015 14:22
Show Gist options
  • Save moonwatcher/58520f407886f692270e to your computer and use it in GitHub Desktop.
Save moonwatcher/58520f407886f692270e to your computer and use it in GitHub Desktop.
iupac nucleic acid notation

##Alphabet α##

Lets consider the most general case of IUPAC encoding. Our alphabet α is:

α = { T, K, H, Y, G, C, W, V, A, D, S, M, B, R, N }

symbol reverse complement options full name
T A T Thymine
K M G, T Keto
H D A, C, T Not G
Y R T, C Pyrimidine
G C G Guanine
C G C Cytosine
W W A, T Weak bonds
V B G, C, A Not T
A T A Adenine
D H G, A, T Not C
S S G, C Strong bonds
M K A, C Amino
B V G, T, C Not A
R Y G, A Purine
N N A, G, C, T Any

##Fragment ε## A fragment ε is defined as the triplet ε = { n, o, l } where:

  • n is the nibble number
  • o is the offset in nibble n to the start of the fragment
  • l is the length of the fragment in nucleotides

##Barcode set β## Each b in the barcode set β is a pair { s, t } where:

  • s is a word over the alphabet α of some length l
  • t is an ordered set of fragments who's total concatenated length is l

Read and Nibble

Each read r in R is a set of nibbles. Each nibble is a nucleotide sequence with corresponding Phred quality scores.

##Quality scores## Lets assume those are encoded in the Illumina 1.8 Phred+33 so the value is encoded in ASCII. To get the Phred score we first get the ordinal of the character and than remove 33 from it.

Phred is -10 * log base 10 of p, where p is the probability of an error.

To get p we take 10 ^ -(Phred / 10).

for instance:

Ordinal('+') = 43, Phred = 43 - 33 = 10, p = 10 ^ -1 = 0.1
Ordinal('5') = 53, Phred = 53 - 33 = 20, p = 10 ^ -2 = 0.01
Ordinal(';') = 53, Phred = 59 - 33 = 26, p = 10 ^ -2.6 = 0.00251188643151
Ordinal('A') = 65, Phred = 65 - 33 = 32, p = 10 ^ -3.2 = 0.00063095734448

For each read r in R we can calculate the word for each barcode. So we get the vector Br which is a list of words over α of length card(β), and the vector Qr which is a list of corresponding quality scores, or probabilities of error for each base in each element of the vector Br.

##Score## For each read and each barcode we calculate the score of the barcode for that read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment