k-mer encoding and decoding

Functions:

# encoding: ACTG

def nuc2int_nochecking(b):
    return (ord(b) >> 1) & 3, True
    
def nuc2int_if(b):
    if b == 'a' or b == 'c' or b == 'g' or b == 't' \
        or b == 'A' or b == 'C' or b == 'G' or b == 'T':
        return (ord(b) >> 1) & 3, True
    return 0, False

def nuc2int_in(b):
    if b in 'actgACTG':
        return (ord(b) >> 1) & 3, True
    return 0, False

def nuc2int_bitwise(b):
    d = ord(b) - 65
    if d <= 51 and 0x8004500080045 & (1 << d) > 0:
        return (ord(b) >> 1) & 3, True
    return 0, False

# m  = [4 for x in range(256)]
# for b in 'acgtACGT': m[ord(b)] = (ord(b)>>1)&3
m = (4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 0, 4, 1, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 0, 4, 1, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
    4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)

 def nuc2int_array(b):
    v = m[ord(b)]
    if v < 4:
        return v, True
    return 0, False

Test:

import random
seq = [random.choice('ACTGactg') for x in range(150)]

r1 = [x[0] for x in [nuc2int_nochecking(b) for b in seq]]
r2 = [x[0] for x in [nuc2int_if(b) for b in seq]]
r3 = [x[0] for x in [nuc2int_in(b) for b in seq]]
r4 = [x[0] for x in [nuc2int_bitwise(b) for b in seq]]
r5 = [x[0] for x in [nuc2int_array(b) for b in seq]]

r1 == r2 and  r2 == r3 and r3 == r4  and r4 == r5
True

Benchmark:

import random
seq = [random.choice('ACTGactg') for x in range(150)]


timeit([nuc2int_nochecking(b) for b in seq])
21.3 µs ± 340 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

timeit([nuc2int_if(b) for b in seq])
31.4 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

timeit([nuc2int_in(b) for b in seq])
24.8 µs ± 591 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

timeit([nuc2int_bitwise(b) for b in seq])
40 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

timeit([nuc2int_array(b) for b in seq])
18.3 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Speed:

array mapping    bitwise op: (ord(b) >> 1) & 3
-------------    --------------------------------------------------
array         >  no_checking   >  index    >   if      > bitwise
                 -------------   ---------   ---------   ----------
                 no base check   str index   if/switch   bit tricks

Benchmark results in Golang, the ranks are different.

The magic number 0x8004500080045:

# https://twitter.com/wDimD/status/1550508801018793986

In [1]: for x in 'ACGTacgt': print("{:64b} {}".format(1 << (ord(x) - 65), x))
                                                               1 A
                                                             100 C
                                                         1000000 G
                                            10000000000000000000 T
                               100000000000000000000000000000000 a
                             10000000000000000000000000000000000 c
                         100000000000000000000000000000000000000 g
            1000000000000000000000000000000000000000000000000000 t


In [2]: print("{:b}".format(0x0008004500080045))
1000000000000100010100000000000010000000000001000101

In [3]: print("{:x}".format(sum(1 << (ord(x) - 65) for x in 'ACGTacgt')))
8004500080045

char  decimal  binary    ACGT      ACTG
A     65       01000001  00^00=00  00
C     67       01000011  00^01=01  01
G     71       01000111  01^11=10  11
T     84       01010100  01^10=11  10

toCodeACTG: (c >> 1) & 3
toCharACTG: (1196704577 >> (x << 3)) & 255
complementACTG: x ^ 2

toCodeACGT: ( (c >> 2) ^ (c >> 1) ) & 3    or   3 & (c >> 2 ^ c >> 1)
toCharACGT: (1413956417 >> (x << 3)) & 255
complementACGT: x ^ 3

Where 1196704577 is:

   G        T        C        A
0b 01000111 01010100 01000011 01000001

The logic is very simple: the static constant acts as the array of 4 nucleotides and the code is just indexing the array. ---- Giulio Ermanno Pibiri, https://twitter.com/giulio_pibiri/status/1550224922621927426

Note that the encoding methods does not handle non-ACGT bases.

In my test (Go), (1413956417 >> (x << 3)) & 255 is slower than a direct mapping with slice/arrary/list bit2byte(x).

Link: https://twitter.com/shenwei356/status/1550157513433452544

shenwei356/benchmark-encoding.md