Functions:
# encoding: ACTG
def nuc2int_nochecking(b):
return (ord(b) >> 1) & 3, True
def nuc2int_if(b):
if b == 'a' or b == 'c' or b == 'g' or b == 't' \
or b == 'A' or b == 'C' or b == 'G' or b == 'T':
return (ord(b) >> 1) & 3, True
return 0, False
def nuc2int_in(b):
if b in 'actgACTG':
return (ord(b) >> 1) & 3, True
return 0, False
def nuc2int_bitwise(b):
d = ord(b) - 65
if d <= 51 and 0x8004500080045 & (1 << d) > 0:
return (ord(b) >> 1) & 3, True
return 0, False
# m = [4 for x in range(256)]
# for b in 'acgtACGT': m[ord(b)] = (ord(b)>>1)&3
m = (4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 0, 4, 1, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 0, 4, 1, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
def nuc2int_array(b):
v = m[ord(b)]
if v < 4:
return v, True
return 0, False
Test:
import random
seq = [random.choice('ACTGactg') for x in range(150)]
r1 = [x[0] for x in [nuc2int_nochecking(b) for b in seq]]
r2 = [x[0] for x in [nuc2int_if(b) for b in seq]]
r3 = [x[0] for x in [nuc2int_in(b) for b in seq]]
r4 = [x[0] for x in [nuc2int_bitwise(b) for b in seq]]
r5 = [x[0] for x in [nuc2int_array(b) for b in seq]]
r1 == r2 and r2 == r3 and r3 == r4 and r4 == r5
True
Benchmark:
import random
seq = [random.choice('ACTGactg') for x in range(150)]
timeit([nuc2int_nochecking(b) for b in seq])
21.3 µs ± 340 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
timeit([nuc2int_if(b) for b in seq])
31.4 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
timeit([nuc2int_in(b) for b in seq])
24.8 µs ± 591 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
timeit([nuc2int_bitwise(b) for b in seq])
40 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
timeit([nuc2int_array(b) for b in seq])
18.3 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Speed:
array mapping bitwise op: (ord(b) >> 1) & 3
------------- --------------------------------------------------
array > no_checking > index > if > bitwise
------------- --------- --------- ----------
no base check str index if/switch bit tricks
Benchmark results in Golang, the ranks are different.
The magic number 0x8004500080045
:
# https://twitter.com/wDimD/status/1550508801018793986
In [1]: for x in 'ACGTacgt': print("{:64b} {}".format(1 << (ord(x) - 65), x))
1 A
100 C
1000000 G
10000000000000000000 T
100000000000000000000000000000000 a
10000000000000000000000000000000000 c
100000000000000000000000000000000000000 g
1000000000000000000000000000000000000000000000000000 t
In [2]: print("{:b}".format(0x0008004500080045))
1000000000000100010100000000000010000000000001000101
In [3]: print("{:x}".format(sum(1 << (ord(x) - 65) for x in 'ACGTacgt')))
8004500080045