Comparing https://github.com/marshallpierce/rust-base64/tree/perf-optimization to https://github.com/aklomp/base64.
All benchmarks were run on an i7-6850K.
f17906e 4 bytes at a time, read individually
test decode_100b ... bench: 154 ns/iter (+/- 0) = 649 MB/s
test decode_100b_reuse_buf ... bench: 128 ns/iter (+/- 0) = 781 MB/s
test decode_10mib ... bench: 14,392,694 ns/iter (+/- 177,189) = 728 MB/s
test decode_10mib_reuse_buf ... bench: 12,921,534 ns/iter (+/- 708,616) = 811 MB/s
test decode_30mib ... bench: 43,486,014 ns/iter (+/- 477,785) = 723 MB/s
test decode_30mib_reuse_buf ... bench: 38,836,887 ns/iter (+/- 42,624) = 809 MB/s
test decode_3b ... bench: 37 ns/iter (+/- 0) = 108 MB/s
test decode_3b_reuse_buf ... bench: 14 ns/iter (+/- 0) = 285 MB/s
test decode_3kib ... bench: 3,708 ns/iter (+/- 5) = 828 MB/s
test decode_3kib_reuse_buf ... bench: 3,665 ns/iter (+/- 8) = 838 MB/s
test decode_3mib ... bench: 4,196,733 ns/iter (+/- 36,054) = 749 MB/s
test decode_3mib_reuse_buf ... bench: 3,806,653 ns/iter (+/- 10,710) = 826 MB/s
test decode_500b ... bench: 626 ns/iter (+/- 1) = 798 MB/s
test decode_500b_reuse_buf ... bench: 608 ns/iter (+/- 1) = 822 MB/s
test decode_50b ... bench: 160 ns/iter (+/- 68) = 325 MB/s
test decode_50b_reuse_buf ... bench: 66 ns/iter (+/- 0) = 787 MB/s
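The MB/s column can be reproduced from the ns/iter column. A minimal sketch of that arithmetic follows. It assumes, inferred from the numbers above rather than stated anywhere in these notes, that the byte count is the length of the encoded input and that the result is truncated to an integer. `mb_per_sec` is a hypothetical helper name.

```rust
/// Throughput in MB/s (10^6 bytes per second), truncated to an integer.
/// bytes / ns is bytes per nanosecond, i.e. GB/s, so scale by 1000 for MB/s.
/// Assumption: `bytes` is the number of encoded input bytes per iteration.
fn mb_per_sec(bytes: u64, ns_per_iter: u64) -> u64 {
    // Guard against a zero divisor for extremely fast iterations.
    bytes * 1000 / ns_per_iter.max(1)
}
```

Under that assumption decode_100b (154 ns) works out to 100 * 1000 / 154 = 649 MB/s, and decode_3b (37 ns) matches 108 MB/s only with 4 encoded bytes, which suggests the size in each benchmark name refers to the encoded length.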
4d6d81a naive aklomp-style loop accumulating bytes in a u64
test decode_100b ... bench: 232 ns/iter (+/- 6) = 431 MB/s
test decode_100b_reuse_buf ... bench: 205 ns/iter (+/- 3) = 487 MB/s
test decode_10mib ... bench: 23,344,214 ns/iter (+/- 896,645) = 449 MB/s
test decode_10mib_reuse_buf ... bench: 21,912,356 ns/iter (+/- 11,511,975) = 478 MB/s
test decode_30mib ... bench: 69,619,614 ns/iter (+/- 3,055,922) = 451 MB/s
test decode_30mib_reuse_buf ... bench: 65,532,939 ns/iter (+/- 309,728) = 480 MB/s
test decode_3b ... bench: 37 ns/iter (+/- 29) = 108 MB/s
test decode_3b_reuse_buf ... bench: 27 ns/iter (+/- 1) = 148 MB/s
test decode_3kib ... bench: 6,578 ns/iter (+/- 327) = 467 MB/s
test decode_3kib_reuse_buf ... bench: 6,041 ns/iter (+/- 67) = 508 MB/s
test decode_3mib ... bench: 6,782,024 ns/iter (+/- 134,704) = 463 MB/s
test decode_3mib_reuse_buf ... bench: 6,462,232 ns/iter (+/- 3,234,497) = 486 MB/s
test decode_500b ... bench: 1,018 ns/iter (+/- 36) = 491 MB/s
test decode_500b_reuse_buf ... bench: 1,040 ns/iter (+/- 15) = 480 MB/s
test decode_50b ... bench: 134 ns/iter (+/- 1) = 388 MB/s
test decode_50b_reuse_buf ... bench: 109 ns/iter (+/- 3) = 477 MB/s
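For reference, a rough sketch of what "accumulating bytes in a u64" can look like: 8 base64 symbols carry 48 bits, which fit in one u64 and come out as 6 bytes. This is not the code from 4d6d81a; `decode_table` and `decode_chunks` are hypothetical names, the alphabet is assumed to be the standard one, and padding and trailing partial chunks are ignored.

```rust
const INVALID: u8 = 0xFF;

/// Build a 256-entry lookup table for the standard base64 alphabet.
fn decode_table() -> [u8; 256] {
    let alphabet = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut table = [INVALID; 256];
    for (value, &symbol) in alphabet.iter().enumerate() {
        table[symbol as usize] = value as u8;
    }
    table
}

/// Decode unpadded input in chunks of 8 symbols; any trailing partial chunk is skipped.
/// On failure, returns the index of the first invalid input byte.
fn decode_chunks(input: &[u8]) -> Result<Vec<u8>, usize> {
    let table = decode_table();
    let mut out = Vec::with_capacity(input.len() / 8 * 6);
    for (chunk_index, chunk) in input.chunks_exact(8).enumerate() {
        let mut accum: u64 = 0;
        for (i, &b) in chunk.iter().enumerate() {
            let v = table[b as usize];
            if v == INVALID {
                return Err(chunk_index * 8 + i);
            }
            // Shift in 6 bits per symbol; after 8 symbols the low 48 bits are full.
            accum = (accum << 6) | v as u64;
        }
        // Emit the 48 bits as 6 bytes, most significant first, one push() each.
        for shift in (0..6).rev() {
            out.push((accum >> (shift * 8)) as u8);
        }
    }
    Ok(out)
}
```

The per-byte push() and the error check inside the inner loop are the parts the later commit descriptions appear to target.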
8361c1e use byteorder to write instead of push()
test decode_100b ... bench: 189 ns/iter (+/- 8) = 529 MB/s
test decode_100b_reuse_buf ... bench: 167 ns/iter (+/- 2) = 598 MB/s
test decode_10mib ... bench: 19,623,929 ns/iter (+/- 2,479,538) = 534 MB/s
test decode_10mib_reuse_buf ... bench: 17,921,543 ns/iter (+/- 1,501,785) = 585 MB/s
test decode_30mib ... bench: 58,865,367 ns/iter (+/- 162,679) = 534 MB/s
test decode_30mib_reuse_buf ... bench: 53,721,337 ns/iter (+/- 95,730) = 585 MB/s
test decode_3b ... bench: 37 ns/iter (+/- 0) = 108 MB/s
test decode_3b_reuse_buf ... bench: 15 ns/iter (+/- 0) = 266 MB/s
test decode_3kib ... bench: 5,140 ns/iter (+/- 11) = 597 MB/s
test decode_3kib_reuse_buf ... bench: 5,120 ns/iter (+/- 8) = 600 MB/s
test decode_3mib ... bench: 5,708,706 ns/iter (+/- 53,053) = 551 MB/s
test decode_3mib_reuse_buf ... bench: 5,289,426 ns/iter (+/- 12,308) = 594 MB/s
test decode_500b ... bench: 861 ns/iter (+/- 4) = 580 MB/s
test decode_500b_reuse_buf ... bench: 841 ns/iter (+/- 1) = 594 MB/s
test decode_50b ... bench: 114 ns/iter (+/- 0) = 456 MB/s
test decode_50b_reuse_buf ... bench: 93 ns/iter (+/- 0) = 559 MB/s
f3c8891 write via mutable slice
test decode_100b ... bench: 113 ns/iter (+/- 4) = 884 MB/s
test decode_100b_reuse_buf ... bench: 96 ns/iter (+/- 9) = 1041 MB/s
test decode_10mib ... bench: 10,833,452 ns/iter (+/- 248,055) = 967 MB/s
test decode_10mib_reuse_buf ... bench: 9,396,666 ns/iter (+/- 107,273) = 1115 MB/s
test decode_30mib ... bench: 32,653,938 ns/iter (+/- 351,150) = 963 MB/s
test decode_30mib_reuse_buf ... bench: 28,486,029 ns/iter (+/- 133,783) = 1104 MB/s
test decode_3b ... bench: 41 ns/iter (+/- 10) = 97 MB/s
test decode_3b_reuse_buf ... bench: 19 ns/iter (+/- 0) = 210 MB/s
test decode_3kib ... bench: 2,623 ns/iter (+/- 11) = 1171 MB/s
test decode_3kib_reuse_buf ... bench: 2,589 ns/iter (+/- 15) = 1186 MB/s
test decode_3mib ... bench: 3,153,039 ns/iter (+/- 80,266) = 997 MB/s
test decode_3mib_reuse_buf ... bench: 2,725,100 ns/iter (+/- 14,198) = 1154 MB/s
test decode_500b ... bench: 447 ns/iter (+/- 1) = 1118 MB/s
test decode_500b_reuse_buf ... bench: 432 ns/iter (+/- 1) = 1157 MB/s
test decode_50b ... bench: 78 ns/iter (+/- 0) = 666 MB/s
test decode_50b_reuse_buf ... bench: 58 ns/iter (+/- 0) = 896 MB/s
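A sketch of the idea behind the last two steps (writing whole chunks rather than push()ing single bytes, then writing through a mutable slice). It assumes the output is sized up front and each 48-bit accumulator is copied in as 6 bytes. The commits use the byteorder crate; this sketch substitutes the standard library's `u64::to_be_bytes`, and `write_chunks` is a hypothetical name.

```rust
/// Append the low 48 bits of each accumulator to `out` as 6 big-endian bytes.
fn write_chunks(accums: &[u64], out: &mut Vec<u8>) {
    let start = out.len();
    // Grow once, then fill through a mutable slice instead of calling push() per byte.
    out.resize(start + accums.len() * 6, 0);
    let dest = &mut out[start..];
    for (slot, &accum) in dest.chunks_exact_mut(6).zip(accums) {
        // to_be_bytes() yields 8 bytes; the top two carry no data and are skipped.
        slot.copy_from_slice(&accum.to_be_bytes()[2..]);
    }
}
```

Sizing the buffer once means the length and capacity bookkeeping happens per call rather than per output byte, which is one plausible reason for the jump in the numbers above.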
87d62a9 read chunk via read_u64
test decode_100b ... bench: 106 ns/iter (+/- 1) = 943 MB/s
test decode_100b_reuse_buf ... bench: 86 ns/iter (+/- 1) = 1162 MB/s
test decode_10mib ... bench: 9,420,025 ns/iter (+/- 505,041) = 1113 MB/s
test decode_10mib_reuse_buf ... bench: 8,051,450 ns/iter (+/- 66,643) = 1302 MB/s
test decode_30mib ... bench: 28,706,956 ns/iter (+/- 101,182) = 1095 MB/s
test decode_30mib_reuse_buf ... bench: 24,469,885 ns/iter (+/- 110,068) = 1285 MB/s
test decode_3b ... bench: 41 ns/iter (+/- 0) = 97 MB/s
test decode_3b_reuse_buf ... bench: 18 ns/iter (+/- 1) = 222 MB/s
test decode_3kib ... bench: 2,300 ns/iter (+/- 11) = 1335 MB/s
test decode_3kib_reuse_buf ... bench: 2,183 ns/iter (+/- 10) = 1407 MB/s
test decode_3mib ... bench: 2,738,919 ns/iter (+/- 21,591) = 1148 MB/s
test decode_3mib_reuse_buf ... bench: 2,321,102 ns/iter (+/- 8,093) = 1355 MB/s
test decode_500b ... bench: 385 ns/iter (+/- 0) = 1298 MB/s
test decode_500b_reuse_buf ... bench: 368 ns/iter (+/- 1) = 1358 MB/s
test decode_50b ... bench: 71 ns/iter (+/- 0) = 732 MB/s
test decode_50b_reuse_buf ... bench: 50 ns/iter (+/- 0) = 1040 MB/s
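Roughly what reading a chunk as a single u64 looks like, as opposed to loading 8 input bytes one at a time. This sketch uses the standard library's `u64::from_be_bytes` as a stand-in for the byteorder call named in the commit description; `read_chunk` and `byte_at` are hypothetical names.

```rust
/// Load the first 8 input bytes as one big-endian u64, or None if fewer than 8 remain.
fn read_chunk(input: &[u8]) -> Option<u64> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(input.get(..8)?);
    Some(u64::from_be_bytes(bytes))
}

/// Extract input byte `i` (0 = first byte of the chunk) from the loaded word.
fn byte_at(chunk: u64, i: u32) -> u8 {
    (chunk >> (56 - 8 * i)) as u8
}
```

Each symbol is then pulled out with a shift and fed to the decode table, so the chunk costs one bounds check and one load rather than eight.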
3c9bc2b Move error return outside of loop for another 10%
test decode_100b ... bench: 105 ns/iter (+/- 0) = 952 MB/s
test decode_100b_reuse_buf ... bench: 83 ns/iter (+/- 0) = 1204 MB/s
test decode_10mib ... bench: 9,114,564 ns/iter (+/- 91,280) = 1150 MB/s
test decode_10mib_reuse_buf ... bench: 7,723,483 ns/iter (+/- 64,869) = 1357 MB/s
test decode_30mib ... bench: 27,708,742 ns/iter (+/- 183,745) = 1135 MB/s
test decode_30mib_reuse_buf ... bench: 23,482,819 ns/iter (+/- 35,089) = 1339 MB/s
test decode_3b ... bench: 41 ns/iter (+/- 0) = 97 MB/s
test decode_3b_reuse_buf ... bench: 18 ns/iter (+/- 1) = 222 MB/s
test decode_3kib ... bench: 2,108 ns/iter (+/- 4) = 1457 MB/s
test decode_3kib_reuse_buf ... bench: 2,146 ns/iter (+/- 12) = 1431 MB/s
test decode_3mib ... bench: 2,641,237 ns/iter (+/- 24,619) = 1191 MB/s
test decode_3mib_reuse_buf ... bench: 2,225,719 ns/iter (+/- 13,955) = 1413 MB/s
test decode_500b ... bench: 372 ns/iter (+/- 1) = 1344 MB/s
test decode_500b_reuse_buf ... bench: 353 ns/iter (+/- 0) = 1416 MB/s
test decode_50b ... bench: 68 ns/iter (+/- 1) = 764 MB/s
test decode_50b_reuse_buf ... bench: 49 ns/iter (+/- 0) = 1061 MB/s
dfb2864 Calculate error byte at error time rather than writing to a local that's read later.
test decode_100b ... bench: 96 ns/iter (+/- 5) = 1041 MB/s
test decode_100b_reuse_buf ... bench: 75 ns/iter (+/- 1) = 1333 MB/s
test decode_10mib ... bench: 8,754,671 ns/iter (+/- 274,348) = 1197 MB/s
test decode_10mib_reuse_buf ... bench: 7,334,820 ns/iter (+/- 129,159) = 1429 MB/s
test decode_30mib ... bench: 26,617,411 ns/iter (+/- 622,087) = 1181 MB/s
test decode_30mib_reuse_buf ... bench: 22,387,310 ns/iter (+/- 182,809) = 1405 MB/s
test decode_3b ... bench: 42 ns/iter (+/- 2) = 95 MB/s
test decode_3b_reuse_buf ... bench: 19 ns/iter (+/- 1) = 210 MB/s
test decode_3kib ... bench: 2,004 ns/iter (+/- 37) = 1532 MB/s
test decode_3kib_reuse_buf ... bench: 1,976 ns/iter (+/- 40) = 1554 MB/s
test decode_3mib ... bench: 2,536,762 ns/iter (+/- 217,774) = 1240 MB/s
test decode_3mib_reuse_buf ... bench: 2,114,590 ns/iter (+/- 23,871) = 1487 MB/s
test decode_500b ... bench: 352 ns/iter (+/- 10) = 1420 MB/s
test decode_500b_reuse_buf ... bench: 341 ns/iter (+/- 11) = 1466 MB/s
test decode_50b ... bench: 67 ns/iter (+/- 0) = 776 MB/s
test decode_50b_reuse_buf ... bench: 46 ns/iter (+/- 0) = 1130 MB/s
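The last two steps (3c9bc2b and dfb2864) both take error handling off the hot path. A sketch of one way to do that, not the actual commit contents: OR every table result into a flag, check the flag once after the inner loop, and only then work out which byte was bad. It relies on valid 6-bit values never setting the top two bits while the invalid marker does. `decode_table` and `decode_chunk` are hypothetical names.

```rust
const INVALID: u8 = 0xFF;

/// Build a 256-entry lookup table for the standard base64 alphabet.
fn decode_table() -> [u8; 256] {
    let alphabet = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut table = [INVALID; 256];
    for (value, &symbol) in alphabet.iter().enumerate() {
        table[symbol as usize] = value as u8;
    }
    table
}

/// Decode one 8-symbol chunk into the low 48 bits of a u64.
/// On failure, returns the offset of the first invalid byte within the chunk.
fn decode_chunk(table: &[u8; 256], chunk: &[u8; 8]) -> Result<u64, usize> {
    let mut accum: u64 = 0;
    let mut seen: u8 = 0;
    for &b in chunk.iter() {
        let v = table[b as usize];
        // No branch here: just remember whether any lookup had high bits set.
        seen |= v;
        accum = (accum << 6) | (v & 0x3F) as u64;
    }
    if (seen & 0xC0) != 0 {
        // Slow path: the offending position is computed only when an error occurred.
        let pos = chunk
            .iter()
            .position(|&b| table[b as usize] == INVALID)
            .expect("flag set implies an invalid byte");
        return Err(pos);
    }
    Ok(accum)
}
```

The successful path carries no error state beyond one OR per symbol. The cost of rescanning the chunk is paid only on invalid input.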
aklomp base64 in C (we only care about plain for now) with gcc 4.9.4:
% make -C test benchmark && ./test/benchmark | grep -E '(buffer|plain)'
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 10 * 1
plain encode 1870.24 MB/sec
plain decode 1788.04 MB/sec
Testing with buffer size 1 MB, fastest of 10 * 10
plain encode 1883.38 MB/sec
plain decode 1801.25 MB/sec
Testing with buffer size 100 KB, fastest of 10 * 100
plain encode 1884.75 MB/sec
plain decode 1799.54 MB/sec
Testing with buffer size 10 KB, fastest of 100 * 100
plain encode 1880.95 MB/sec
plain decode 1799.46 MB/sec
Testing with buffer size 1 KB, fastest of 100 * 1000
plain encode 1762.49 MB/sec
plain decode 1751.79 MB/sec
C compiled with clang 3.7.1:
Filling buffer with 10.0 MB of random data...
Testing with buffer size 10 MB, fastest of 10 * 1
plain encode 1594.81 MB/sec
plain decode 1595.16 MB/sec
Testing with buffer size 1 MB, fastest of 10 * 10
plain encode 1601.13 MB/sec
plain decode 1603.28 MB/sec
Testing with buffer size 100 KB, fastest of 10 * 100
plain encode 1600.60 MB/sec
plain decode 1602.10 MB/sec
Testing with buffer size 10 KB, fastest of 100 * 100
plain encode 1603.04 MB/sec
plain decode 1602.35 MB/sec
Testing with buffer size 1 KB, fastest of 100 * 1000
plain encode 1512.35 MB/sec
plain decode 1565.88 MB/sec