This gist shares an experiment with an alternative implementation of the JUMPDEST analysis.
Some quick notes:
- The linked branch passes all `go test ./...` tests.
- I'm not opening a PR since I'm not yet sure the results are convincing (I explain why below).
The main idea is to minimize the number of bitwise operations by switching the bit vector's element type from `byte` to `uint64`. Using `byte` means that a `PUSHN` with `N` greater than 8 has to span multiple bytes, thus requiring "internal loops". Using `uint64` means having enough room to resolve any `PUSHN` in a single write, with at most one overflow write, and without today's `2^16+2^8` decomposition (splitting the run into 16- and 8-bit writes).
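To make this concrete, below is a minimal sketch of the idea. The `bitvec64` type and the method name are mine, not necessarily the branch's exact code; it assumes one bitmap bit per code byte, as in the current analysis:

```go
package vm

// bitvec64 is a hypothetical uint64-backed bit vector, one bit per code byte.
type bitvec64 []uint64

// setN marks n consecutive bits starting at bit position pos. A PUSH carries
// at most 32 data bytes (n <= 32) and a word holds 64 bits, so the run spans
// at most two words: one unconditional OR plus at most one overflow OR.
func (v bitvec64) setN(pos, n uint64) {
	shift := pos % 64
	run := uint64(1)<<n - 1 // n low bits set
	v[pos/64] |= run << shift
	if shift+n > 64 {
		v[pos/64+1] |= run >> (64 - shift) // spillover into the next word
	}
}
```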
The benchmark comparison is the following (note the unit is `GB/s`, so `+XX%` is a speedup):
name old speed new speed delta
JumpdestAnalysis_1200k-16 2.02GB/s ± 2% 2.04GB/s ± 0% +0.70% (p=0.015 n=10+10)
JumpdestHashing_1200k-16 415MB/s ± 0% 416MB/s ± 0% ~ (p=0.249 n=9+9)
JumpdestOpAnalysis/PUSH1-16 1.66GB/s ± 1% 1.19GB/s ± 0% -28.59% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH2-16 1.78GB/s ± 1% 1.90GB/s ± 2% +6.57% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH3-16 2.37GB/s ± 1% 2.02GB/s ± 1% -14.69% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH4-16 2.36GB/s ± 0% 2.61GB/s ± 0% +10.36% (p=0.000 n=8+9)
JumpdestOpAnalysis/PUSH5-16 3.45GB/s ± 0% 2.34GB/s ± 1% -32.33% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH6-16 3.62GB/s ± 1% 3.55GB/s ± 0% -1.95% (p=0.000 n=9+8)
JumpdestOpAnalysis/PUSH7-16 4.13GB/s ± 0% 3.52GB/s ± 0% -14.72% (p=0.000 n=9+8)
JumpdestOpAnalysis/PUSH8-16 3.39GB/s ± 0% 3.17GB/s ± 1% -6.62% (p=0.000 n=8+10)
JumpdestOpAnalysis/PUSH9-16 4.07GB/s ± 1% 3.49GB/s ± 0% -14.33% (p=0.000 n=10+8)
JumpdestOpAnalysis/PUSH10-16 3.49GB/s ± 1% 3.82GB/s ± 1% +9.33% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH11-16 3.82GB/s ± 0% 4.53GB/s ± 1% +18.64% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH12-16 3.61GB/s ± 1% 4.65GB/s ± 1% +28.90% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH13-16 4.37GB/s ± 1% 4.78GB/s ± 1% +9.32% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH14-16 4.41GB/s ± 0% 4.85GB/s ± 2% +9.93% (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH15-16 4.68GB/s ± 1% 6.03GB/s ± 1% +28.78% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH16-16 5.82GB/s ± 1% 5.26GB/s ± 2% -9.59% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH17-16 6.71GB/s ± 0% 5.80GB/s ± 1% -13.51% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH18-16 5.59GB/s ± 0% 6.06GB/s ± 0% +8.49% (p=0.000 n=8+9)
JumpdestOpAnalysis/PUSH19-16 5.86GB/s ± 1% 6.95GB/s ± 1% +18.64% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH20-16 5.44GB/s ± 1% 6.90GB/s ± 1% +26.79% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH21-16 6.37GB/s ± 0% 7.03GB/s ± 1% +10.35% (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH22-16 6.26GB/s ± 1% 7.56GB/s ± 1% +20.72% (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH23-16 6.55GB/s ± 1% 8.22GB/s ± 1% +25.57% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH24-16 6.82GB/s ± 1% 7.27GB/s ± 1% +6.57% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH25-16 6.68GB/s ± 1% 8.07GB/s ± 1% +20.85% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH26-16 6.52GB/s ± 0% 8.01GB/s ± 1% +22.91% (p=0.000 n=8+10)
JumpdestOpAnalysis/PUSH27-16 6.73GB/s ± 1% 9.25GB/s ± 1% +37.43% (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH28-16 6.31GB/s ± 0% 8.77GB/s ± 1% +38.92% (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH29-16 7.15GB/s ± 0% 7.97GB/s ± 1% +11.50% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH30-16 7.02GB/s ± 1% 9.26GB/s ± 0% +32.02% (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH31-16 7.28GB/s ± 0% 11.04GB/s ± 1% +51.66% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH32-16 7.93GB/s ± 1% 9.72GB/s ± 0% +22.55% (p=0.000 n=10+9)
JumpdestOpAnalysis/JUMPDEST-16 2.08GB/s ± 0% 2.03GB/s ± 0% -2.66% (p=0.000 n=9+10)
JumpdestOpAnalysis/STOP-16 2.08GB/s ± 0% 2.03GB/s ± 1% -2.59% (p=0.000 n=10+9)
The benchmark shows mostly what we'd expect. For "big `PUSHN`s" we get a speedup, since we can set all the right bits in the vector with a single bitwise operation. But the surprise comes with `PUSH1` (and `PUSH5`/`PUSH9`), where we take a performance hit.
I dislike the `PUSH1` regression the most because it's an important instruction used in practice. `PUSH20` is another important instruction, and it gets a very nice performance improvement.
I looked into the `PUSH1` case in more detail:
- The current implementation does the same `setBit1(...)` style, so a single OR operation.
- I compared the assembly generated for `master` and for this branch, and confirmed that `setBit1` (and also the other method calls) were inlined.
- Looking at the assembly, the Go compiler prefers an `AND+BTSL+MOVB` sequence for the `master` case and a `BTSQ+MOVQ` sequence for this branch (see the sketch after this list).
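For reference, these are the two single-bit writes being compared. The function names are hypothetical (the real code is a method on the bit-vector type), and the exact instruction selection can vary across Go versions:

```go
// set1Byte ORs one bit into a byte-backed vector; on amd64 the compiler
// emitted an AND+BTSL+MOVB sequence for this in my runs.
func set1Byte(bits []byte, pos uint64) {
	bits[pos/8] |= 1 << (pos % 8)
}

// set1Word ORs one bit into a uint64-backed vector; on amd64 the compiler
// emitted a BTSQ+MOVQ sequence for this in my runs.
func set1Word(bits []uint64, pos uint64) {
	bits[pos/64] |= 1 << (pos % 64)
}
```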
Regarding that last point, the conclusion seems to be that an OR setting a specific bit of a `uint64` can be slower than one setting a bit of a `byte`. I'm not entirely convinced this justifies a ~30% hit, but comparing the assembly outputs I see no difference other than the word-size instructions, and setting a bit in a `uint64` can't really be done faster than this.
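One way to isolate this cost outside the full analysis loop would be a micro-benchmark over the two helpers sketched above (again hypothetical code, not something in the branch):

```go
package vm

import "testing"

// Micro-benchmarks for the single-bit OR on the two backing types, cycling
// through positions so both aligned and unaligned bits occur.
func BenchmarkSet1Byte(b *testing.B) {
	bits := make([]byte, 4096/8)
	for i := 0; i < b.N; i++ {
		set1Byte(bits, uint64(i)%4096)
	}
}

func BenchmarkSet1Word(b *testing.B) {
	bits := make([]uint64, 4096/64)
	for i := 0; i < b.N; i++ {
		set1Word(bits, uint64(i)%4096)
	}
}
```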
The other `PUSH5`/`PUSH9` cases are probably more related to the way the benchmarks are constructed, where the bytecode is a stream of back-to-back `PUSHN` instructions: how `N` aligns with the 64-bit words can make a real difference.
Look at how well `PUSH31` performs: each `PUSH31` occupies 1 opcode byte + 31 data bytes, i.e. 32 bits in the bitmap, so `2x(PUSH31+31 bytes)` fills exactly one `uint64` with no overflow writes. The toy model below makes this alignment effect visible for every `N`.
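Here's that toy model (my own code, not from the repo), assuming the synthetic benchmark is an uninterrupted stream of identical `PUSHN` instructions; it counts how often a data run crosses a 64-bit word boundary, i.e. how often the overflow write is needed:

```go
package main

import "fmt"

func main() {
	// For each PUSH data length n, count how many of 64 consecutive,
	// back-to-back PUSHn instructions have a data run crossing a 64-bit
	// word boundary (each crossing costs the extra overflow OR).
	for n := uint64(1); n <= 32; n++ {
		crossings := 0
		for i := uint64(0); i < 64; i++ {
			start := i*(n+1) + 1 // the data starts right after the opcode byte
			if start%64+n > 64 {
				crossings++
			}
		}
		fmt.Printf("PUSH%-2d: %2d/64 data runs cross a word boundary\n", n, crossings)
	}
}
```

Under this model a `PUSH31` stream never crosses a word boundary, which matches the measurement above.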
I've done other experiments, like using a vector of `uint32` instead of `uint64`, which would keep the same benefit since a `PUSH32` would still touch at most two elements (see the sketch below). The results were a bit worse than expected, but I mainly wanted to see whether the `uint32` layout could make the `PUSH1` case faster, which wasn't the case.
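For completeness, the `uint32` variant of the earlier sketch (same hypothetical naming; only the word size changes):

```go
// bitvec32 is the uint32 variant: a run of up to 32 bits still touches at
// most two elements, but crosses element boundaries more often than uint64.
type bitvec32 []uint32

func (v bitvec32) setN(pos, n uint64) {
	shift := uint32(pos % 32)
	run := uint32(uint64(1)<<n - 1) // n low bits set (n <= 32)
	v[pos/32] |= run << shift
	if uint64(shift)+n > 32 {
		v[pos/32+1] |= run >> (32 - shift)
	}
}
```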
Despite the majority of cases (and `PUSH20` in particular) getting a really good speedup, I'm not entirely convinced the `PUSH1` regression is acceptable. I don't have the intuition to say it can be justified by what we gain in the other cases (that's a more nuanced discussion).
To be fair, the current benchmarks (which I left untouched) are very synthetic and not representative of real bytecode. This also explains why the results aren't smoother: how the `N` in `PUSHN` aligns with 64-bit words matters a lot here. But that argument can be a slippery slope!