This gist shares an experiment with an alternative implementation of the JUMPDEST analysis.
Some quick notes:
- The linked branch passes all `go test ./...` tests.
- I'm not opening a PR since I'm not yet sure the results are convincing (I explain why below).
The main idea is to minimize the number of bitwise operations by switching the bit vector's element type from `byte` to `uint64`. Using `byte` means that a `PUSHN` with `N` greater than 8 has to span multiple bytes, thus requiring "internal loops". Using `uint64` means having enough room to resolve any `PUSHN` in a single write, with at most one overflow write, and without today's `2^16+2^8` decomposition (splitting the run into 16- and 8-bit writes).
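To make this concrete, below is a minimal sketch of the idea. The `bitvec64` type and the method name are mine, not necessarily the branch's exact code; it assumes one bitmap bit per code byte, as in the current analysis:

```go
package vm

// bitvec64 is a hypothetical uint64-backed bit vector, one bit per code byte.
type bitvec64 []uint64

// setN marks n consecutive bits starting at bit position pos. A PUSH carries
// at most 32 data bytes (n <= 32) and a word holds 64 bits, so the run spans
// at most two words: one unconditional OR plus at most one overflow OR.
func (v bitvec64) setN(pos, n uint64) {
	shift := pos % 64
	run := uint64(1)<<n - 1 // n low bits set
	v[pos/64] |= run << shift
	if shift+n > 64 {
		v[pos/64+1] |= run >> (64 - shift) // spillover into the next word
	}
}
```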
The benchmark comparison is the following (note the unit is `GB/s`, so `+XX%` is a speedup):
name old speed new speed delta
JumpdestAnalysis_1200k-16 2.02GB/s ± 2% 2.04GB/s ± 0% +0.70% (p=0.015 n=10+10)
JumpdestHashing_1200k-16 415MB/s ± 0% 416MB/s ± 0% ~ (p=0.249 n=9+9)
JumpdestOpAnalysis/PUSH1-16 1.66GB/s ± 1% 1.19GB/s ± 0% -28.59% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH2-16 1.78GB/s ± 1% 1.90GB/s ± 2% +6.57% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH3-16 2.37GB/s ± 1% 2.02GB/s ± 1% -14.69% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH4-16 2.36GB/s ± 0% 2.61GB/s ± 0% +10.36% (p=0.000 n=8+9)
JumpdestOpAnalysis/PUSH5-16 3.45GB/s ± 0% 2.34GB/s ± 1% -32.33% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH6-16 3.62GB/s ± 1% 3.55GB/s ± 0% -1.95% (p=0.000 n=9+8)
JumpdestOpAnalysis/PUSH7-16 4.13GB/s ± 0% 3.52GB/s ± 0% -14.72% (p=0.000 n=9+8)
JumpdestOpAnalysis/PUSH8-16 3.39GB/s ± 0% 3.17GB/s ± 1% -6.62% (p=0.000 n=8+10)
JumpdestOpAnalysis/PUSH9-16 4.07GB/s ± 1% 3.49GB/s ± 0% -14.33% (p=0.000 n=10+8)
JumpdestOpAnalysis/PUSH10-16 3.49GB/s ± 1% 3.82GB/s ± 1% +9.33% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH11-16 3.82GB/s ± 0% 4.53GB/s ± 1% +18.64% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH12-16 3.61GB/s ± 1% 4.65GB/s ± 1% +28.90% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH13-16 4.37GB/s ± 1% 4.78GB/s ± 1% +9.32% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH14-16 4.41GB/s ± 0% 4.85GB/s ± 2% +9.93% (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH15-16 4.68GB/s ± 1% 6.03GB/s ± 1% +28.78% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH16-16 5.82GB/s ± 1% 5.26GB/s ± 2% -9.59% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH17-16 6.71GB/s ± 0% 5.80GB/s ± 1% -13.51% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH18-16 5.59GB/s ± 0% 6.06GB/s ± 0% +8.49% (p=0.000 n=8+9)
JumpdestOpAnalysis/PUSH19-16 5.86GB/s ± 1% 6.95GB/s ± 1% +18.64% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH20-16 5.44GB/s ± 1% 6.90GB/s ± 1% +26.79% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH21-16 6.37GB/s ± 0% 7.03GB/s ± 1% +10.35% (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH22-16 6.26GB/s ± 1% 7.56GB/s ± 1% +20.72% (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH23-16 6.55GB/s ± 1% 8.22GB/s ± 1% +25.57% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH24-16 6.82GB/s ± 1% 7.27GB/s ± 1% +6.57% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH25-16 6.68GB/s ± 1% 8.07GB/s ± 1% +20.85% (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH26-16 6.52GB/s ± 0% 8.01GB/s ± 1% +22.91% (p=0.000 n=8+10)
JumpdestOpAnalysis/PUSH27-16 6.73GB/s ± 1% 9.25GB/s ± 1% +37.43% (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH28-16 6.31GB/s ± 0% 8.77GB/s ± 1% +38.92% (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH29-16 7.15GB/s ± 0% 7.97GB/s ± 1% +11.50% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH30-16 7.02GB/s ± 1% 9.26GB/s ± 0% +32.02% (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH31-16 7.28GB/s ± 0% 11.04GB/s ± 1% +51.66% (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH32-16 7.93GB/s ± 1% 9.72GB/s ± 0% +22.55% (p=0.000 n=10+9)
JumpdestOpAnalysis/JUMPDEST-16 2.08GB/s ± 0% 2.03GB/s ± 0% -2.66% (p=0.000 n=9+10)
JumpdestOpAnalysis/STOP-16 2.08GB/s ± 0% 2.03GB/s ± 1% -2.59% (p=0.000 n=10+9)
The benchmark shows mostly what we'd expect. For "big `PUSHN`s" we get a speedup, since we can set all the right bits in the vector with a single bitwise operation. But the surprise comes with `PUSH1` (and `PUSH5`/`PUSH9`), where we take a performance hit.
I dislike the `PUSH1` regression the most because it's an important instruction used in practice. `PUSH20` is another important instruction, and it gets a very nice performance improvement.
I looked into the `PUSH1` case in more detail:
- The current implementation does the same `setBit1(...)` style, so a single OR operation.
- I compared the assembly generated for `master` and for this branch, and confirmed that `setBit1` (and also the other method calls) were inlined.
- Looking at the assembly, the Go compiler prefers an `AND+BTSL+MOVB` sequence for the `master` case and a `BTSQ+MOVQ` sequence for this branch (see the sketch after this list).
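For reference, these are the two single-bit writes being compared. The function names are hypothetical (the real code is a method on the bit-vector type), and the exact instruction selection can vary across Go versions:

```go
// set1Byte ORs one bit into a byte-backed vector; on amd64 the compiler
// emitted an AND+BTSL+MOVB sequence for this in my runs.
func set1Byte(bits []byte, pos uint64) {
	bits[pos/8] |= 1 << (pos % 8)
}

// set1Word ORs one bit into a uint64-backed vector; on amd64 the compiler
// emitted a BTSQ+MOVQ sequence for this in my runs.
func set1Word(bits []uint64, pos uint64) {
	bits[pos/64] |= 1 << (pos % 64)
}
```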
Regarding that last point, the conclusion seems to be that an OR setting a specific bit of a `uint64` can be slower than one setting a bit of a `byte`. I'm not entirely convinced this justifies a ~30% hit, but comparing the assembly outputs I see no difference other than the word-size instructions, and setting a bit in a `uint64` can't really be done faster than this.
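One way to isolate this cost outside the full analysis loop would be a micro-benchmark over the two helpers sketched above (again hypothetical code, not something in the branch):

```go
package vm

import "testing"

// Micro-benchmarks for the single-bit OR on the two backing types, cycling
// through positions so both aligned and unaligned bits occur.
func BenchmarkSet1Byte(b *testing.B) {
	bits := make([]byte, 4096/8)
	for i := 0; i < b.N; i++ {
		set1Byte(bits, uint64(i)%4096)
	}
}

func BenchmarkSet1Word(b *testing.B) {
	bits := make([]uint64, 4096/64)
	for i := 0; i < b.N; i++ {
		set1Word(bits, uint64(i)%4096)
	}
}
```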
The other `PUSH5`/`PUSH9` cases are probably more related to the way the benchmarks are constructed, where the bytecode is a stream of back-to-back `PUSHN` instructions: how `N` aligns with the 64-bit words can make a real difference.
Look at how well `PUSH31` performs: each `PUSH31` occupies 1 opcode byte + 31 data bytes, i.e. 32 bits in the bitmap, so `2x(PUSH31+31 bytes)` fills exactly one `uint64` with no overflow writes. The toy model below makes this alignment effect visible for every `N`.
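Here's that toy model (my own code, not from the repo), assuming the synthetic benchmark is an uninterrupted stream of identical `PUSHN` instructions; it counts how often a data run crosses a 64-bit word boundary, i.e. how often the overflow write is needed:

```go
package main

import "fmt"

func main() {
	// For each PUSH data length n, count how many of 64 consecutive,
	// back-to-back PUSHn instructions have a data run crossing a 64-bit
	// word boundary (each crossing costs the extra overflow OR).
	for n := uint64(1); n <= 32; n++ {
		crossings := 0
		for i := uint64(0); i < 64; i++ {
			start := i*(n+1) + 1 // the data starts right after the opcode byte
			if start%64+n > 64 {
				crossings++
			}
		}
		fmt.Printf("PUSH%-2d: %2d/64 data runs cross a word boundary\n", n, crossings)
	}
}
```

Under this model a `PUSH31` stream never crosses a word boundary, which matches the measurement above.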
I've done other experiments, like using a vector of `uint32` instead of `uint64`, which would keep the same benefit since a `PUSH32` would still touch at most two elements (see the sketch below). The results were a bit worse than expected, but I mainly wanted to see whether the `uint32` layout could make the `PUSH1` case faster, which wasn't the case.
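For completeness, the `uint32` variant of the earlier sketch (same hypothetical naming; only the word size changes):

```go
// bitvec32 is the uint32 variant: a run of up to 32 bits still touches at
// most two elements, but crosses element boundaries more often than uint64.
type bitvec32 []uint32

func (v bitvec32) setN(pos, n uint64) {
	shift := uint32(pos % 32)
	run := uint32(uint64(1)<<n - 1) // n low bits set (n <= 32)
	v[pos/32] |= run << shift
	if uint64(shift)+n > 32 {
		v[pos/32+1] |= run >> (32 - shift)
	}
}
```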
Despite the majority of cases (and `PUSH20` in particular) getting a really good speedup, I'm not entirely convinced the `PUSH1` regression is acceptable. I don't have the intuition to say it can be justified by what we gain in the other cases (that's a more nuanced discussion).
To be fair, the current benchmarks (which I left untouched) are very synthetic and not representative of real bytecode. This also explains why the results aren't smoother: how the `N` in `PUSHN` aligns with 64-bit words matters a lot here. But that argument can be a slippery slope!