A proof of concept implementation can be found at https://github.com/trifectatechfoundation/rust/tree/labeled-match. Build it with ./x build
, and then set up the toolchain. Now cargo +stage1 build
should use a compiler with labeled-match
available.
git clone https://github.com/trifectatechfoundation/zlib-rs.git
git checkout len-as-match
sh replicate-labeled-match-benchmarks.sh
this runs 4 benchmarks
- baseline: the current zlib-rs main branch approach using tail calls
- loop-plus-match: standard approach using a loop and match; suffers from branch misprediction
- labeled-match-len: the
len
function and friends now use labeled match - labeled-match-fast: the
len
and friends, andinflate_fast_help
functions now use labeld match
The benchmark is run for various chunk sizes (2 to the power 4, 7 and 16), which varies what logic is run: a chunk size of 2^16 spends most time in an inner loop, while 2^4 spends much more time in the state machine logic.
Mostly what we see is that labeled match gives significant speedups for small chunk sizes. For larger chunk sizes, the results are less clear. I believe really the result is net-zero, but we need clearly need to perform some further tuning.
Note in particular how loop-plus-match
works well for small and big inputs, but terribly for medium inputs. The labeled-match-fast
change barely seems to do anything versus just labeled-match-len
.
Benchmark 1 (69 runs): /tmp/uncompress-baseline rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 72.7ms ± 4.55ms 71.1ms … 108ms 3 ( 4%) 0%
peak_rss 24.1MB ± 56.3KB 23.9MB … 24.1MB 12 (17%) 0%
cpu_cycles 294M ± 17.7M 290M … 434M 6 ( 9%) 0%
instructions 914M ± 449 914M … 914M 1 ( 1%) 0%
cache_references 2.99M ± 407K 2.68M … 6.10M 4 ( 6%) 0%
cache_misses 134K ± 23.8K 101K … 302K 2 ( 3%) 0%
branch_misses 4.09M ± 8.56K 4.08M … 4.14M 2 ( 3%) 0%
Benchmark 2 (71 runs): /tmp/loop-plus-match rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 70.6ms ± 499us 69.9ms … 72.9ms 3 ( 4%) ⚡- 2.8% ± 1.5%
peak_rss 24.1MB ± 60.2KB 24.0MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
cpu_cycles 287M ± 1.53M 285M … 294M 5 ( 7%) ⚡- 2.7% ± 1.4%
instructions 792M ± 336 792M … 792M 0 ( 0%) ⚡- 13.4% ± 0.0%
cache_references 2.92M ± 81.8K 2.77M … 3.12M 0 ( 0%) - 2.2% ± 3.2%
cache_misses 113K ± 12.4K 89.9K … 169K 3 ( 4%) ⚡- 16.0% ± 4.7%
branch_misses 4.10M ± 4.76K 4.09M … 4.13M 3 ( 4%) + 0.2% ± 0.1%
Benchmark 3 (80 runs): /tmp/labeled-match-len rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 62.6ms ± 555us 61.7ms … 66.1ms 2 ( 3%) ⚡- 14.0% ± 1.4%
peak_rss 24.1MB ± 77.9KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
cpu_cycles 249M ± 1.87M 248M … 263M 5 ( 6%) ⚡- 15.4% ± 1.3%
instructions 686M ± 267 686M … 686M 0 ( 0%) ⚡- 24.9% ± 0.0%
cache_references 3.01M ± 480K 2.76M … 7.16M 2 ( 3%) + 0.5% ± 4.8%
cache_misses 123K ± 6.92K 100K … 137K 2 ( 3%) ⚡- 8.4% ± 4.1%
branch_misses 4.08M ± 2.78K 4.08M … 4.09M 4 ( 5%) - 0.1% ± 0.0%
Benchmark 4 (81 runs): /tmp/labeled-match-fast rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 61.9ms ± 468us 61.3ms … 64.0ms 6 ( 7%) ⚡- 14.8% ± 1.4%
peak_rss 24.1MB ± 56.9KB 24.0MB … 24.1MB 20 (25%) - 0.0% ± 0.1%
cpu_cycles 246M ± 1.53M 245M … 255M 12 (15%) ⚡- 16.4% ± 1.3%
instructions 689M ± 365 689M … 689M 1 ( 1%) ⚡- 24.6% ± 0.0%
cache_references 2.97M ± 211K 2.79M … 4.37M 3 ( 4%) - 0.8% ± 3.4%
cache_misses 91.4K ± 9.22K 72.0K … 114K 0 ( 0%) ⚡- 31.8% ± 4.2%
branch_misses 4.08M ± 4.06K 4.08M … 4.10M 1 ( 1%) - 0.1% ± 0.1%
Benchmark 1 (108 runs): /tmp/uncompress-baseline rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 46.3ms ± 1.37ms 45.0ms … 56.5ms 10 ( 9%) 0%
peak_rss 24.1MB ± 56.3KB 24.0MB … 24.1MB 26 (24%) 0%
cpu_cycles 174M ± 4.67M 173M … 214M 10 ( 9%) 0%
instructions 516M ± 443 516M … 516M 3 ( 3%) 0%
cache_references 3.16M ± 219K 2.88M … 4.43M 6 ( 6%) 0%
cache_misses 83.6K ± 11.6K 62.4K … 161K 5 ( 5%) 0%
branch_misses 2.00M ± 5.17K 2.00M … 2.04M 9 ( 8%) 0%
Benchmark 2 (78 runs): /tmp/loop-plus-match rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 64.5ms ± 1.29ms 63.6ms … 71.3ms 6 ( 8%) 💩+ 39.3% ± 0.8%
peak_rss 24.1MB ± 72.6KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
cpu_cycles 257M ± 3.65M 255M … 279M 12 (15%) 💩+ 47.4% ± 0.7%
instructions 720M ± 384 720M … 720M 1 ( 1%) 💩+ 39.7% ± 0.0%
cache_references 3.21M ± 182K 2.97M … 4.32M 3 ( 4%) + 1.8% ± 1.9%
cache_misses 57.9K ± 8.34K 47.0K … 98.8K 4 ( 5%) ⚡- 30.7% ± 3.6%
branch_misses 2.00M ± 2.44K 2.00M … 2.01M 5 ( 6%) - 0.2% ± 0.1%
Benchmark 3 (111 runs): /tmp/labeled-match-len rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 45.2ms ± 781us 44.2ms … 51.3ms 5 ( 5%) ⚡- 2.5% ± 0.6%
peak_rss 24.1MB ± 66.5KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
cpu_cycles 170M ± 3.34M 168M … 199M 9 ( 8%) ⚡- 2.8% ± 0.6%
instructions 510M ± 359 510M … 510M 1 ( 1%) ⚡- 1.0% ± 0.0%
cache_references 3.21M ± 178K 2.97M … 4.55M 6 ( 5%) + 1.7% ± 1.7%
cache_misses 31.2K ± 4.33K 25.0K … 52.0K 6 ( 5%) ⚡- 62.7% ± 2.8%
branch_misses 1.99M ± 1.35K 1.99M … 2.00M 5 ( 5%) - 0.4% ± 0.0%
Benchmark 4 (111 runs): /tmp/labeled-match-fast rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 45.1ms ± 454us 44.3ms … 46.8ms 5 ( 5%) ⚡- 2.5% ± 0.6%
peak_rss 24.1MB ± 73.0KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
cpu_cycles 169M ± 1.27M 168M … 176M 10 ( 9%) ⚡- 3.0% ± 0.5%
instructions 515M ± 295 515M … 515M 0 ( 0%) - 0.1% ± 0.0%
cache_references 3.20M ± 81.1K 2.97M … 3.38M 0 ( 0%) + 1.5% ± 1.4%
cache_misses 47.5K ± 8.09K 36.6K … 72.1K 1 ( 1%) ⚡- 43.2% ± 3.2%
branch_misses 2.00M ± 1.86K 1.99M … 2.00M 2 ( 2%) - 0.2% ± 0.1%
Benchmark 1 (182 runs): /tmp/uncompress-baseline rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.5ms ± 474us 26.7ms … 31.4ms 7 ( 4%) 0%
peak_rss 24.1MB ± 48.1KB 24.0MB … 24.1MB 29 (16%) 0%
cpu_cycles 90.0M ± 1.27M 89.4M … 102M 17 ( 9%) 0%
instructions 239M ± 253 239M … 239M 3 ( 2%) 0%
cache_references 2.28M ± 57.3K 2.20M … 2.78M 5 ( 3%) 0%
cache_misses 48.4K ± 2.85K 43.3K … 68.4K 5 ( 3%) 0%
branch_misses 1.05M ± 1.62K 1.05M … 1.06M 2 ( 1%) 0%
Benchmark 2 (186 runs): /tmp/loop-plus-match rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 26.9ms ± 335us 26.3ms … 29.5ms 4 ( 2%) ⚡- 2.4% ± 0.3%
peak_rss 24.1MB ± 63.3KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.0%
cpu_cycles 87.2M ± 732K 86.8M … 94.7M 20 (11%) ⚡- 3.1% ± 0.2%
instructions 248M ± 262 248M … 248M 0 ( 0%) 💩+ 3.8% ± 0.0%
cache_references 2.26M ± 85.8K 2.18M … 2.97M 5 ( 3%) - 0.8% ± 0.7%
cache_misses 52.0K ± 2.13K 47.6K … 67.0K 5 ( 3%) 💩+ 7.4% ± 1.1%
branch_misses 1.05M ± 1.58K 1.04M … 1.05M 2 ( 1%) - 0.5% ± 0.0%
Benchmark 3 (182 runs): /tmp/labeled-match-len rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.4ms ± 407us 26.7ms … 31.6ms 3 ( 2%) - 0.4% ± 0.3%
peak_rss 24.1MB ± 64.3KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.0%
cpu_cycles 89.3M ± 985K 88.8M … 101M 13 ( 7%) - 0.8% ± 0.3%
instructions 254M ± 326 254M … 254M 2 ( 1%) 💩+ 6.1% ± 0.0%
cache_references 2.26M ± 68.6K 2.18M … 2.75M 3 ( 2%) - 0.9% ± 0.6%
cache_misses 55.1K ± 2.58K 50.5K … 68.0K 4 ( 2%) 💩+ 13.8% ± 1.2%
branch_misses 1.05M ± 2.09K 1.04M … 1.05M 0 ( 0%) - 0.4% ± 0.0%
Benchmark 4 (182 runs): /tmp/labeled-match-fast rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.5ms ± 703us 26.8ms … 33.7ms 6 ( 3%) + 0.0% ± 0.4%
peak_rss 24.1MB ± 62.8KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.0%
cpu_cycles 89.8M ± 1.93M 89.1M … 108M 20 (11%) - 0.3% ± 0.4%
instructions 253M ± 257 253M … 253M 1 ( 1%) 💩+ 5.7% ± 0.0%
cache_references 2.26M ± 101K 2.18M … 3.05M 7 ( 4%) - 1.1% ± 0.7%
cache_misses 50.2K ± 2.61K 45.6K … 66.8K 5 ( 3%) 💩+ 3.8% ± 1.2%
branch_misses 1.05M ± 1.14K 1.05M … 1.06M 2 ( 1%) - 0.2% ± 0.0%
On a AMD Ryzen 7 3700X 8-Core Processor: