@nmoinvaz
Last active March 7, 2026 00:02
zlib-ng: CRC32 ARMv8 PMULL+EOR3 copy optimization

Summary

Replace memcpy calls in the CRC32+copy interleaved path with direct NEON stores (vst1q_u64) of already-loaded vectors, and direct scalar stores of already-loaded uint64_t values. This eliminates redundant load/store sequences that the compiler generated for memcpy when the source data was already in registers.

Additionally, reorder the vector loop so that stores happen before eor3 operations, reducing register pressure and allowing the compiler to better interleave stores with PMULL multiplies.
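The idea of storing an already-loaded register, rather than calling memcpy on the source again, can be sketched in a few lines of C. This is a minimal illustration and not the zlib-ng code: `xor_copy` is a hypothetical function, and a plain XOR accumulator stands in for the PMULL/EOR3 folding. The NEON path is compiled only where `__ARM_NEON` is defined; elsewhere the scalar loop handles everything. The store is placed ahead of the XOR to mirror the store-before-eor3 ordering described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical stand-in for the CRC fold: copies src to dst and returns the
 * XOR of all 64-bit words (plus any trailing bytes) of src. */
uint64_t xor_copy(uint8_t *dst, const uint8_t *src, size_t len) {
    uint64_t acc = 0;
    size_t i = 0;

    assert(len == 0 || (dst != NULL && src != NULL));

#if defined(__ARM_NEON)
    uint64x2_t vacc = vdupq_n_u64(0);
    for (; i + 16 <= len; i += 16) {
        /* A single load brings 16 source bytes into a vector register. */
        uint64x2_t v = vld1q_u64((const uint64_t *)(src + i));
        /* Store that same register first, instead of
         * memcpy(dst + i, src + i, 16), which may load the bytes again. */
        vst1q_u64((uint64_t *)(dst + i), v);
        /* Only then use the value for the (stand-in) checksum work. */
        vacc = veorq_u64(vacc, v);
    }
    acc = vgetq_lane_u64(vacc, 0) ^ vgetq_lane_u64(vacc, 1);
#endif

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, sizeof w);   /* unaligned-safe 8-byte load */
        memcpy(dst + i, &w, sizeof w);   /* store the loaded value, not src again */
        acc ^= w;
    }

    for (; i < len; i++) {               /* trailing bytes */
        dst[i] = src[i];
        acc ^= src[i];
    }
    return acc;
}
```

The pointer casts passed to `vld1q_u64` and `vst1q_u64` follow common NEON practice but assume the target tolerates unaligned vector accesses. Whether the compiler actually emits fewer loads depends on the compiler and optimization level, so the generated assembly is worth checking.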

Machine specs

  • CPU: Apple M3 (8 cores)
  • RAM: 24 GB
  • OS: macOS (Darwin 24.6.0), arm64
  • Compiler: Apple clang 17.0.0 (clang-1700.6.4.2)
  • Build: CMake Release, static library

Assembly impact

  • Baseline: 519 instructions (crc32_copy_armv8_pmull_eor3)
  • Optimized: 508 instructions (−2.1%)
  • Hot loop: eliminated 9 redundant 16-byte loads from the memcpy codegen; stores now interleaved with PMULL operations for better scheduling.

Benchmark results (crc32_copy, 5 repetitions)

Comparing baseline to optimized
Benchmark                                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------
crc32_copy/armv8_pmull_eor3/32_median                           +0.0560         +0.0561             6             6             6             6
crc32_copy/armv8_pmull_eor3/512_median                          -0.0323         -0.0325            37            36            37            36
crc32_copy/armv8_pmull_eor3/8192_median                         -0.1461         -0.1466           193           165           192           164
crc32_copy/armv8_pmull_eor3/32768_median                        -0.1989         -0.1993           591           474           589           472
crc32_copy/armv8_pmull_eor3/65536_median                        -0.2163         -0.2189          1119           877          1115           871
crc32_copy/armv8_pmull_eor3_aligned/32_median                   +0.0006         +0.0002             4             4             4             4
crc32_copy/armv8_pmull_eor3_aligned/512_median                  -0.0379         -0.0366            36            34            36            34
crc32_copy/armv8_pmull_eor3_aligned/8192_median                 -0.1497         -0.1494           190           162           190           161
crc32_copy/armv8_pmull_eor3_aligned/32768_median                -0.1992         -0.1990           588           471           586           469
crc32_copy/armv8_pmull_eor3_aligned/65536_median                -0.2265         -0.2275          1116           863          1112           859
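
The Time and CPU columns read as relative changes between the old and new measurements. Assuming the table was produced by Google Benchmark's comparison tooling (the column layout suggests this, but the gist does not say so), each value is (new - old) / old, so negative numbers mean the optimized build is faster. A small sketch of that arithmetic, with `relative_change` as a hypothetical helper name:

```c
#include <assert.h>

/* Relative change between two timings: negative means new is faster.
 * Assumes old_time is positive, as benchmark timings are. */
double relative_change(double old_time, double new_time) {
    assert(old_time > 0.0);
    return (new_time - old_time) / old_time;
}
```

For the unaligned 65536-byte row, `relative_change(1115, 871)` gives about -0.219, which agrees with the CPU column's -0.2189 to within the rounding of the displayed times.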

Key takeaways

Size (bytes)    Speedup (CPU median)
------------------------------------
512             ~3%
8K              ~15%
32K             ~20%
64K             ~22%

No regression on the non-copy CRC32 path (COPY=0 codepath is unaffected). All 1314 CRC32 gtests pass.
