zlib-ng: CRC32 ARMv8 PMULL+EOR3 copy optimization

Summary

Replace memcpy calls in the CRC32+copy interleaved path with direct NEON stores (vst1q_u64) of already-loaded vectors, and direct scalar stores of already-loaded uint64_t values. This eliminates redundant load/store sequences that the compiler generated for memcpy when the source data was already in registers.
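The difference between the two copy strategies can be sketched as follows. This is a minimal illustration, not the zlib-ng source: the function names copy_fold_memcpy and copy_fold_direct are hypothetical, the XOR accumulator is only a stand-in consumer for the loaded data (the real path folds with PMULL/EOR3), and the non-AArch64 branch exists only so the sketch builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Before (illustrative): the data is loaded for the checksum, and memcpy
 * then reads the same source bytes a second time to perform the copy. */
static uint64_t copy_fold_memcpy(uint64_t *dst, const uint64_t *src, size_t n_words) {
    assert(n_words % 2 == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n_words; i += 2) {
        uint64_t lo = src[i];
        uint64_t hi = src[i + 1];
        memcpy(dst + i, src + i, 2 * sizeof(uint64_t)); /* re-reads src */
        acc ^= lo ^ hi; /* stand-in for the CRC fold, not the real math */
    }
    return acc;
}

/* After (illustrative): the value already held in a register is stored
 * directly, so the source is read only once per 16 bytes. */
static uint64_t copy_fold_direct(uint64_t *dst, const uint64_t *src, size_t n_words) {
    assert(n_words % 2 == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n_words; i += 2) {
#if defined(__aarch64__)
        uint64x2_t v = vld1q_u64(src + i);
        vst1q_u64(dst + i, v); /* store the loaded vector itself */
        acc ^= vgetq_lane_u64(v, 0) ^ vgetq_lane_u64(v, 1);
#else
        /* Portable fallback so the sketch compiles off AArch64. */
        uint64_t lo = src[i];
        uint64_t hi = src[i + 1];
        dst[i] = lo; /* direct scalar stores of already-loaded values */
        dst[i + 1] = hi;
        acc ^= lo ^ hi;
#endif
    }
    return acc;
}
```

Both functions produce the same copy and the same accumulator value; only the number of source reads differs. Whether a compiler merges the duplicate read on its own depends on the surrounding code, which is why the change spells the store out explicitly.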

Additionally, reorder the vector loop so that stores happen before eor3 operations, reducing register pressure and allowing the compiler to better interleave stores with PMULL multiplies.
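The store-before-combine ordering can be illustrated with a small loop. This is a sketch under stated assumptions: copy_xor3_store_first is a hypothetical name, the loop handles two vectors per iteration rather than the real unroll factor, and two chained veorq_u64 calls stand in for a single EOR3 so that the example needs neither the SHA3 nor the PMULL extension. Source order is only a hint to the compiler's scheduler, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Copies n_words 64-bit words from src to dst and returns the XOR of all
 * words. Each loaded vector is stored before it is combined into the
 * accumulator, so the store does not have to wait behind the XOR chain. */
static uint64_t copy_xor3_store_first(uint64_t *dst, const uint64_t *src, size_t n_words) {
    assert(n_words % 4 == 0);
#if defined(__aarch64__)
    uint64x2_t acc = vdupq_n_u64(0);
    for (size_t i = 0; i < n_words; i += 4) {
        uint64x2_t d0 = vld1q_u64(src + i);
        uint64x2_t d1 = vld1q_u64(src + i + 2);
        vst1q_u64(dst + i, d0);     /* stores issued first ... */
        vst1q_u64(dst + i + 2, d1);
        /* ... then the three-way XOR (stand-in for one EOR3). */
        acc = veorq_u64(veorq_u64(acc, d0), d1);
    }
    return vgetq_lane_u64(acc, 0) ^ vgetq_lane_u64(acc, 1);
#else
    /* Portable fallback with the same ordering, for non-AArch64 builds. */
    uint64_t acc = 0;
    for (size_t i = 0; i < n_words; i += 4) {
        uint64_t d0 = src[i], d1 = src[i + 1], d2 = src[i + 2], d3 = src[i + 3];
        dst[i] = d0;
        dst[i + 1] = d1;
        dst[i + 2] = d2;
        dst[i + 3] = d3;
        acc ^= d0 ^ d1 ^ d2 ^ d3;
    }
    return acc;
#endif
}
```

Once a vector has been stored, nothing but the combine step needs it, which is the intuition behind the reduced register pressure described above.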

Machine specs

CPU: Apple M3 (8 cores)
RAM: 24 GB
OS: macOS (Darwin 24.6.0), arm64
Compiler: Apple clang 17.0.0 (clang-1700.6.4.2)
Build: CMake Release, static library

Assembly impact

Baseline: 519 instructions (crc32_copy_armv8_pmull_eor3)
Optimized: 508 instructions (−2.1%)
Hot loop: eliminated 9 redundant 16-byte loads from the memcpy codegen; stores now interleaved with PMULL operations for better scheduling.

Benchmark results (crc32_copy, 5 repetitions)

Comparing baseline to optimized
Benchmark                                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------
crc32_copy/armv8_pmull_eor3/32_median                           +0.0560         +0.0561             6             6             6             6
crc32_copy/armv8_pmull_eor3/512_median                          -0.0323         -0.0325            37            36            37            36
crc32_copy/armv8_pmull_eor3/8192_median                         -0.1461         -0.1466           193           165           192           164
crc32_copy/armv8_pmull_eor3/32768_median                        -0.1989         -0.1993           591           474           589           472
crc32_copy/armv8_pmull_eor3/65536_median                        -0.2163         -0.2189          1119           877          1115           871
crc32_copy/armv8_pmull_eor3_aligned/32_median                   +0.0006         +0.0002             4             4             4             4
crc32_copy/armv8_pmull_eor3_aligned/512_median                  -0.0379         -0.0366            36            34            36            34
crc32_copy/armv8_pmull_eor3_aligned/8192_median                 -0.1497         -0.1494           190           162           190           161
crc32_copy/armv8_pmull_eor3_aligned/32768_median                -0.1992         -0.1990           588           471           586           469
crc32_copy/armv8_pmull_eor3_aligned/65536_median                -0.2265         -0.2275          1116           863          1112           859

Key takeaways

Size	Time reduction (CPU median)
512	~3%
8K	~15%
32K	~20%
64K	~22%
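The percentages above are read here as relative reductions in median CPU time, which matches the signed fractions in the comparison output if those are computed as (new - old) / old. A reduction in time is not the same number as a throughput speedup factor, so the sketch below shows both for the 65536-byte unaligned case (CPU Old 1115, CPU New 871). The helper names are hypothetical, and the inputs are the rounded values printed in the table, so the result differs slightly from the reported -0.2189.

```c
#include <assert.h>

/* Relative change in time: negative means the new build is faster. */
static double relative_change(double old_time, double new_time) {
    return (new_time - old_time) / old_time;
}

/* Throughput speedup factor: how many times faster the new build runs. */
static double speedup_factor(double old_time, double new_time) {
    return old_time / new_time;
}
```

With the rounded table values, relative_change(1115, 871) is about -0.219 (a ~22% time reduction), while speedup_factor(1115, 871) is about 1.28.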

No regression on the non-copy CRC32 path (COPY=0 codepath is unaffected). All 1314 CRC32 gtests pass.

nmoinvaz/zlib-ng-pr-2176-opt.md
