Replace memcpy calls in the CRC32+copy interleaved path with direct NEON stores
(vst1q_u64) of already-loaded vectors, and direct scalar stores of already-loaded
uint64_t values. This eliminates redundant load/store sequences that the compiler
generated for memcpy when the source data was already in registers.
Additionally, reorder the vector loop so that stores happen before eor3 operations,
reducing register pressure and allowing the compiler to better interleave stores with
PMULL multiplies.
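
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this change: `copy_and_fold` and `HAVE_NEON_SKETCH` are hypothetical names, and a plain XOR (`veorq_u64`) stands in for the real PMULL/EOR3 folding so the sketch builds without the extra target features those instructions require. On non-arm64 targets the NEON loop is compiled out and only the scalar loop runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1 /* hypothetical feature macro for this sketch */
#else
#define HAVE_NEON_SKETCH 0
#endif

/* Hypothetical helper: copies n 64-bit words from src to dst and returns the
 * XOR of all words. The XOR is a placeholder for the CRC folding step. */
static uint64_t copy_and_fold(uint64_t *dst, const uint64_t *src, size_t n) {
    assert(dst != NULL && src != NULL);
    uint64_t fold = 0;
    size_t i = 0;

#if HAVE_NEON_SKETCH
    uint64x2_t acc = vdupq_n_u64(0);
    for (; i + 4 <= n; i += 4) {
        /* Load each vector exactly once. */
        uint64x2_t a = vld1q_u64(src + i);
        uint64x2_t b = vld1q_u64(src + i + 2);

        /* Store the registers directly instead of calling memcpy on the
         * source bytes, and do it before the fold so the stores can be
         * scheduled alongside the arithmetic. */
        vst1q_u64(dst + i, a);
        vst1q_u64(dst + i + 2, b);

        /* Placeholder for the PMULL + EOR3 fold. */
        acc = veorq_u64(acc, veorq_u64(a, b));
    }
    fold = vgetq_lane_u64(acc, 0) ^ vgetq_lane_u64(acc, 1);
#endif

    /* Scalar tail: the loaded value is stored directly, no memcpy. */
    for (; i < n; i++) {
        uint64_t v = src[i];
        dst[i] = v;
        fold ^= v;
    }
    return fold;
}
```

The point of the sketch is only the ordering inside the loop: load once, store the same registers, then fold. Whether the compiler actually interleaves the stores with the multiplies depends on the target and optimization level.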
- CPU: Apple M3 (8 cores)
- RAM: 24 GB
- OS: macOS (Darwin 24.6.0), arm64
- Compiler: Apple clang 17.0.0 (clang-1700.6.4.2)
- Build: CMake Release, static library
- Baseline: 519 instructions (crc32_copy_armv8_pmull_eor3)
- Optimized: 508 instructions (−2.1%)
- Hot loop: eliminated 9 redundant 16-byte loads from the memcpy codegen; stores are now interleaved with PMULL operations for better scheduling.
Comparing baseline to optimized

| Benchmark | Time | CPU | Time Old | Time New | CPU Old | CPU New |
|---|---|---|---|---|---|---|
| crc32_copy/armv8_pmull_eor3/32_median | +0.0560 | +0.0561 | 6 | 6 | 6 | 6 |
| crc32_copy/armv8_pmull_eor3/512_median | -0.0323 | -0.0325 | 37 | 36 | 37 | 36 |
| crc32_copy/armv8_pmull_eor3/8192_median | -0.1461 | -0.1466 | 193 | 165 | 192 | 164 |
| crc32_copy/armv8_pmull_eor3/32768_median | -0.1989 | -0.1993 | 591 | 474 | 589 | 472 |
| crc32_copy/armv8_pmull_eor3/65536_median | -0.2163 | -0.2189 | 1119 | 877 | 1115 | 871 |
| crc32_copy/armv8_pmull_eor3_aligned/32_median | +0.0006 | +0.0002 | 4 | 4 | 4 | 4 |
| crc32_copy/armv8_pmull_eor3_aligned/512_median | -0.0379 | -0.0366 | 36 | 34 | 36 | 34 |
| crc32_copy/armv8_pmull_eor3_aligned/8192_median | -0.1497 | -0.1494 | 190 | 162 | 190 | 161 |
| crc32_copy/armv8_pmull_eor3_aligned/32768_median | -0.1992 | -0.1990 | 588 | 471 | 586 | 469 |
| crc32_copy/armv8_pmull_eor3_aligned/65536_median | -0.2265 | -0.2275 | 1116 | 863 | 1112 | 859 |

| Size | CPU time reduction (median) |
|---|---|
| 512 | ~3% |
| 8K | ~15% |
| 32K | ~20% |
| 64K | ~22% |
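
For readers checking the percentages against the raw benchmark numbers, the sketch below assumes the Time and CPU columns are the relative change `(new - old) / old`, which is consistent with the values shown. `relative_change` is a hypothetical helper name. Because the Old and New columns are rounded, recomputing from them may differ from the reported figure in the last digits.

```c
#include <stdio.h>

/* Hypothetical helper: relative change between two timings.
 * Negative values mean the new build is faster. */
static double relative_change(double old_val, double new_val) {
    return (new_val - old_val) / old_val;
}

/* Prints the recomputed change for the aligned 64K case
 * (CPU Old = 1112, CPU New = 859 in the benchmark output). */
static void print_example(void) {
    printf("%.4f\n", relative_change(1112.0, 859.0));
}
```

For the aligned 64K row this gives roughly -0.2275, which matches the ~22% entry in the summary table.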
No regression on the non-copy CRC32 path (COPY=0 codepath is unaffected). All 1314 CRC32 gtests pass.