Replace memcpy calls in the interleaved CRC32+copy path with direct NEON stores
(vst1q_u64) of already-loaded vectors and direct scalar stores of already-loaded
uint64_t values. This eliminates the redundant load/store sequences that the
compiler generated for memcpy even though the source data was already in
registers.
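
A minimal sketch of the store pattern described above, not the actual code from
this change. The helper names copy16_and_fold and copy8_and_return are
hypothetical, the XOR fold is only a stand-in for the CRC work that needs the
data in a register, and the non-AArch64 branch exists only so the sketch builds
on other targets:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy 16 bytes and return the XOR of the two 64-bit
 * lanes. The XOR stands in for checksum work that consumes the same data. */
static uint64_t copy16_and_fold(uint64_t *dst, const uint64_t *src)
{
#if defined(__aarch64__)
    uint64x2_t v = vld1q_u64(src);   /* one load; v is needed for the fold */
    vst1q_u64(dst, v);               /* store the register, no memcpy */
    return vgetq_lane_u64(v, 0) ^ vgetq_lane_u64(v, 1);
#else
    /* Portable fallback (assumption: only here to keep the sketch buildable). */
    uint64_t lo = src[0];
    uint64_t hi = src[1];
    memcpy(dst, src, 16);
    return lo ^ hi;
#endif
}

/* Hypothetical helper: scalar variant. The 64-bit word is loaded once and the
 * same value is stored directly instead of calling memcpy on the source. */
static uint64_t copy8_and_return(uint64_t *dst, const uint64_t *src)
{
    uint64_t w = *src;   /* already needed in a register */
    *dst = w;            /* direct scalar store */
    return w;
}
```

Whether a given compiler already folds the memcpy into a store of the live
register depends on the compiler version and flags, so the benefit should be
confirmed by inspecting the generated assembly.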
Additionally, reorder the vector loop so that stores happen before eor3 operations,