Just dropping my experience here for future readers.
For a while I had been seeing random file corruption with ZFS: every so often a scrub would find 1 to 5 CKSUM errors on some files:
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Nov 11 01:00:08 2023
        13.7T scanned at 360M/s, 12.4T issued at 325M/s, 17.1T total
        576K repaired, 72.39% done, 04:13:42 to go
config:

        NAME                                 STATE     READ WRITE CKSUM
        data                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST6000DM003-2CY186_ZR12LE38  ONLINE       0     0     6  (repairing)
            ata-ST6000DM003-2CY186_ZR12MMQG  ONLINE       0     0     5  (repairing)
            ata-ST6000DM003-2CY186_ZR12N3H6  ONLINE       0     0     4
            ata-ST6000DM003-2CY186_ZR12NXRA  ONLINE       0     0     4
            ata-ST6000DM003-2CY186_ZSB010A4  ONLINE       0     0     5  (repairing)

errors: Permanent errors have been detected in the following files:

        /data/Backups/coldbackup/stream/backup.2023-02-07_23.log.gz
        /data/Backups/coldbackup/stream/backup.2023-02-20_11.log.gz
        /data/Backups/coldbackup/stream/backup.2023-10-08_06.log.gz
The weird thing was that ZFS found the CKSUM error on every copy of the data (raidz1), so the files were unrecoverable. That alone should have tipped me off.
I had made the mistake of running ZFS on SMR drives, so I didn't think much of it and just planned to replace them with CMR drives anyway, since the pool was hitting 80% capacity.
After replacing all the drives, 4 more files turned up with bad CKSUMs, which I half expected when reading that much data off such crappy drives.
But then I ran a final scrub on the new drives after clearing the errors, and once again a CKSUM error popped up.
I ended up firing up Memtest86+ and letting it run overnight, and there it was:
The memory started throwing errors late in the test run (tests #8 and #9), and the following data corruption showed up on random CPU cores and random DIMMs:
This motherf***er
V
0001100000000111111101001101011111100101000010100001111000101101 1807f4d7e50a1e2d <- expected
0001100000000111111101001101011111100101000000100001111000101101 1807f4d7e5021e2d <- read-back
1111110100011000110110000000100000011001000111001010101100010100 fd18d808191cab14 <- expected
1111110100011000110110000000100000011001000101001010101100010100 fd18d8081914ab14 <- read-back
A single bit was causing all the trouble.
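You can verify that yourself from the two word pairs above: XORing each expected value against its read-back value leaves exactly the flipped bits set. (Quick throwaway script of mine; the values are copied straight from the memtest output.)

```python
# XOR each expected word against what was read back from RAM;
# the set bits in the result are exactly the flipped bits.
pairs = [
    (0x1807F4D7E50A1E2D, 0x1807F4D7E5021E2D),
    (0xFD18D808191CAB14, 0xFD18D8081914AB14),
]
for expected, readback in pairs:
    diff = expected ^ readback
    print(f"diff={diff:016x}  flipped bits={bin(diff).count('1')}")
# diff=0000000000080000  flipped bits=1  (both pairs)
```

Both pairs differ in exactly one bit, and it's even the same bit position (bit 19) in both words.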
TLDR: I was running Ryzen with DDR4 and an XMP profile at 3200 MHz. I lowered the speed to 3000 MHz and Memtest stopped reporting bad RAM. I should have trusted ZFS and realized that two drives could not return the same error at the same time.
My assumption is that ZFS checksummed the data in memory, a bit flipped, and the corrupted data got written to both disks.
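That failure mode is easy to sketch in a few lines. This is only an illustration of my assumption, not how ZFS actually lays out writes: SHA-256 stands in for ZFS's checksum (fletcher4 by default), two whole copies stand in for the raidz stripes, and the buffer contents are made up.

```python
import hashlib

# Sketch of the assumed failure mode: checksum is computed while the
# in-memory buffer is still good, then a single bit flips before the write.
data = bytearray(b"important backup data")
checksum = hashlib.sha256(data).digest()  # taken while data is intact

data[5] ^= 0x08  # one bit flips in RAM

# Every disk receives the same corrupted buffer plus the stale checksum,
# so on the next scrub every copy fails verification and nothing can repair it.
disks = [bytes(data) for _ in range(2)]
for copy in disks:
    print(hashlib.sha256(copy).digest() == checksum)  # False on every disk
```

Since all copies carry the identical corruption, the scrub sees a permanent error with no good replica to repair from, which matches the `zpool status` output above.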
ALWAYS test your hardware (and BIOS settings) pre-prod!!!!