Just dropping my experience here for future readers.
For a while I had been seeing random file corruption with ZFS: every so often a scrub would find 1 to 5 CKSUM errors on some files:
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Nov 11 01:00:08 2023
        13.7T scanned at 360M/s, 12.4T issued at 325M/s, 17.1T total
        576K repaired, 72.39% done, 04:13:42 to go
config:

        NAME                                 STATE     READ WRITE CKSUM
        data                                 ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            ata-ST6000DM003-2CY186_ZR12LE38  ONLINE       0     0     6  (repairing)
            ata-ST6000DM003-2CY186_ZR12MMQG  ONLINE       0     0     5  (repairing)
            ata-ST6000DM003-2CY186_ZR12N3H6  ONLINE       0     0     4
            ata-ST6000DM003-2CY186_ZR12NXRA  ONLINE       0     0     4
            ata-ST6000DM003-2CY186_ZSB010A4  ONLINE       0     0     5  (repairing)

errors: Permanent errors have been detected in the following files:

        /data/Backups/coldbackup/stream/backup.2023-02-07_23.log.gz
        /data/Backups/coldbackup/stream/backup.2023-02-20_11.log.gz
        /data/Backups/coldbackup/stream/backup.2023-10-08_06.log.gz
The weird thing was that ZFS found the CKSUM error on every copy of the data (raidz1), so the files were unrecoverable. That alone should have tipped me off.
I had made the mistake of running ZFS on SMR drives, so I didn't think much of it and just planned to replace them with CMR drives anyway, since the pool was hitting 80% capacity.
After replacing all the drives, 4 more files turned up with bad CKSUMs, which I half expected when reading that much data off such crappy drives.
But then I ran a final scrub on the new drives after clearing the errors, and once again a CKSUM error popped up.
I ended up firing up Memtest86+ and letting it run overnight, and there it was:
The memory started throwing errors late in the test run (tests #8 and #9), and the following data corruption showed up on random CPU cores and random DIMMs:
This motherf***er
V
0001100000000111111101001101011111100101000010100001111000101101 1807f4d7e50a1e2d <- expected
0001100000000111111101001101011111100101000000100001111000101101 1807f4d7e5021e2d <- read-back
1111110100011000110110000000100000011001000111001010101100010100 fd18d808191cab14 <- expected
1111110100011000110110000000100000011001000101001010101100010100 fd18d8081914ab14 <- read-back
A single bit was causing all the trouble.
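You can verify that yourself from the two word pairs above: XORing each expected value against its read-back value leaves exactly the flipped bits set. (Quick throwaway script of mine; the values are copied straight from the memtest output.)

```python
# XOR each expected word against what was read back from RAM;
# the set bits in the result are exactly the flipped bits.
pairs = [
    (0x1807F4D7E50A1E2D, 0x1807F4D7E5021E2D),
    (0xFD18D808191CAB14, 0xFD18D8081914AB14),
]
for expected, readback in pairs:
    diff = expected ^ readback
    print(f"diff={diff:016x}  flipped bits={bin(diff).count('1')}")
# diff=0000000000080000  flipped bits=1  (both pairs)
```

Both pairs differ in exactly one bit, and it's even the same bit position (bit 19) in both words.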
TLDR: I was running Ryzen with DDR4 and an XMP profile at 3200 MHz. I lowered the speed to 3000 MHz and Memtest stopped reporting bad RAM. I should have trusted ZFS and realized that two drives could not return the same error at the same time.
My assumption is that ZFS checksummed the data in memory, a bit flipped, and the corrupted data got written to both disks.
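That failure mode is easy to sketch in a few lines. This is only an illustration of my assumption, not how ZFS actually lays out writes: SHA-256 stands in for ZFS's checksum (fletcher4 by default), two whole copies stand in for the raidz stripes, and the buffer contents are made up.

```python
import hashlib

# Sketch of the assumed failure mode: checksum is computed while the
# in-memory buffer is still good, then a single bit flips before the write.
data = bytearray(b"important backup data")
checksum = hashlib.sha256(data).digest()  # taken while data is intact

data[5] ^= 0x08  # one bit flips in RAM

# Every disk receives the same corrupted buffer plus the stale checksum,
# so on the next scrub every copy fails verification and nothing can repair it.
disks = [bytes(data) for _ in range(2)]
for copy in disks:
    print(hashlib.sha256(copy).digest() == checksum)  # False on every disk
```

Since all copies carry the identical corruption, the scrub sees a permanent error with no good replica to repair from, which matches the `zpool status` output above.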
ALWAYS test your hardware (and BIOS settings) pre-prod!!!!