I always staked on ZFS before the merge, using a number of SATA SSDs in a simple stripe configuration and adding more as my space requirements increased. The merge put additional load on my disks, which meant my setup was no longer adequate; this sent me down a long road of testing and optimization. Let me say this up front: there are definitely more performant setups for this than ZFS. I've heard of very good results using mdadm and a simple ext4 filesystem (XFS also works). However, there are so many useful features baked into ZFS (compression, snapshots) and the ergonomics are so good that I was compelled to make this work for my (aging) setup.
I settled on a single fio benchmark for comparing my different setups, based on sar/iostat analyses of working setups. It is as follows:
sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting
This will lay down several files and perform random reads and writes. I always deleted these files between tests, although that may not be necessary. The read/write mix was based on my execution client (Erigon) being fairly write heavy; I imagine it's similar for other EL clients. Please note that this benchmark will never perfectly capture the IO demands of your setup; it's just a synthetic test to use as a reference.
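As a rough sketch of the run-then-delete routine described above, the wrapper below runs the same fio command against a chosen directory and then removes the data files. The function name `run_bench`, the `DRY_RUN` switch, and the `/tank/bench` path are hypothetical, and the cleanup glob assumes fio's default file naming of `<jobname>.<jobnum>.<filenum>`.

```shell
#!/bin/sh
# Run the benchmark from the post inside a target directory, then clean up.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_bench() {
    dir=$1
    run=${DRY_RUN:+echo}   # expands to "echo" in dry-run mode, to nothing otherwise

    $run sudo fio --name=randrw --directory="$dir" --rw=randrw --direct=1 \
        --ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20 \
        --size=1G --runtime=600 --group_reporting

    # Assumption: fio names its data files <jobname>.<jobnum>.<filenum>,
    # e.g. randrw.0.0 through randrw.7.0 for eight jobs.
    $run sudo rm -f "$dir"/randrw.*.*
}

# Example (hypothetical mount point):
#   DRY_RUN=1 run_bench /tank/bench   # preview the commands
#   run_bench /tank/bench             # actually run the 10-minute test
```

Pointing `--directory` at the dataset under test keeps the files on the pool being measured rather than wherever the shell happens to be.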
In my experience, you'll want the above benchmark to report at least the following:
read: IOPS=6500, BW=25MiB/s
write: IOPS=25k, BW=100MiB/s
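The IOPS and bandwidth figures are two views of the same number, since the benchmark uses a fixed 4 KiB block size. A small helper (the name `iops_to_mibs` is made up for illustration) shows the conversion, which can be handy for sanity-checking your own fio output:

```shell
#!/bin/sh
# Convert IOPS at a given block size (in KiB, default 4) to MiB/s.
iops_to_mibs() {
    awk -v iops="$1" -v bs_kib="${2:-4}" \
        'BEGIN { printf "%.1f\n", iops * bs_kib / 1024 }'
}

iops_to_mibs 6500    # read target:  about 25 MiB/s
iops_to_mibs 25000   # write target: about 98 MiB/s (fio rounds 25k IOPS)
```

The read/write ratio of roughly 1:4 also lines up with the `--rwmixread=20` setting in the benchmark.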
Pool geometry (stripes, mirrors, raidz, etc.) makes a huge difference to the performance/redundancy profile you end up with. You could write a book about this, so I'll just say that if pure performance is your concern, a simple flat stripe with no parity will be your best bet. However, I think making mirror pairs and striping them would also be very performant and give you easier disaster recovery. I took a lot of the performance recommendations from this excellent article.
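To make the two layouts concrete, here is a minimal sketch that only prints the corresponding `zpool create` command rather than running it. The helper name `zpool_create_cmd`, the pool name `tank`, and the disk names are placeholders; in practice you would likely use stable `/dev/disk/by-id/` paths.

```shell
#!/bin/sh
# Print (do not run) a `zpool create` command for one of two layouts.
# Usage: zpool_create_cmd <stripe|mirror> <pool> <disk>...
zpool_create_cmd() {
    layout=$1; pool=$2; shift 2
    if [ $# -eq 0 ]; then
        echo "need at least one disk" >&2
        return 1
    fi
    case $layout in
        stripe)
            # Flat stripe: every disk is its own top-level vdev, no redundancy.
            echo "zpool create $pool $*"
            ;;
        mirror)
            # Striped mirrors: disks are paired, and ZFS stripes across the pairs.
            if [ $(( $# % 2 )) -ne 0 ]; then
                echo "mirror layout needs an even number of disks" >&2
                return 1
            fi
            cmd="zpool create $pool"
            while [ $# -gt 0 ]; do
                cmd="$cmd mirror $1 $2"
                shift 2
            done
            echo "$cmd"
            ;;
        *)
            echo "unknown layout: $layout" >&2
            return 1
            ;;
    esac
}

zpool_create_cmd stripe tank sda sdb sdc sdd
zpool_create_cmd mirror tank sda sdb sdc sdd
```

With four disks the stripe gives the full capacity of all four and loses the pool if any one fails, while the mirrored layout gives half the capacity and survives one failure per pair.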
ZFS controls almost all of its tuning through properties on the ZFS dataset. I used the following successfully:
recordsize=4K
compression=lz4
atime=off
xattr=sa
redundant_metadata=most
primarycache=metadata (this will slow down the fio benchmark, but in theory it should be faster in Erigon since it handles its own caching)
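One way to apply the list above in a single pass is a small loop around `zfs set`. This is only a sketch: the function name `apply_tuning`, the `DRY_RUN` switch, and the dataset name `tank/erigon` are hypothetical, so substitute your own dataset.

```shell
#!/bin/sh
# Apply the dataset properties listed above, then read them back.
# Set DRY_RUN=1 to print the commands instead of executing them.
apply_tuning() {
    dataset=$1
    run=${DRY_RUN:+echo}   # "echo" in dry-run mode, empty otherwise

    for prop in recordsize=4K compression=lz4 atime=off xattr=sa \
                redundant_metadata=most primarycache=metadata; do
        $run sudo zfs set "$prop" "$dataset"
    done

    # Show the values that are now in effect.
    $run zfs get recordsize,compression,atime,xattr,redundant_metadata,primarycache "$dataset"
}

# Example (hypothetical dataset name):
#   DRY_RUN=1 apply_tuning tank/erigon
```

Properties such as recordsize and compression only affect data written after they are set, which is one reason to apply them while the dataset is still empty.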
I got most of these from another great article that goes over a lot of the "why" behind these recommendations. I would stay away from changing the checksum and sync properties, as you may deeply regret it down the road.
- My original pool had pathological performance issues that only surfaced once the merge put it under stress. I still don't know how those came about, but my recommendation would be to set up everything (geometry, properties) before adding any data to the pool and then leave them alone.
- If you have the know-how, having a backup validator on AWS or the like gives you a lot more freedom to experiment and pay attention to the details. Don't rush these things; there's a lot on the line.
- Keep detailed notes about all these tests. The numbers start to add up and you'll start second-guessing yourself. I use Trello for all my projects and love it.
- If you're syncing from scratch on Erigon, I recommend setting --db.pagesize=16K in the Erigon command and setting recordsize=16K to take advantage of ZFS compression. It may sound counter-intuitive, but compression adds such a minor compute overhead compared to your IO latency that you'll actually get a performance boost, because the disks need to address fewer sectors than they would if the data were uncompressed.
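The sector argument in the last point can be illustrated with a bit of arithmetic. The sketch below assumes 4 KiB disk sectors (ashift=12) and that ZFS allocates whole sectors per record; the compressed sizes are made-up examples, not measurements, and `sectors_needed` is a hypothetical helper name.

```shell
#!/bin/sh
# Number of whole sectors needed to store a block of the given size.
# Usage: sectors_needed <bytes> [sector_size_bytes, default 4096]
sectors_needed() {
    bytes=$1
    sector=${2:-4096}
    echo $(( (bytes + sector - 1) / sector ))   # integer ceiling division
}

sectors_needed 16384   # uncompressed 16K record
sectors_needed 9000    # same record if it compressed to 9000 bytes (hypothetical)
sectors_needed 3000    # a 4K record compressed to 3000 bytes still takes a full sector
```

Under these assumptions a 4K record can never shrink below one sector, so compression saves nothing on disk, while a 16K record has room to drop from four sectors to three or fewer.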
Yeesh, yeah that's not great. Apparently OpenZFS is trending towards features that are more SSD focused, so maybe there will be a big release at some point that fixes this issue for our use case. It doesn't seem viable at the moment though.
@abhishektvz I'd be curious to see the results of that experiment!
@yorickdowne how are you generating those stats btw?