pryce-turner/staking_zfs.md

Last active October 23, 2023 16:41

Ethereum POS Staking on ZFS

Staking on ZFS

Intro

I always staked on ZFS before the merge, using a number of SATA SSDs in a simple stripe configuration and adding more as my space requirements increased. The merge imposed additional load on my disks, which meant my setup was no longer appropriate; this sent me down a long road of testing and optimization. Let me say this up front: there are definitely more performant setups for this than ZFS. I've heard of very good results using mdadm and a simple ext4 filesystem (XFS also works). However, there are so many useful features baked into ZFS (compression, snapshots), and the ergonomics are so good, that I was compelled to make this work for my (aging) setup.

Benchmark

I settled on a single fio benchmark for comparing my different setups, based on sar/iostat analyses of working setups. It is as follows: sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting. This will lay down several files (one per job) and perform random 4k reads and writes against them. I always deleted these files between tests, although that may not be necessary. The read/write mix (20% reads, 80% writes) was based on my execution client (Erigon) being fairly write heavy; I imagine it's similar for other EL clients. Please note that this benchmark will never perfectly capture the IO demands of your setup; it's just a synthetic test to use as a reference.
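For readability, here is a minimal sketch that prints the same fio invocation so it can be reviewed before running. The flags are the ones from the text; the `--directory` option and the `/tank/bench` path are assumptions added here so the test files land on the dataset under test and are easy to delete afterwards.

```shell
#!/bin/sh
# BENCH_DIR is a hypothetical scratch directory on the pool being tested.
BENCH_DIR="${BENCH_DIR:-/tank/bench}"

# Print the fio command line for a given target directory.
fio_cmd() {
    echo "fio --name=randrw --directory=$1 --rw=randrw --direct=1" \
         "--ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20" \
         "--size=1G --runtime=600 --group_reporting"
}

fio_cmd "$BENCH_DIR"

# To actually run it and remove the test files afterwards:
#   sudo sh -c "$(fio_cmd "$BENCH_DIR")" && sudo rm -f "$BENCH_DIR"/randrw.*
```

Printing rather than executing is a deliberate choice in this sketch: the benchmark writes several gigabytes, so it is worth checking the target path first.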

In my experience, at minimum, you'll want to produce the following from the above benchmark:

read: IOPS=6500, BW=25MiB/s
write: IOPS=25k, BW=100MiB/s
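As a sanity check on these targets, bandwidth is just IOPS multiplied by the block size. The sketch below does that arithmetic for the 4k block size used in the benchmark; the helper name `iops_to_mib` is made up for illustration.

```shell
# bandwidth (MiB/s) = IOPS * block size (KiB) / 1024
iops_to_mib() {
    LC_ALL=C awk -v iops="$1" -v bs_kib="$2" \
        'BEGIN { printf "%.1f\n", iops * bs_kib / 1024 }'
}

iops_to_mib 6500 4     # read target at bs=4k
iops_to_mib 25000 4    # write target at bs=4k
```

This gives roughly 25.4 and 97.7 MiB/s, in line with the rounded figures above. The read share of total IOPS, 6500 out of 31500, is also close to the 20% requested with --rwmixread=20.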

Pool Geometry

Pool geometry (stripes, mirrors, raidz, etc.) makes a huge difference to the performance/redundancy profile you can achieve. You could write a book about this, so I'll just say that if pure performance is your concern, a simple flat stripe with no parity will be your best bet. However, I think making mirror pairs and striping them would also be very performant and give you easier disaster recovery. I took a lot of the performance recommendations from this excellent article.
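To make the two layouts concrete, the sketch below builds the `zpool create` command for a flat stripe and for striped mirror pairs. It only prints the commands; the pool name `tank` and the device names are placeholders, and on a real system stable /dev/disk/by-id paths would be a safer choice than sdX names.

```shell
# usage: stripe_cmd POOL DEV...
stripe_cmd() {
    pool=$1; shift
    echo "zpool create $pool $*"
}

# usage: mirror_pairs_cmd POOL DEV1 DEV2 [DEV3 DEV4 ...]
mirror_pairs_cmd() {
    pool=$1; shift
    # Mirror pairs need an even number of devices.
    [ $(( $# % 2 )) -eq 0 ] || { echo "need an even number of devices" >&2; return 1; }
    cmd="zpool create $pool"
    while [ "$#" -ge 2 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    echo "$cmd"
}

stripe_cmd tank sda sdb sdc sdd          # no redundancy, maximum capacity
mirror_pairs_cmd tank sda sdb sdc sdd    # two mirror vdevs, striped together
```

With four drives, the stripe exposes the capacity of all four while the mirror layout exposes two, in exchange for surviving one failed drive per pair.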

Pool properties

ZFS controls almost all of its tuning through properties on the ZFS dataset. I used the following successfully:

recordsize=4K
compression=lz4
atime=off
xattr=sa
redundant_metadata=most
primarycache=metadata (This will slow down the fio benchmark but in theory should be faster in Erigon since it handles its own caching)

I got most of these from another great article that goes over a lot of the "why" behind these recommendations. I would leave the checksum and sync properties alone, as changing them is something you may deeply regret down the road.
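One way to apply the properties listed above is to pass them at dataset creation time, so they are in place before any data is written. This is a sketch that prints the command; `tank/erigon` is a hypothetical dataset name.

```shell
# usage: dataset_cmd DATASET
dataset_cmd() {
    ds=$1
    opts=""
    for prop in recordsize=4K compression=lz4 atime=off xattr=sa \
                redundant_metadata=most primarycache=metadata; do
        opts="$opts -o $prop"
    done
    echo "zfs create$opts $ds"
}

dataset_cmd tank/erigon

# Afterwards, confirm what is actually in effect:
#   zfs get recordsize,compression,atime,xattr,redundant_metadata,primarycache tank/erigon
```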

Observations / Things to watch out for

My original pool had pathological performance issues that only arose when it was stressed from the merge. I still don't know how those came about, but my recommendation would be to set up everything (geometry, properties) before adding any data to the pool and then not change them.
If you have the know-how, having a backup validator on AWS or the like gives you a lot more freedom to experiment and pay attention to the details. Don't rush these things, there's a lot on the line.
Keep detailed notes about all these tests. The numbers start to add up and you'll start second-guessing yourself. I use Trello for all my projects and love it.
If you're syncing from scratch on Erigon, I recommend setting --db.pagesize=16K in the Erigon command and setting recordsize=16K to take advantage of ZFS compression. It may sound counter-intuitive, but compression presents such a minor compute overhead compared to your IO latency that you'll actually get a performance boost from the disks needing to address fewer sectors than they would if the data were uncompressed.
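Since the point of the last item is that the database page size and the ZFS recordsize should agree, a small check like the following may help catch a mismatch before a long sync. The helper names are made up for illustration, and `tank/erigon` is a hypothetical dataset. Note that a changed recordsize only applies to blocks written after the change, which is why this is best settled before syncing.

```shell
# Convert a size such as 4096, 16K or 1M to bytes.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%[Kk]} * 1024 )) ;;
        *[Mm]) echo $(( ${1%[Mm]} * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# usage: sizes_match DB_PAGESIZE ZFS_RECORDSIZE
sizes_match() {
    [ "$(to_bytes "$1")" -eq "$(to_bytes "$2")" ]
}

if sizes_match 16K 16384; then
    echo "page size and recordsize agree"
else
    echo "page size and recordsize differ"
fi

# On a live system the second argument could come from:
#   zfs get -Hp -o value recordsize tank/erigon
```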

Author

pryce-turner commented Dec 3, 2022

Yeesh, yeah that's not great. Apparently OpenZFS is trending towards features that are more SSD focused, so maybe there will be a big release at some point that fixes this issue for our use case. It doesn't seem viable at the moment, though.

@abhishektvz I'd be curious to see the results of that experiment!

@yorickdowne how are you generating those stats btw?

yorickdowne commented Dec 4, 2022

@pryce-turner Curl the metrics interface and grep for engine_newPayloadV1
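A sketch of that curl-and-grep approach, reduced to a filter that prints one "quantile seconds" pair per line. The endpoint in the usage comment is an assumption; the host, port, and path depend on how metrics are enabled on your client. The sample line is taken from the results posted in this thread.

```shell
# Read Prometheus-style metrics on stdin, keep the engine_newPayloadV1
# quantile lines, and print "quantile seconds" for each.
payload_quantiles() {
    grep 'method="engine_newPayloadV1"' |
        sed -n 's/.*quantile="\([^"]*\)"} \(.*\)$/\1 \2/p'
}

# Demo on a single sample line:
echo 'rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.948885932' |
    payload_quantiles

# Against a running node (hypothetical URL):
#   curl -s http://localhost:6060/debug/metrics/prometheus | payload_quantiles
```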

Author

pryce-turner commented Dec 5, 2022

> @pryce-turner Curl the metrics interface and grep for engine_newPayloadV1

Awesome, thanks!

yorickdowne commented Jan 5, 2023 (edited)

The 4erigon_2 branch has improved results.

ZFS settings

recordsize             16k
compression            lz4
atime                  off
xattr                  sa
primarycache           all
logbias                throughput
sync                   standard
relatime               off

I get compressratio 1.38x with lz4. This can likely be run without issue on zstd-fast to get more compression.

Quantiles are better, but not "amazing" with WRITE_MAP=true.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.948885932
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.3364602159999999
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.405461775
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.5774590640000001
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.5774590640000001

There's an fsync happening. If I yolo this and set sync=disabled I get different values:

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.125063721
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.192553632
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 4.362780238
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 4.362780238
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 4.362780238

sync disabled and write_map off as well:

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.116620899
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 0.833232106
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.035332535
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.035332535
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.035332535

More testing needed and:

The new branch / mdbx version helps
ZFS has an issue with sync writes, which is well documented and known
write_map is not necessary, and possibly not helpful

I am redoing this test with zstd-fast
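The median and the tail quantiles in the runs above move in opposite directions when sync is disabled, which is easy to misread. A quick way to compare two runs is to divide the matching quantiles; the `ratio` helper below is made up for illustration and uses the figures posted above.

```shell
# usage: ratio A B  -> prints A / B with two decimals
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Median (quantile 0.5): sync=standard divided by sync=disabled
ratio 0.948885932 0.125063721

# Quantile 0.99: sync=disabled divided by sync=standard
ratio 4.362780238 1.5774590640000001
```

On these numbers the median is roughly 7.6 times lower with sync=disabled, while the 0.99 quantile is roughly 2.8 times higher. With sync disabled and write_map off, the worst observed value drops back to about 1.04 s.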

Author

pryce-turner commented Jan 8, 2023

Thanks @yorickdowne, that's good stuff. Interesting to see that disabling sync makes things that much worse... unless I'm reading it wrong. Is this all still on your SATAs or have you switched to NVMe?

yorickdowne commented Feb 3, 2023

This is NVMe, and disabling sync made it better, not worse.

That said I've retested this, and it's still abysmal. I'll stop testing, and may resume once OpenZFS 2.2 has been released and is available on Ubuntu 22.04.

Leadership meeting of 1/31 had "Will 2.2 be branched at some point soon?" - the video recording is not (yet) on Youtube so I don't know what was said.

Author

pryce-turner commented Feb 6, 2023

Gotcha - thanks for clarifying. Is there any optimization in particular for 2.2 you're hoping for, or just a new version that might be better?

yorickdowne commented Feb 6, 2023

Directio, and docker overlayfs support

Author

pryce-turner commented Feb 6, 2023

Got it - cheers

j4ys0n commented Sep 1, 2023

what did y'all land on for the zfs config? i've got a geth archive node on zfs that i need to do something with soon. it's filling up quite quickly these days.

# zfs get all nvme1
NAME   PROPERTY              VALUE                  SOURCE
nvme1  used                  21.5T                  -
nvme1  available             80.1G                  -
nvme1  compressratio         1.00x                  -
nvme1  recordsize            128K                   default
nvme1  compression           off                    default

setup is currently 12x 4tb firecuda drives in the zfs equivalent of raid10. as you can see above, it's almost full. strange thing is, zfs tells me ~80gb is free, while proxmox tells me ~257gb is free - but compression ratio is 1.0. i could have sworn compression was on when i created it, but looks like i'm wrong. anyway, running out of space. and for the record, i'm using teku in conjunction with geth. that's on a separate raid10 nvme volume and is using around 220gb.
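For context on how full that pool is, here is a rough capacity estimate. It assumes 4 TB means 4 x 10^12 bytes per drive and that the layout is two-way mirrors; it ignores metadata and any space ZFS reserves for itself, so it is an upper bound rather than an exact figure.

```shell
# usage: mirror_capacity_tib DRIVES SIZE_TB -> usable TiB with two-way mirrors
mirror_capacity_tib() {
    LC_ALL=C awk -v n="$1" -v tb="$2" \
        'BEGIN { printf "%.1f\n", n * tb * 1e12 / 2 / (2 ^ 40) }'
}

mirror_capacity_tib 12 4
```

That comes to about 21.8 TiB, so the 21.5T reported as used is close to everything the mirrors can hold under these assumptions.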

this node in particular isn't staking, but i do have a few staking nodes also (on different servers). so i'd likely apply the same methodology. here's what i'm thinking, in no particular order.

get bigger drives. the 8tb corsair drives have come down in price a bit. though, that's still pretty spendy and i'd rather not.
reconfigure from raid10 to draid2 or draid3 and just keep going.
leave the volume as is and use erigon. it's been a while but last time i played with it syncing an archive node had a few issues.
reconfigure the volume and use erigon - which i have a feeling is the winner, but i'm not sure.

i do have a pretty big sata volume on this machine also, i ran the archive node on that for a bit, but it killed a few of the drives (ironwolf 4tb) and the warranty process was a nightmare. the nvme volume has been pretty worry free so far.

on researching zfs's newer draid vdevs, it seems like block size / record size plays a big factor in performance of that particular configuration.

Author

pryce-turner commented Sep 6, 2023

Hey @j4ys0n, sorry only just getting to this... Are you having performance issues or are you just running out of space and want to do things as best as possible when setting up your new array?

As far as what we landed on with zfs (apart from don't do it 😅), I think it hasn't deviated too much from the original recommendations. Bring your recordsize down and try to match it with the db.pagesize (not sure how to do that in geth but that's mentioned in the original for erigon). You should be able to turn compression on with very little overhead.

As far as geometry goes.. it kinda depends on how much downtime you can tolerate. If I were you I would get fewer, larger NVMe drives (to reduce the stripe width). I'd make a fast and loose pool with basic striping (no redundancy, max performance) and then get some much cheaper TBs in another pool, can even be spinners, to backup to. The beauty of ZFS as you well know (and COW in general) is that the snapshot/replication to the backup pool will be super fast since it's only updating changed blocks. Since you have 2 copies of the data anyways (raid10 mirrors) you might as well save some money and get a performance boost. Again, main downside is downtime if a drive dies in your main stripe. My 2c.
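A minimal sketch of the snapshot-and-replicate step described above, printing the commands for one incremental run. The dataset names `fast/erigon` and `backup/erigon` and the date-style snapshot names are placeholders; the first replication would need a full `zfs send` without `-i`.

```shell
# usage: replicate_cmd SRC_DATASET DST_DATASET PREV_SNAP NEW_SNAP
replicate_cmd() {
    # Take the new snapshot on the fast pool.
    echo "zfs snapshot $1@$4"
    # Send only the blocks changed since PREV_SNAP to the backup pool.
    echo "zfs send -i $1@$3 $1@$4 | zfs receive -F $2"
}

replicate_cmd fast/erigon backup/erigon 2023-09-05 2023-09-06
```

Because the incremental stream carries only blocks that changed between the two snapshots, the backup pool can be much slower than the primary without the replication taking long.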

I can't comment on draid performance, looks super cool and I'm sure is stable, just hasn't been out in the wild long enough. Hope that helps! There's also an awesome channel in the erigon discord for zfs optimizations fyi. Good luck!

yorickdowne commented Sep 6, 2023 (edited)

Draid isn’t any faster than raidz - on a per vdev or several vdev basis. The point of draid is fast hot standby, for setups that would otherwise use several raidz vdevs in the pool.

For the DB though you don’t want raidz, performance is going to be even worse. Mirror setup is the way for speed.

Now with zfs you can replace one drive in the mirror, replace the other, and you have the higher capacity. That’s one option.
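The replace-one-then-the-other approach could look like the sketch below, which prints the commands for a single mirror. Pool and device names are placeholders. `zpool wait` requires OpenZFS 2.0 or later; on older versions, polling `zpool status` until the resilver finishes serves the same purpose.

```shell
# usage: grow_mirror_cmds POOL OLD_A NEW_A OLD_B NEW_B
grow_mirror_cmds() {
    # Let the pool grow automatically once both sides are larger.
    echo "zpool set autoexpand=on $1"
    echo "zpool replace $1 $2 $3"
    # Do not touch the second drive until the first resilver is done.
    echo "zpool wait -t resilver $1"
    echo "zpool replace $1 $4 $5"
    echo "zpool wait -t resilver $1"
}

grow_mirror_cmds nvme1 old-a new-a old-b new-b
```

The extra capacity only appears after both drives in the mirror have been replaced, since a mirror is limited by its smallest member.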
