Last active
August 29, 2015 14:15
zfs is actually not super cool
#!/bin/bash
# Summary: ZFS snapshots have a variety of remarkable limitations.
#
#   - "recv" has no way to just accept snapshot data without actually checking it out onto the volume working tree.
#     - it's probably going to have to be coddled by having a spare dataset off to the side that does network receives.
#   - "rollback" (reverting to a snapshot) is unsafe for anything other than depth=1 and may destroy snapshots.
#     - it has to be replaced by forking, and figuring out how to do an atomic replace. ("zfs rename" appears to be up to the task, but do note that it's precisely the atomicity level of any other mount operation; no more, no less.)
#   - "clone" operations, which you'd think would be the right choice for "forking" a dataset, are just bonkers: clone causes data dependencies (deletes get weird) and the clones aren't capable of receiving incremental updates (bake your noodle on that one).
#     - the "promote" operation screws with the persistence of the thing you cloned *from* as well as the clone; at that point, the clone can receive incrementals, but the source *can't* (and there's no way to teach it how again, since snapshot ordering is sorta like an append-only list).
#     - to be remotely sane, it looks like forking has to be implemented by a bounce through send/recv.
#
# Trying to build anything equivalent to "git checkout [branch]" (or rather, "git reset --hard [branch]")...
# in other words, something that just sets the content of a volume without destroying its ability to fetch updates...
# appears to be something that zfs is fighting tooth and nail to make into a complete mess.
# The weirdness has deep roots:
#   - first and foremost, snapshots can't be transmitted without the receive being targeted at a dataset; given that receive is a disruptively mutating action, that pretty much means we need spare datasets to buffer it.
#   - compounding that, snapshots aren't perceived as a content-addressable blobstore tag; they're perceived as having a linear relationship with each other.
#   - mix those together and it means that taking new snapshots can screw with your ability to receive snapshots.
#     - unless you copy a bunch of the snapshots to a new dataset and resume from there, which is a strangely difficult request.
# As far as keeping actual storage costs in hand / getting dedup to DTRT on disk:
#   - "clone" operations give you a new dataset without problems (though as previously mentioned, they basically lose their wits when it comes to updating).
#   - a send+recv bounce causes full new consumption problems. not very cool.
# So evidently, you get to choose between "delete" not being a working concept, or just accepting completely unnecessary storage bloat. :I
# So snapshots have, in total, very few legal operations that don't have ridiculous costs.
# Solutions?
# ...
# Mostly, give up.
# Incrementals can be had by remembering the dataset+snapshot that was last diverged from. Nothing else. Transitives are possible but get weird fast and there are lots of operations that are landmines.
# That means there's a permanent relationship to be remembered by the recipient. And the sender side is basically at the mercy of the recipient to garbage collect his own previous snapshot demands. (Wanna implement reference counting for maybe-hopefully-still-alive remote hosts that might have a claim on a resource? I don't.)
# The thing that really bugs me is that snapshots, in ZFS, just can't be treated as standalone bags of data. And you'd really think they should be.
# And the number of runtime "this doesn't work" or "this is going to cause previously transferred data to be discarded" or "this is gonna have to quietly do a non-incremental/worst-case thing" outcomes is huge.
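# A rough sketch of the "spare dataset off to the side + zfs rename" workaround
# from the summary above; not a tested recipe. Assumptions: replace_via_staging
# and the "_staging" / "_retired" suffixes are made-up names, the stream is a
# full (non-incremental) send, and nothing else touches these datasets while it
# runs. The swap is two separate renames, so there is a short window in which
# the live name does not exist. It is only defined here, never called.
replace_via_staging() {
    local src_snap=$1 live=$2
    local staging="${live}_staging" retired="${live}_retired"
    local mnt
    # remember where the live dataset is mounted so the replacement can take over
    mnt=$(zfs get -H -o value mountpoint "$live") || return 1
    # receive off to the side; -u keeps recv from mounting the new dataset
    zfs send "$src_snap" | zfs recv -u "${staging}@${src_snap#*@}" || return 1
    # unmount the old content, then swap the names
    zfs set mountpoint=none "$live" || return 1
    zfs rename "$live" "$retired" || return 1
    zfs rename "$staging" "$live" || return 1
    zfs set mountpoint="$mnt" "$live"
}
# usage would look something like: replace_via_staging testpool/ds1@beta testpool/ds3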
set -eo pipefail
function story { echo -e "\n\E[1;34m===" "$@" "=========>\E[0m"; }
function story2 { echo -e "\E[0;34m---" "$@" "--->\E[0m"; }
story "cleanup"
(set -x
    zpool destroy -f testpool || true # the pool won't exist on a first run; don't let 'set -e' abort over that
    rm -rf /tmp/ztest/
    mkdir /tmp/ztest/
)
story "new zpool and datasets"
(set -x
    truncate -s 1073741824 /tmp/ztest/zfs-testpool.dev
    zpool create -m none testpool /tmp/ztest/zfs-testpool.dev
)
story "prepare some input fixture data"
story2 "setup test data and snapshots"
(set -x
    # most tests use ds1
    zfs create -o mountpoint=/tmp/ztest/mnt/ds1 testpool/ds1
    touch /tmp/ztest/mnt/ds1/alpha ; zfs snapshot testpool/ds1@alpha
    touch /tmp/ztest/mnt/ds1/beta ; zfs snapshot testpool/ds1@beta
    touch /tmp/ztest/mnt/ds1/gamma ; zfs snapshot testpool/ds1@gamma
    # ds2 is here so we can see what happens when "unrelated" snapshots are introduced
    zfs create -o mountpoint=/tmp/ztest/mnt/ds2 testpool/ds2
    touch /tmp/ztest/mnt/ds2/alpha ; zfs snapshot testpool/ds2@alpha
    touch /tmp/ztest/mnt/ds2/delta ; zfs snapshot testpool/ds2@delta
)
story "test snapshot wholesale"
# backstory: there are two ways to receive a new snapshot... you can:
#   - skip the create and let the recv do it for you implicitly (but then you have to deal with the mounting separately; *there's no '-o' option that recv can pass through*; and to double the fun, you can't use `mount` on filesystems like you can with snapshots, because zfs is just a fractal of trolling)
#   - do recv with an -F (which may be fairly arbitrarily destructive)
story2 "play them all in one push (should be big) to an existing dataset"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@gamma | zfs recv -F testpool/ds3@gamma
    zfs list -r testpool -tall
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test -f /tmp/ztest/mnt/ds3/gamma
    zfs destroy -r testpool/ds3
)
story2 "play them all in one push (should be big) to an implicitly created dataset"
(set -x
    zfs send -v testpool/ds1@gamma | zfs recv testpool/ds3@gamma
    zfs set mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3 # also does the mount, god forbid it just set a property like the manpage says and then exit
    zfs list -r testpool -tall
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test -f /tmp/ztest/mnt/ds3/gamma
    zfs destroy -r testpool/ds3
)
story2 "'recv -F' is scary and will destroy working tree data"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    touch /tmp/ztest/mnt/ds3/data
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    zfs list -r testpool -tall
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/data
    zfs destroy -r testpool/ds3
)
story2 "'recv -F' will back off on snapshots existing though"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    touch /tmp/ztest/mnt/ds3/data ; zfs snapshot testpool/ds3@important
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha || true
    zfs destroy -r testpool/ds3
)
story "test snapshot incremental"
story2 "sending a sequence of incrementals works"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@beta testpool/ds1@gamma | zfs recv testpool/ds3@gamma
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test -f /tmp/ztest/mnt/ds3/gamma
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
)
story2 "sending a sequence with gaps is fine" # so '-I' is only if you care about getting the intermediate snapshots over too; it's not a correctness thing.
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@alpha testpool/ds1@gamma | zfs recv testpool/ds3@gamma
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test -f /tmp/ztest/mnt/ds3/gamma
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
)
story2 "sending an incremental with more than needed is NOT FINE"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@alpha testpool/ds1@gamma | zfs recv testpool/ds3@gamma || true
    zfs destroy -r testpool/ds3
)
story2 "sending an incremental to a volume with working tree changes is NOT FINE"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    touch /tmp/ztest/mnt/ds3/beta
    zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta || true
    zfs destroy -r testpool/ds3
)
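# One way to spot the "working tree changes" case before attempting an
# incremental recv; a sketch only, defined here but never called.
# working_tree_is_clean is a made-up name, and it assumes `zfs diff` reports
# every change that recv would object to, which may not hold (access-time
# updates, for example, may not show up as a diff line).
# Returns 0 if clean, 1 if changed, 2 if there is no snapshot to compare against.
working_tree_is_clean() {
    local ds=$1 latest
    # snapshots of this dataset only (-d 1), oldest first, so the last line is the newest
    latest=$(zfs list -H -o name -t snapshot -s creation -d 1 "$ds" | tail -n 1)
    [ -n "$latest" ] || return 2
    # zfs diff prints one line per changed path; no output means no visible changes
    [ -z "$(zfs diff "$latest")" ]
}
# usage would look something like: working_tree_is_clean testpool/ds3 && zfs send -i ... | zfs recv ...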
story2 "sending an incremental the other way is fine" # as long as the working tree is unmodified of course
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs create -o mountpoint=/tmp/ztest/mnt/ds4 testpool/ds4
    touch /tmp/ztest/mnt/ds3/alpha ; zfs snapshot testpool/ds3@alpha
    zfs send -v testpool/ds3@alpha | zfs recv -F testpool/ds4@alpha
    test -f /tmp/ztest/mnt/ds4/alpha
    test ! -f /tmp/ztest/mnt/ds4/beta
    touch /tmp/ztest/mnt/ds4/beta ; zfs snapshot testpool/ds4@beta
    zfs send -v -i alpha testpool/ds4@beta | zfs recv testpool/ds3
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
    zfs destroy -r testpool/ds4
)
story2 "sending a sequence with gaps and then filling it in later is NOT FINE" # i guess once you get over the fact that snapshots are considered linear relatives and checked out immediately, this isn't surprising anymore
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@alpha testpool/ds1@gamma | zfs recv testpool/ds3@gamma
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test -f /tmp/ztest/mnt/ds3/gamma
    zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta || true
    zfs destroy -r testpool/ds3
)
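# The incremental tests above all hinge on picking the right base snapshot, so
# here is a sketch of finding it mechanically (defined only, never called).
# Assumptions: latest_common_snapshot is a made-up name, it needs bash 4+ for
# associative arrays, and it leans on the snapshot "guid" property surviving
# send/recv. Names alone prove nothing about shared history. The result is only
# usable as an -i base if it is also the newest snapshot on the destination.
latest_common_snapshot() {
    local src=$1 dst=$2 guid name found=
    local -A dst_guids=()
    # collect the guids of every snapshot the destination already has
    while IFS=$'\t' read -r guid name; do
        dst_guids[$guid]=1
    done < <(zfs list -H -o guid,name -t snapshot -s creation -d 1 "$dst")
    # walk the source oldest-to-newest and keep the last one the destination knows
    while IFS=$'\t' read -r guid name; do
        if [ -n "${dst_guids[$guid]:-}" ]; then found=$name; fi
    done < <(zfs list -H -o guid,name -t snapshot -s creation -d 1 "$src")
    if [ -n "$found" ]; then echo "$found"; else return 1; fi
}
# usage would look something like: base=$(latest_common_snapshot testpool/ds1 testpool/ds3)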
story "test snapshot mix-n-match"
story2 "it's a total nonstarter"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    zfs send -v testpool/ds2@delta | zfs recv -F testpool/ds3@delta || true
    zfs destroy -r testpool/ds3
)
story2 "unless you're willing to throw everything away, which of course works" # but then what's the point
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    zfs destroy testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha # destroying the snapshot leaves the working change (sane); the -F on the recv tosses it again.
    zfs send -v testpool/ds2@delta | zfs recv -F testpool/ds3@delta || true
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
)
story2 "even if the contents are nearly aligned; close (correctly) doesn't count"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/delta
    zfs send -v -i testpool/ds2@alpha testpool/ds2@delta | zfs recv testpool/ds3@delta || true
    zfs destroy -r testpool/ds3
)
story2 "if the contents are exactly aligned by earlier snap&cloning, THEN it works... but that irrevocably alters the snapshot set on the other dataset you cloned."
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    zfs clone testpool/ds1@beta testpool/ds4
    zfs promote testpool/ds4 # look at 'zfs list'; this STEALS @alpha and @beta snapshots from ds1 !!!
    zfs list -r testpool -tall # yes indeedy, this clone+promote operation *mutated* our test fixture data. soooooo good!
    test -f /tmp/ztest/mnt/ds1/beta # but ds1 had its history rewritten so that it still has the final content, for whatever that's worth to your sanity.
    zfs send -v -i testpool/ds4@alpha testpool/ds4@beta | zfs recv testpool/ds3@beta
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
    zfs destroy -R testpool/ds4 # OH MY GOD, EVEN BETTER: ds1 *became a dependent of ds4*, so i can't remove ds4 without destroying ds1.
)
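# The 'destroy -R' collateral damage above can be caught ahead of time with a
# dry run. A sketch under assumptions, defined only and never called:
# destroy_unless_collateral is a made-up name, and it assumes `zfs destroy -nvR`
# names each victim as the last whitespace-separated word of a line
# (something like "would destroy pool/fs@snap").
destroy_unless_collateral() {
    local target=$1 protected=$2 victims
    # -n: dry run, -v: print what would go, -R: include dependents such as clones
    victims=$(zfs destroy -nvR "$target") || return 1
    # strip any @snapshot suffix and compare dataset names exactly
    if awk -v p="$protected" '{ name = $NF; sub(/@.*$/, "", name); if (name == p) found = 1 } END { exit !found }' <<<"$victims"; then
        echo "refusing: destroying $target would also take out $protected" >&2
        return 1
    fi
    zfs destroy -R "$target"
}
# usage would look something like: destroy_unless_collateral testpool/ds4 testpool/ds1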
story "hang on while I scrape my brains from the walls"
story2 "prepare some input fixture data... again"
(set -x
    # zfs destroy -r testpool/ds1 # don't worry, removing ds4 already did this :D <secretlybloodrage>
    zfs create -o mountpoint=/tmp/ztest/mnt/ds1 testpool/ds1
    touch /tmp/ztest/mnt/ds1/alpha ; zfs snapshot testpool/ds1@alpha
    touch /tmp/ztest/mnt/ds1/beta ; zfs snapshot testpool/ds1@beta
    touch /tmp/ztest/mnt/ds1/gamma ; zfs snapshot testpool/ds1@gamma
)
story2 "if the contents are exactly aligned by earlier snap&send+recv, then a snapshot can be copied and applied transitively"
(set -x
    zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
    test -f /tmp/ztest/mnt/ds3/alpha
    test ! -f /tmp/ztest/mnt/ds3/beta
    # cloning can't really help here since it doesn't make it possible to bring snapshots along, so let's do more intermediate send+recv instead
    zfs create -o mountpoint=/tmp/ztest/mnt/ds4 testpool/ds4
    zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds4@alpha
    zfs send -v -i alpha testpool/ds1@beta | zfs recv testpool/ds4@beta
    zfs send -v -i testpool/ds4@alpha testpool/ds4@beta | zfs recv testpool/ds3@beta
    test -f /tmp/ztest/mnt/ds3/alpha
    test -f /tmp/ztest/mnt/ds3/beta
    test ! -f /tmp/ztest/mnt/ds3/gamma
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
    zfs destroy -r testpool/ds4
)
story "let's look at dedup and on disk sizes"
story2 "set up bigger fixture data"
zfs destroy -r testpool/ds1
zfs destroy -r testpool/ds2
(set -x
    zfs list -r testpool -tall # i do hope this is empty
    zfs create -o mountpoint=/tmp/ztest/mnt/ds1 testpool/ds1
    head -c 3M /dev/urandom > /tmp/ztest/mnt/ds1/alpha ; zfs snapshot testpool/ds1@alpha
    head -c 5M /dev/urandom > /tmp/ztest/mnt/ds1/beta ; zfs snapshot testpool/ds1@beta
    head -c 7M /dev/urandom > /tmp/ztest/mnt/ds1/gamma ; zfs snapshot testpool/ds1@gamma
)
story2 "it takes up normal space with just one dataset and snapshots"
(set -x
    zfs list -r testpool -tall
)
story2 "impacts of clone? no new real consumption, just references"
(set -x
    zfs clone testpool/ds1@beta testpool/ds3
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
)
story2 "impacts of send+recv bounce? new actual consumption, in full cost." # this is super lame
(set -x
    zfs send -v testpool/ds1@beta | zfs recv testpool/ds3@beta
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
)
story2 "dream on with dedup? nah, still does jack." # why do you even exist
(set -x
    zfs set dedup=verify testpool
    zfs send -vD testpool/ds1@beta | zfs recv testpool/ds3@beta
    zfs list -r testpool -tall
    zfs destroy -r testpool/ds3
)
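# Eyeballing 'zfs list' works, but the clone vs. send+recv vs. dedup comparison
# above could also be put into numbers by sampling pool allocation around a
# command. A sketch, defined only and never called: alloc_delta is a made-up
# name, and it assumes `zpool list -Hp -o allocated` prints a single exact byte
# count. Space accounting can lag behind the command, so treat small deltas as noise.
alloc_delta() {
    local pool=$1 before after
    shift
    before=$(zpool list -Hp -o allocated "$pool") || return 1
    # run the operation under test; its own output is not interesting here
    "$@" >/dev/null || return 1
    after=$(zpool list -Hp -o allocated "$pool") || return 1
    echo $(( after - before ))
}
# usage would look something like: alloc_delta testpool zfs clone testpool/ds1@beta testpool/ds3
# (a send|recv pipeline would have to be wrapped in a function or 'bash -c' first)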