@warpfork
Last active August 29, 2015 14:15
zfs is actually not super cool
#!/bin/bash
# Summary: ZFS snapshots have a variety of remarkable limitations.
#
# - "recv" has no way to just accept snapshot data without actually checking it out onto the volume working tree.
# - it's probably going to have to be coddled by having a spare dataset off to the side that does network receives.
# - "rollback" to a snapshot is unsafe for anything other than depth=1; with -r it destroys every newer snapshot along the way.
# - it has to be replaced by forking, and figuring out how to do an atomic replace. ("zfs rename" appears to be up to the task, but do note that it's precisely the atomicity level of any other mount operation; no more no less.)
# - "clone" operations, which you'd think would be the right choice for "forking" a dataset, are just bonkers: clone causes data dependencies (deletes get weird) and the clones aren't capable of receiving incremental updates (bake your noodle on that one).
# - the "promote" operation screws with the persistency of the thing you cloned *from* as well as the clone; at that point, the clone can receive incrementals, but the source *can't* (and there's no way to teach it how again, since snapshot ordering is sorta like an append-only list).
# - to be remotely sane, it looks like forking has to be implemented by a bounce through send/recv.
#
# Trying to build anything equivalent to "git checkout [branch]" (or rather, "git reset --hard [branch]")...
# in other words, something that just sets the content of a volume without destroying its ability to fetch updates...
# appears to be something that zfs is fighting tooth and nail to make into a complete mess.
# The weirdness has deep roots:
# - first and foremost, snapshots can't be transmitted without the receive being targeted at a dataset; given that receive is a disruptively mutating action, that pretty much means we need spare datasets to buffer it.
# - compounding that, snapshots aren't perceived as content-addressable blobstore tags; they're perceived as having a linear relationship with each other.
# - mix those together and it means that taking new snapshots can screw with your ability to receive snapshots.
# - unless you copy a bunch of the snapshots to a new dataset and resume from there, which is a strangely difficult request.
# As far as keeping actual storage costs in hand / getting dedup to DTRT on disk:
# - "clone" operations give you a new dataset without problems (though as previously mentioned, basically lose their wits when it comes to updating).
# - a send+recv bounce causes full new consumption problems. not very cool.
# So evidently, you get to choose between "delete" not being a working concept and accepting completely unnecessary storage bloat. :I
# So snapshots have, in total, very few legal operations that don't have ridiculous costs.
# Solutions?
# ...
# Mostly, give up.
# Incrementals can be had by remembering the dataset+snapshot that was last diverged from. Nothing else. Transitives are possible but get weird fast and there are lots of operations that are landmines.
# That means there's a permanent relationship to be remembered by the recipient. And the sender side is basically at the mercy of the recipient to garbage collect his own previous snapshot demands. (Wanna implement reference counting for maybe-hopefully-still-alive remote hosts that might have a claim on a resource? I don't.)
# The thing that really bugs me is that snapshots, in ZFS, just can't be treated as standalone bags of data. And you'd really think they should be.
# And the number of runtime "this doesn't work" or "this is going to cause previously transferred data to be discarded" or "this is gonna have to quietly do a non-incremental/worst-case thing" outcomes is huge.
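The bookkeeping burden described above (every incremental needs the last common snapshot remembered per recipient) can be sketched in plain shell. This is a minimal sketch, not part of the test script below; the state-file format, function names, and the "backuphost" recipient are my own invention, and the zfs command line is only printed, not executed.

```shell
# Hypothetical sender-side bookkeeping: remember, per recipient, which
# snapshot was last synced, so the next send can use "-i <last common snap>".
statefile=$(mktemp)

record_sync() { # record_sync <recipient> <dataset@snap>
  echo "$1 $2" >> "$statefile"
}

last_synced() { # last_synced <recipient> -> last recorded dataset@snap
  awk -v r="$1" '$1 == r { last = $2 } END { print last }' "$statefile"
}

plan_incremental() { # plan_incremental <recipient> <new dataset@snap>
  local base
  base=$(last_synced "$1")
  if [ -n "$base" ]; then
    # strip "dataset@" to get the bare snapshot name for -i
    echo "zfs send -i ${base#*@} $2 | ssh $1 zfs recv ..."
  else
    # no common snapshot on record: only a full, non-incremental send works
    echo "zfs send $2 | ssh $1 zfs recv ..."
  fi
}

record_sync backuphost testpool/ds1@alpha
plan_incremental backuphost testpool/ds1@beta
# → zfs send -i alpha testpool/ds1@beta | ssh backuphost zfs recv ...
```

This is exactly the "permanent relationship remembered by the recipient" problem in miniature: lose the state file (or the base snapshot on either side) and you're back to a full send.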
set -eo pipefail
function story { echo -e "\n\E[1;34m===" "$@" "=========>\E[0m"; }
function story2 { echo -e "\E[0;34m---" "$@" "--->\E[0m"; }
story "cleanup"
(set -x
zpool destroy -f testpool
rm -rf /tmp/ztest/
mkdir /tmp/ztest/
)
story "new zpool and datasets"
(set -x
truncate /tmp/ztest/zfs-testpool.dev -s 1073741824
zpool create -m none testpool /tmp/ztest/zfs-testpool.dev
)
story "prepare some input fixture data"
story2 "setup test data and snapshots"
(set -x
# most tests use ds1
zfs create -o mountpoint=/tmp/ztest/mnt/ds1 testpool/ds1
touch /tmp/ztest/mnt/ds1/alpha ; zfs snapshot testpool/ds1@alpha
touch /tmp/ztest/mnt/ds1/beta ; zfs snapshot testpool/ds1@beta
touch /tmp/ztest/mnt/ds1/gamma ; zfs snapshot testpool/ds1@gamma
# ds2 is here so we can see what happens when "unrelated" snapshots are introduced
zfs create -o mountpoint=/tmp/ztest/mnt/ds2 testpool/ds2
touch /tmp/ztest/mnt/ds2/alpha ; zfs snapshot testpool/ds2@alpha
touch /tmp/ztest/mnt/ds2/delta ; zfs snapshot testpool/ds2@delta
)
story "test snapshot wholesale"
# backstory: there are two ways to receive a new snapshot... you can:
# - skip the create and let the recv do it for you implicitly (but then you have to deal with the mounting separately; *there's no '-o' option that recv can pass through*; and to double the fun, you can't use `mount` on filesystems like you can with snapshots, because zfs is just a fractal of trolling)
# - do recv with an -F (which may be fairly arbitrarily destructive)
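The two options above can be sketched side by side. This is a dry-run sketch: the commands are echoed rather than executed, since they need a live pool and root, and the dataset names just mirror the ones used below.

```shell
# Echo each command instead of running it, so the two paths can be compared.
run() { echo "+ $*"; }

# Option 1: let recv create the dataset implicitly, then set the mountpoint
# as a separate step afterwards (recv has no -o passthrough here).
run zfs recv testpool/ds3@gamma
run zfs set mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3

# Option 2: create the dataset (with its mountpoint) first, then force the
# receive with -F, which will happily clobber the working tree.
run zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
run zfs recv -F testpool/ds3@gamma
```

Neither path is clean: option 1 splits mounting off into a second step, option 2 needs a destructive flag.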
story2 "play them all in one push (should be big) to an existing dataset"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@gamma | zfs recv -F testpool/ds3@gamma
zfs list -r testpool -tall
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test -f /tmp/ztest/mnt/ds3/gamma
zfs destroy -r testpool/ds3
)
story2 "play them all in one push (should be big) to an implicitly created dataset"
(set -x
zfs send -v testpool/ds1@gamma | zfs recv testpool/ds3@gamma
zfs set mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3 # also does the mount, god forbid it just set a property like the manpage says and then exit
zfs list -r testpool -tall
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test -f /tmp/ztest/mnt/ds3/gamma
zfs destroy -r testpool/ds3
)
story2 "'recv -F' is scary and will destroy working tree data"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
touch /tmp/ztest/mnt/ds3/data
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
zfs list -r testpool -tall
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/data
zfs destroy -r testpool/ds3
)
story2 "'recv -F' will back off on snapshots existing though"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
touch /tmp/ztest/mnt/ds3/data ; zfs snapshot testpool/ds3@important
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha || true
zfs destroy -r testpool/ds3
)
story "test snapshot incremental"
story2 "sending a sequence of incrementals works"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@beta testpool/ds1@gamma | zfs recv testpool/ds3@gamma
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test -f /tmp/ztest/mnt/ds3/gamma
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
)
story2 "sending a sequence with gaps is fine" # so '-I' is only if you care about getting the intermediate snapshots over too; it's not a correctness thing.
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@alpha testpool/ds1@gamma | zfs recv testpool/ds3@gamma
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test -f /tmp/ztest/mnt/ds3/gamma
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
)
story2 "sending an incremental with more than needed is NOT FINE"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@alpha testpool/ds1@gamma | zfs recv testpool/ds3@gamma || true
zfs destroy -r testpool/ds3
)
story2 "sending an incremental to a volume with working tree changes is NOT FINE"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
touch /tmp/ztest/mnt/ds3/beta
zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta || true
zfs destroy -r testpool/ds3
)
story2 "sending an incremental the other way is fine" # as long as the working tree is unmodified of course
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs create -o mountpoint=/tmp/ztest/mnt/ds4 testpool/ds4
touch /tmp/ztest/mnt/ds3/alpha ; zfs snapshot testpool/ds3@alpha
zfs send -v testpool/ds3@alpha | zfs recv -F testpool/ds4@alpha
test -f /tmp/ztest/mnt/ds4/alpha
test ! -f /tmp/ztest/mnt/ds4/beta
touch /tmp/ztest/mnt/ds4/beta ; zfs snapshot testpool/ds4@beta
zfs send -v -i alpha testpool/ds4@beta | zfs recv testpool/ds3
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
zfs destroy -r testpool/ds4
)
story2 "sending a sequence with gaps and then filling it in later is NOT FINE" # i guess once you get over the fact that snapshots are considered linear relatives and checked out immediately, this isn't surprising anymore
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@alpha testpool/ds1@gamma | zfs recv testpool/ds3@gamma
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test -f /tmp/ztest/mnt/ds3/gamma
zfs send -v -i testpool/ds1@alpha testpool/ds1@beta | zfs recv testpool/ds3@beta || true
zfs destroy -r testpool/ds3
)
story "test snapshot mix-n-match"
story2 "it's a total nonstarter"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
zfs send -v testpool/ds2@delta | zfs recv -F testpool/ds3@delta || true
zfs destroy -r testpool/ds3
)
story2 "unless you're willing to throw everything away, which of course works" # but then what's the point
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
zfs destroy testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha # destroying the snapshot leaves the working change (sane); the -F on the recv tosses it again.
zfs send -v testpool/ds2@delta | zfs recv -F testpool/ds3@delta || true
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
)
story2 "even if the contents are nearly aligned; close (correctly) doesn't count"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/delta
zfs send -v -i testpool/ds2@alpha testpool/ds2@delta | zfs recv testpool/ds3@delta || true
zfs destroy -r testpool/ds3
)
story2 "if the contents are exactly aligned by earlier snap&cloning, THEN it works... but that irrevocably alters the snapshot set on the other dataset you cloned."
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
zfs clone testpool/ds1@beta testpool/ds4
zfs promote testpool/ds4 # look at 'zfs list'; this STEALS @alpha and @beta snapshots from ds1 !!!
zfs list -r testpool -tall # yes indeedy, this clone+promote operation *mutated* our test fixture data. soooooo good!
test -f /tmp/ztest/mnt/ds1/beta # but ds1 had its history rewritten so that it still has the final content, for whatever that's worth to your sanity.
zfs send -v -i testpool/ds4@alpha testpool/ds4@beta | zfs recv testpool/ds3@beta
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
zfs destroy -R testpool/ds4 # OH MY GOD, EVEN BETTER: ds1 *became a dependent of ds4*, so i can't remove ds4 without destroying ds1.
)
story "hang on while I scrape my brains from the walls"
story2 "prepare some input fixture data... again"
(set -x
# zfs destroy -r testpool/ds1 # don't worry, removing ds4 already did this :D <secretlybloodrage>
zfs create -o mountpoint=/tmp/ztest/mnt/ds1 testpool/ds1
touch /tmp/ztest/mnt/ds1/alpha ; zfs snapshot testpool/ds1@alpha
touch /tmp/ztest/mnt/ds1/beta ; zfs snapshot testpool/ds1@beta
touch /tmp/ztest/mnt/ds1/gamma ; zfs snapshot testpool/ds1@gamma
)
story2 "if the contents are exactly aligned by earlier snap&send+recv, then a snapshot can be copied and applied transitively"
(set -x
zfs create -o mountpoint=/tmp/ztest/mnt/ds3 testpool/ds3
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds3@alpha
test -f /tmp/ztest/mnt/ds3/alpha
test ! -f /tmp/ztest/mnt/ds3/beta
# cloning can't really help here since it doesn't make it possible to bring snapshots along, so let's do more intermediate send+recv instead
zfs create -o mountpoint=/tmp/ztest/mnt/ds4 testpool/ds4
zfs send -v testpool/ds1@alpha | zfs recv -F testpool/ds4@alpha
zfs send -v -i alpha testpool/ds1@beta | zfs recv testpool/ds4@beta
zfs send -v -i testpool/ds4@alpha testpool/ds4@beta | zfs recv testpool/ds3@beta
test -f /tmp/ztest/mnt/ds3/alpha
test -f /tmp/ztest/mnt/ds3/beta
test ! -f /tmp/ztest/mnt/ds3/gamma
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
zfs destroy -r testpool/ds4
)
story "let's look at dedup and on disk sizes"
story2 "set up bigger fixture data"
zfs destroy -r testpool/ds1
zfs destroy -r testpool/ds2
(set -x
zfs list -r testpool -tall # i do hope this is empty
zfs create -o mountpoint=/tmp/ztest/mnt/ds1 testpool/ds1
head -c 3M /dev/urandom > /tmp/ztest/mnt/ds1/alpha ; zfs snapshot testpool/ds1@alpha
head -c 5M /dev/urandom > /tmp/ztest/mnt/ds1/beta ; zfs snapshot testpool/ds1@beta
head -c 7M /dev/urandom > /tmp/ztest/mnt/ds1/gamma ; zfs snapshot testpool/ds1@gamma
)
story2 "it takes up normal space with just one dataset and snapshots"
(set -x
zfs list -r testpool -tall
)
story2 "impacts of clone? no new real consumption, just references"
(set -x
zfs clone testpool/ds1@beta testpool/ds3
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
)
story2 "impacts of send+recv bounce? new actual consumption, in full cost." # this is super lame
(set -x
zfs send -v testpool/ds1@beta | zfs recv testpool/ds3@beta
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
)
story2 "dream on with dedup? nah, still does jack." # why do you even exist
(set -x
zfs set dedup=verify testpool
zfs send -vD testpool/ds1@beta | zfs recv testpool/ds3@beta
zfs list -r testpool -tall
zfs destroy -r testpool/ds3
)