Imagine you are a machine learning engineer. You're dealing with large datasets.
You are about to perform a large-scale change to the data. Or even a small one.
You wish you had your data under version control. You wish you had a git for data.
But the data is too big for git, so you usually just back up with a copy.
But here comes ZFS to the rescue! The advantage of ZFS is its copy-on-write design: snapshots and clones are cheap, because unchanged blocks are shared rather than copied.
You need to create a dataset first:
# assuming your root pool is called rpool
zfs create rpool/data
# if you are using legacy mounts, you also need the commands below;
# otherwise rpool/data is already mounted under rpool's mountpoint
mkdir --parents /tmp/data && \
mount -t zfs rpool/data /tmp/data
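If you would rather skip the manual mounting, you can let ZFS manage the mountpoint itself. A small sketch; the path is just an example:
# ZFS creates the directory and mounts the dataset there for you
zfs set mountpoint=/tmp/data rpool/data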
Now you need to put your data into rpool/data. Once you have done this, take a snapshot:
zfs snapshot rpool/data@1
Now you can work on the data. At any time you can roll back to rpool/data@1:
zfs rollback rpool/data@1
You can checkpoint your work by creating extra snapshots:
# do this every time you reach a milestone of changes
zfs snapshot rpool/data@2
zfs snapshot rpool/data@3
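To see what you have so far, list the snapshots. One gotcha: zfs rollback only goes back to the most recent snapshot by default; rolling back further needs -r, which destroys the newer snapshots on the way. A quick sketch:
zfs list -t snapshot -r rpool/data
# roll back past @3 and @2, destroying them in the process
zfs rollback -r rpool/data@1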
What if you want to work on a different variation of the same data?
You just fork!
zfs clone rpool/data@2 rpool/data2
# you may need to mount rpool/data2
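If you are using legacy mounts as above, mounting the clone is the same dance (the path is just an example):
mkdir --parents /tmp/data2 && \
mount -t zfs rpool/data2 /tmp/data2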
You can then work on rpool/data2. If you decide that this clone should be the primary dataset, you can promote it:
zfs promote rpool/data2
Once you are done, you can destroy all clones, snapshots and datasets in that order.
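A minimal sketch of the cleanup, assuming the names used above and that you did not promote the clone (promotion reverses the dependency, so you would destroy the old origin instead):
# the clone first; its origin snapshot cannot be destroyed while the clone exists
zfs destroy rpool/data2
# then the snapshots
zfs destroy rpool/data@3
zfs destroy rpool/data@2
zfs destroy rpool/data@1
# and finally the dataset itself
zfs destroy rpool/data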
The only problem with all of this is that you need to be on a ZFS system.
You may need to use sudo for all of the above. You need to copy the data
into the mounted dataset at the beginning. And you may also have to do some
tedious mounting. There's no way to easily turn a directory into a ZFS dataset
in place: you have to first move that directory somewhere else, create the
dataset at that path, and then copy the data back in.
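A rough sketch of that dance, with placeholder directory and dataset names:
# move the existing directory aside
mv /home/me/big-data /home/me/big-data.orig
# create a dataset mounted at the original path
zfs create -o mountpoint=/home/me/big-data rpool/data/big-data
# copy the data back in, then drop the old copy
rsync -a /home/me/big-data.orig/ /home/me/big-data/
rm -rf /home/me/big-data.orig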
Since I already have rpool and rpool/tmp, I would create an extra dataset
just for these kinds of things, something like rpool/data, and then create
sub-datasets inside it for specific workspaces, like rpool/data/satellite-imagery.
One just has to remember that dataset names are not directly mapped to the
filesystem hierarchy: you can mount rpool/data/satellite-imagery anywhere.
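For example (the project path here is only a placeholder):
# create a workspace dataset and mount it straight into a project
zfs create -o mountpoint=/home/me/projects/satellite/data rpool/data/satellite-imagery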
Alternatively, you could create a /data directory and put everything there.
Then in your projects you would symlink into it. That could be one place to store all the big-data things. Ultimately you are most likely keeping this in a remote object store as a backup anyway.
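A minimal sketch of that layout, with hypothetical paths:
# everything big lives under /data ...
zfs set mountpoint=/data/satellite-imagery rpool/data/satellite-imagery
# ... and each project just symlinks into it
ln -s /data/satellite-imagery /home/me/projects/satellite/data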