Imagine you are a machine learning engineer. You're dealing with large datasets.
You are about to perform a large-scale change to the data. Or even a small one.
You wish you had your data under version control. You wish you had a git for data.
But the data is too big for git, so you usually just back up with a copy.
But here comes ZFS to the rescue! The advantage of ZFS is its copy-on-write design: snapshots and clones are cheap, because unchanged blocks are shared rather than copied.
You need to create a dataset first:
# assuming your root pool is called rpool
zfs create rpool/data
# if you are using legacy mounts, you also need the commands below;
# otherwise rpool/data is already mounted under rpool's mountpoint
mkdir --parents /tmp/data && \
mount -t zfs rpool/data /tmp/data
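If you would rather skip the manual mounting, you can let ZFS manage the mountpoint itself. A small sketch; the path is just an example:
# ZFS creates the directory and mounts the dataset there for you
zfs set mountpoint=/tmp/data rpool/data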
Now you need to put your data into rpool/data. Once you have done this, take a snapshot:
zfs snapshot rpool/data@1
Now you can work on the data. At any time you can roll back to rpool/data@1:
zfs rollback rpool/data@1
You can checkpoint your work by creating extra snapshots:
# do this every time you reach a milestone of changes
zfs snapshot rpool/data@2
zfs snapshot rpool/data@3
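To see what you have so far, list the snapshots. One gotcha: zfs rollback only goes back to the most recent snapshot by default; rolling back further needs -r, which destroys the newer snapshots on the way. A quick sketch:
zfs list -t snapshot -r rpool/data
# roll back past @3 and @2, destroying them in the process
zfs rollback -r rpool/data@1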
What if you want to work on a different variation of the same data?
You just fork!
zfs clone rpool/data@2 rpool/data2
# you may need to mount rpool/data2
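If you are using legacy mounts as above, mounting the clone is the same dance (the path is just an example):
mkdir --parents /tmp/data2 && \
mount -t zfs rpool/data2 /tmp/data2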
You can then work on rpool/data2. If you decide that this clone should be the primary dataset, you can promote it:
zfs promote rpool/data2
Once you are done, you can destroy all clones, snapshots and datasets in that order.
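A minimal sketch of the cleanup, assuming the names used above and that you did not promote the clone (promotion reverses the dependency, so you would destroy the old origin instead):
# the clone first; its origin snapshot cannot be destroyed while the clone exists
zfs destroy rpool/data2
# then the snapshots
zfs destroy rpool/data@3
zfs destroy rpool/data@2
zfs destroy rpool/data@1
# and finally the dataset itself
zfs destroy rpool/data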
The only problem with all of this is that you need to be on a ZFS system.
You may need to use sudo for all of the above. You need to copy the data
into the mounted dataset at the beginning. And you may also have to do some
tedious mounting. There's no way to easily turn a directory into a ZFS dataset
in place: you have to first move that directory somewhere else, create the
dataset at that path, and then copy the data back in.
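A rough sketch of that dance, with placeholder directory and dataset names:
# move the existing directory aside
mv /home/me/big-data /home/me/big-data.orig
# create a dataset mounted at the original path
zfs create -o mountpoint=/home/me/big-data rpool/data/big-data
# copy the data back in, then drop the old copy
rsync -a /home/me/big-data.orig/ /home/me/big-data/
rm -rf /home/me/big-data.orig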
Since I already have rpool and rpool/tmp, I would create an extra dataset
just for these kinds of things, something like rpool/data, and then create
sub-datasets inside it for specific workspaces, like rpool/data/satellite-imagery.
One just has to remember that dataset names are not directly mapped to the
filesystem hierarchy: you can mount rpool/data/satellite-imagery anywhere.
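For example (the project path here is only a placeholder):
# create a workspace dataset and mount it straight into a project
zfs create -o mountpoint=/home/me/projects/satellite/data rpool/data/satellite-imagery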
Alternatively, you could create a /data directory and put everything there.
Then in your projects you would symlink into it. That could be one place to store all the big-data things. Ultimately you are most likely keeping this in a remote object store as a backup anyway.
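A minimal sketch of that layout, with hypothetical paths:
# everything big lives under /data ...
zfs set mountpoint=/data/satellite-imagery rpool/data/satellite-imagery
# ... and each project just symlinks into it
ln -s /data/satellite-imagery /home/me/projects/satellite/data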