
@gmaclennan
Created August 7, 2018 23:16
Ideas for a file format for mapeo sync

Problem

We want to be able to synchronize the data in mapeo-core to disk so that it can be transported via "sneakernet" to another device. This would most likely be on a USB flash drive, but it could also be sent over email or a local Bluetooth connection.

Currently safe-fs-blob-store - an implementation of abstract-blob-store - is used for storing media (attachments) in the mapeo-core database. On disk the blobs are stored as files and folders. This is potentially a problem for moving the archive around on disk, because a user could easily move or delete files and folders, whether accidentally or intentionally, resulting in data loss and a corrupt database.

Requirements

Ideally we would have a single file containing all attachments and all database data that can be used to sync between machines.

Limitations

We want to use this file for two-way sync: reading new data from it and writing new data to it. Currently in hyperlog-sneakernet-replicator we store a leveldb database in a tarball: we extract the files, sync, and re-tarball the files. This would not work with the attachments database because of space limitations on user devices. mapeo-core is used on mobile devices which often have only 2-5GB of free space. A user with 1GB of photo and video attachments would need disk space for three copies of the media: the files in their own database, everything packed in the sync file, and the same data again extracted from the archive.

Most archive formats are read-only or append-only (e.g. tar), and they do not support random access. One exception is dictzip, which is compatible with gzip. See also idzip.

We also need to work within the limitations of the FAT32 disk format, which is the most common format for USB flash drives. FAT32 has a maximum file size of 4GB. A media library could easily grow beyond this, so keeping everything in a single sync file may be impossible for sneakernet.

Solutions

If we are going to sync a folder of files over sneakernet we need good checks in place to detect moved or deleted files, recover where possible, and otherwise fail gracefully. One solution would be to generate an index file in the root of the attachments folder listing the files in the store. We could use this as an integrity check to make sure no files are missing and none have been added by an outside process. We could also include hashes of the files to verify their contents, but hashing comes at a performance cost, and it is not yet clear which problem it would be solving.

What would this integrity-check file look like? How would we handle missing files, renamed files, deleted files, moved files, and new unexpected files?
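As a starting point, here is a minimal sketch of such an index: a JSON file listing each blob's relative path and size, plus a check that reports missing and unexpected files. The index name `.mapeo-index.json` and the function names are illustrative assumptions, not an existing mapeo API.

```js
const fs = require('fs')
const path = require('path')

const INDEX_NAME = '.mapeo-index.json' // hypothetical name

// Walk the attachments folder, recording each file's relative path and size.
function buildIndex (root) {
  const entries = {}
  ;(function walk (dir) {
    for (const name of fs.readdirSync(dir)) {
      if (name === INDEX_NAME) continue
      const abs = path.join(dir, name)
      const stat = fs.statSync(abs)
      if (stat.isDirectory()) walk(abs)
      else entries[path.relative(root, abs)] = { size: stat.size }
    }
  })(root)
  return entries
}

// Write the index into the root of the attachments folder.
function writeIndex (root) {
  const indexPath = path.join(root, INDEX_NAME)
  fs.writeFileSync(indexPath, JSON.stringify(buildIndex(root), null, 2))
}

// Compare the folder against the saved index: files in the index but not on
// disk are "missing"; files on disk but not in the index are "unexpected".
function verifyStore (root) {
  const expected = JSON.parse(fs.readFileSync(path.join(root, INDEX_NAME), 'utf8'))
  const actual = buildIndex(root)
  return {
    missing: Object.keys(expected).filter(f => !(f in actual)),
    unexpected: Object.keys(actual).filter(f => !(f in expected))
  }
}
```

A renamed or moved file would show up as one missing entry plus one unexpected entry; matching those back up (e.g. by size or hash) is where per-file hashes would start to earn their cost.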

If we are going to sync to a single file, putting aside the 4GB limit, we could use a tarball for the attachment store. Our attachment database is immutable: photos, videos etc. are never modified or deleted. We could write an abstract-blob-store implementation, tar-blob-store, that stores everything in a tarball and maintains a separate index file pointing to file offsets, allowing quick random access. This index file would not need to be included with the sync file, since it could be regenerated relatively quickly when needed for sync: a simple indexer could walk the tarball and write the filenames and offsets to a simple file format, or keep them in memory. Here is a six-year-old implementation.
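For illustration, the indexer could look something like the sketch below: it walks the 512-byte tar headers and records each entry's data offset and size. This is a hedged sketch assuming plain ustar entries (regular files only, no pax or GNU long-name extensions).

```js
const fs = require('fs')

// Build a { name: { offset, size } } map by walking tar headers.
function indexTarball (tarPath) {
  const fd = fs.openSync(tarPath, 'r')
  const header = Buffer.alloc(512)
  const index = {}
  let offset = 0
  while (fs.readSync(fd, header, 0, 512, offset) === 512) {
    if (header[0] === 0) break // an all-zero block marks end-of-archive
    const name = header.toString('utf8', 0, 100).replace(/\0.*$/, '')
    // The size field is a NUL/space-padded octal string at bytes 124-135.
    const size = parseInt(header.toString('utf8', 124, 136).replace(/\0/g, ' ').trim(), 8)
    index[name] = { offset: offset + 512, size } // data starts after the header
    offset += 512 + Math.ceil(size / 512) * 512 // data is padded to 512-byte blocks
  }
  fs.closeSync(fd)
  return index
}
```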

We still have two files though: the blob-store tarball and the p2p-db tarball. We could instead define a mapeo-sync file format that is compatible with tar. It would simply be another tarball, in three sections:

  1. A fixed-length file (e.g. 512 bytes) which is always first in the tarball and identifies the offset of the start of the p2p-db data. Since it is fixed-length and always at the start, we could rewrite this data in place without re-writing the rest of the tarball (see the pointer read/write sketch after this list).
  2. Followed by all the blob-store files
  3. Followed by all the p2p-db files
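To make section 1 concrete, here is a hedged sketch of reading and rewriting that fixed-length pointer entry in place. The JSON body and the assumption that the pointer body always sits at byte offset 512 (immediately after the first entry's tar header) are illustrative choices, not a specified format; it also assumes the sync file was created with this entry already first in the archive.

```js
const fs = require('fs')

// First entry = 512-byte tar header + 512-byte body, so the pointer body
// always lives at byte offset 512.
const POINTER_BODY_OFFSET = 512

function writePointer (syncFilePath, p2pDbOffset) {
  const body = Buffer.alloc(512) // NUL padding keeps the length fixed
  body.write(JSON.stringify({ p2pDbOffset }))
  const fd = fs.openSync(syncFilePath, 'r+')
  fs.writeSync(fd, body, 0, 512, POINTER_BODY_OFFSET)
  fs.closeSync(fd)
}

function readPointer (syncFilePath) {
  const body = Buffer.alloc(512)
  const fd = fs.openSync(syncFilePath, 'r')
  fs.readSync(fd, body, 0, 512, POINTER_BODY_OFFSET)
  fs.closeSync(fd)
  return JSON.parse(body.toString('utf8').replace(/\0+$/, ''))
}
```

Because only the entry's body changes and its length stays at exactly 512 bytes, the entry's tar header (and its checksum, which covers only the header) never needs to be rewritten.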

For sync we would not need to extract the blob-store files. Instead we could (sketched below):

  1. Extract the p2p-db files from the end of the tarball.
  2. Truncate the tarball to remove them.
  3. Sync, appending additional blob-store files to the end.
  4. Re-write the newly synced p2p-db files to the end again.
  5. Finally, write the new p2p-db offset to the fixed-length file at the start of the tarball.
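Put together, the sequence could look roughly like this sketch, reusing readPointer/writePointer from the pointer sketch above. tarEntry is a hypothetical helper that frames a buffer as a complete tar entry (header plus padded body); end-of-archive zero blocks are omitted for brevity.

```js
const fs = require('fs')

function resyncFile (syncFilePath, newBlobs, newP2pDbEntries) {
  const { p2pDbOffset } = readPointer(syncFilePath)
  // Steps 1-2: cut the old p2p-db section (and trailing zero blocks) off the end.
  fs.truncateSync(syncFilePath, p2pDbOffset)
  // Step 3: append any newly synced blob-store entries.
  for (const blob of newBlobs) fs.appendFileSync(syncFilePath, tarEntry(blob))
  // Step 4: the fresh p2p-db section starts wherever the file now ends.
  const newOffset = fs.statSync(syncFilePath).size
  for (const entry of newP2pDbEntries) fs.appendFileSync(syncFilePath, tarEntry(entry))
  // Step 5: point the fixed-length header at the new p2p-db section.
  writePointer(syncFilePath, newOffset)
}
```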

The advantage of this format would be that you could always just change the file-extension to .tar and extract it to see all the files.

We could also use the "special" fixed-length first file to list additional sync files: e.g. if the archive grew beyond 4GB we could have my-db.mapeosync, my-db.mapeosync.1, my-db.mapeosync.2, etc. The header file could include a list of these files and, optionally, a hash of each, so we could check their integrity.
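Building that list with hashes is cheap to sketch, assuming sha256 and a made-up manifest shape:

```js
const crypto = require('crypto')
const fs = require('fs')

// Stream a file through sha256 so large sync files never sit in memory.
function hashFile (filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256')
    fs.createReadStream(filePath)
      .on('error', reject)
      .on('data', chunk => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
  })
}

// parts: ['my-db.mapeosync', 'my-db.mapeosync.1', ...]
async function buildManifest (parts) {
  const files = []
  for (const name of parts) {
    files.push({ name, sha256: await hashFile(name) })
  }
  return { files } // would be serialized into the fixed-length header file
}
```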

Problems / Challenges

It's not actually true that the blob-store is immutable: in the future we want to be able to delete original media from a phone's blob-store to free up space, replacing the images with thumbnails / medium-sized previews. However, that applies to the internal database, not the sync file; it is probably fine that a sync file never has files deleted. Since it is just a replica of data on a device, if it grew too large you could simply generate a new one, and we could implement client-side code to decide whether to include original or thumbnail media in a sync file.

@hackergrrl

@gmaclennan What about the ZIP format? It has some appealing properties:

  • Supports "store" mode (no compression)
  • Has random file access by reading a "central directory record" trailer at the end of the archive
  • Supports file deletion and modification (by omitting entries in the lookup table or changing their offset)
  • Supports archives spanning multiple files (create a virtual union set of all of the ZIP files' central directories)

We could still do your trick of cutting off the p2p db to add more media, since the central directory record is at the end of the file and is easy to regenerate.
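For reference, locating that trailer is simple enough to sketch: scan backwards from the end of the file for the end-of-central-directory signature (0x06054b50) and read the central directory's offset and size from it. This assumes a non-ZIP64 archive.

```js
const fs = require('fs')

function readEndOfCentralDirectory (zipPath) {
  const { size } = fs.statSync(zipPath)
  // The record is 22 bytes plus an optional comment of up to 65535 bytes.
  const tailLen = Math.min(size, 22 + 65535)
  const tail = Buffer.alloc(tailLen)
  const fd = fs.openSync(zipPath, 'r')
  fs.readSync(fd, tail, 0, tailLen, size - tailLen)
  fs.closeSync(fd)
  for (let i = tailLen - 22; i >= 0; i--) {
    if (tail.readUInt32LE(i) === 0x06054b50) {
      return {
        entryCount: tail.readUInt16LE(i + 10),      // total central directory records
        centralDirSize: tail.readUInt32LE(i + 12),  // size of the central directory
        centralDirOffset: tail.readUInt32LE(i + 16) // where the central directory starts
      }
    }
  }
  throw new Error('end of central directory record not found')
}
```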
