Skip to content

Instantly share code, notes, and snippets.

@o11c
Last active March 10, 2016 02:37
Show Gist options
  • Select an option

  • Save o11c/6b2f001ab8d3c81ce517 to your computer and use it in GitHub Desktop.

Select an option

Save o11c/6b2f001ab8d3c81ce517 to your computer and use it in GitHub Desktop.
Thoughts on distribution

Problem:

  • You are deploying a tree from a single master to a broad number of clients.
  • You want to minimize the CPU and network cost to the master.
  • You may use dumb caching mirrors to assist.
  • It is assumed that there is no need to keep any of the objects secret.

Alternatives:

  • rsync: doesn't take advantage of special knowledge of previous versions.
  • git: requires far too much knowledge to be passed when only the latest version is needed at any given moment; does not allow delta-only fetches, might get you rate-limited by github.
  • zip of deltas: keeps too many obsolete files from old versions, but still better than the above. This is what TMW uses.
  • svn: not yet investigated, but doesn't meaningfully support branches (not necessarily needed for stable users, but definitely needed for testers).
  • gittorrent: not yet investigated, https://github.com/cjb/GitTorrent

Definitions:

  • version: a top-level tree, possibly with an additional name (and maybe other metadata). Only versions can be directly diff'ed in requests. Depending on your use case, a version might correspond to a git commit, or possibly a tag.
  • diff: the one-way (like git-diff --irreversible-delete) difference between two trees (see Protocol below). Only version diffs can be requested, but the response may include diffs of subtrees.
  • tree: a set of (name, set of attr, object) tuples
  • object: a tree or blob. Every object is addressable by content (similar to a git hash)
  • blob: the contents of a file.
  • attr: any file attribute, including basics such as "executable", "symlink", as well as extended attributes. Note that "directory" is not included since it is special. Most likely these should be put in a separate blob (look at etckeeper?).
  • master: any machine that knows about all distribution versions. This need not actually be a single machine, but only the mirrors will directly access it.
  • mirror: a dumb caching proxy. Note that some files may be meaningfully cached for a long time, but others may only be used once.
  • client: somebody who wants the latest version, possibly given some older version. A client may also decide (or be forced) to roll back to a previous version, which they may already (or still) have or not, so it should keep several versions around. In particular, it should always keep the require-first versions around, recursively if need be.

Architecture:

            master
           /  |  \
         /    |    \
    mirror  mirror  mirror
    /   \   /
client  client

(Note that mirrors could possibly be replaced by peers (i.e. bittorrent))

Hm, would statically generating all the possible requests (by careful choice of "key" versions?) be plausible to turn mirrors into mere file collections rather than caching proxies?

Design:

  • Through external means, client receives an idea of what version they want.
  • If client is not new, send a want-diff($old, $new) request to the mirrors.
  • If client is new, send want-full($new)
  • If server does not have $old, return full-tree($new.tree)
  • Else, server returns diff-tree($old.tree, $new.tree)
  • Client keeps current version (and any require-first versions) and possibly some older versions (if full-tree rollbacks are meaningful)

Algorithms:

  • full-tree(tree): basically just a tar/zip file (actually, for the initial clone, this probably should just be one, then extract the version from the hash using git tar-commit-id if your versions are commits)
  • diff-tree(old, new): can be basically git-diff -D, but here may be better implementations

Protocol (sent from client to server, as HTTP GET):

  • want-diff (old-version, new-version)
  • want-full (new-version)
  • want-skeleton-diff (old-version, new-version)
  • want-skeleton-full (new-version)
  • want-single-object (hash)

Commands (sent from server to client, in batches, to make it complete):

  • require-first (version): usually sent as the first command in response to want-diff, but can be sent at any time, including more than once, which may necessitate the client initiating additional want-full's. It might not even be the same version as $old in want-diff. All this does is ensure that the object corresponding to old-hash is available for have-diff.
  • have-object (hash, data): full contents of an object (blob or tree, it is oblivious here). If it is a tree, it is probably followed by have-object or have-diff for all of its children.
  • have-diff (old-hash, new-hash, delta): delta contents of an object (blob or tree, it is oblivious here). If it is a tree, it is probably followed by have-diff or have-object for each of its changed children.

Unsolved questions:

  • What about branches/forks? A given master server should not need to keep knowledge about temporary branches, and different master servers might not know about each other's forks. Minimal solution: a "unstable" tag that can be applied to a version, indicating that it might not always be available. The client then requests back to the latest "stable" version.
  • Should there be skeleton (tree-only) options? If so, need to guarantee that diffs for trees will not depend on objects.
  • If there are no skeleton option, is there any point to want-single-object?
  • Should have-object and have-diff be split for trees vs blobs?
  • Should want-single-object be split for trees? Should it have a non-recurse version?
  • Should there be a version-to-tree command?
@o11c

o11c commented Mar 10, 2016

Copy link
Copy Markdown
Author

We need more real-world information in order to answer:

  • Are there going to be enough unused blobs that skeleton fetches are worthwhile?

I've always thought it was dumb that package managers insisted on unconditionally fetching all meta-information, instead of just fetching minimal information (i.e. the package name) and then only fetching the rest when it was actually needed.

Also, with ManaMobile, apparently files are only fetched on an individual basis, but I don't know if that ever had any scalability testing.

@o11c

o11c commented Mar 10, 2016

Copy link
Copy Markdown
Author

It looks like there are a family of rsync-based tools that can take advantage of known versions: rdiff, rdiffdir, rdiff-backup. I'm not sure how well they actually apply directly to this use case, but they may be useful for the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment