- You are deploying a tree from a single master to a broad number of clients.
- You want to minimize the CPU and network cost to the master.
- You may use dumb caching mirrors to assist.
- It is assumed that there is no need to keep any of the objects secret.
- rsync: doesn't take advantage of special knowledge of previous versions.
- git: requires far too much knowledge to be passed when only the latest version is needed at any given moment; does not allow delta-only fetches, might get you rate-limited by github.
- zip of deltas: keeps too many obsolete files from old versions, but still better than the above. This is what TMW uses.
- svn: not yet investigated, but doesn't meaningfully support branches (not necessarily needed for stable users, but definitely needed for testers).
- gittorrent: not yet investigated, https://github.com/cjb/GitTorrent
- version: a top-level tree, possibly with an additional name (and maybe other metadata). Only versions can be directly diff'ed in requests. Depending on your use case, a version might correspond to a git commit, or possibly a tag.
- diff: the one-way (like
git-diff --irreversible-delete) difference between two trees (see Protocol below). Only version diffs can be requested, but the response may include diffs of subtrees. - tree: a set of (name, set of attr, object) tuples
- object: a tree or blob. Every object is addressable by content (similar to a git hash)
- blob: the contents of a file.
- attr: any file attribute, including basics such as "executable", "symlink", as well as extended attributes. Note that "directory" is not included since it is special. Most likely these should be put in a separate blob (look at etckeeper?).
- master: any machine that knows about all distribution versions. This need not actually be a single machine, but only the mirrors will directly access it.
- mirror: a dumb caching proxy. Note that some files may be meaningfully cached for a long time, but others may only be used once.
- client: somebody who wants the latest version, possibly given some older version. A client may also decide (or be forced) to roll back to a previous version, which they may already (or still) have or not, so it should keep several versions around. In particular, it should always keep the require-first versions around, recursively if need be.
master
/ | \
/ | \
mirror mirror mirror
/ \ /
client client
(Note that mirrors could possibly be replaced by peers (i.e. bittorrent))
Hm, would statically generating all the possible requests (by careful choice of "key" versions?) be plausible to turn mirrors into mere file collections rather than caching proxies?
- Through external means, client receives an idea of what version they want.
- If client is not new, send a
want-diff($old, $new)request to the mirrors. - If client is new, send
want-full($new) - If server does not have
$old, returnfull-tree($new.tree) - Else, server returns
diff-tree($old.tree, $new.tree) - Client keeps current version (and any
require-firstversions) and possibly some older versions (if full-tree rollbacks are meaningful)
- full-tree(tree): basically just a tar/zip file (actually, for the initial clone, this probably should just be one, then extract the version from the hash using
git tar-commit-idif your versions are commits) - diff-tree(old, new): can be basically git-diff -D, but here may be better implementations
- want-diff (old-version, new-version)
- want-full (new-version)
- want-skeleton-diff (old-version, new-version)
- want-skeleton-full (new-version)
- want-single-object (hash)
- require-first (version): usually sent as the first command in response to want-diff, but can be sent at any time, including more than once, which may necessitate the client initiating additional want-full's. It might not even be the same version as $old in want-diff. All this does is ensure that the object corresponding to old-hash is available for have-diff.
- have-object (hash, data): full contents of an object (blob or tree, it is oblivious here). If it is a tree, it is probably followed by have-object or have-diff for all of its children.
- have-diff (old-hash, new-hash, delta): delta contents of an object (blob or tree, it is oblivious here). If it is a tree, it is probably followed by have-diff or have-object for each of its changed children.
- What about branches/forks? A given master server should not need to keep knowledge about temporary branches, and different master servers might not know about each other's forks. Minimal solution: a "unstable" tag that can be applied to a version, indicating that it might not always be available. The client then requests back to the latest "stable" version.
- Should there be skeleton (tree-only) options? If so, need to guarantee that diffs for trees will not depend on objects.
- If there are no skeleton option, is there any point to want-single-object?
- Should have-object and have-diff be split for trees vs blobs?
- Should want-single-object be split for trees? Should it have a non-recurse version?
- Should there be a version-to-tree command?
We need more real-world information in order to answer:
I've always thought it was dumb that package managers insisted on unconditionally fetching all meta-information, instead of just fetching minimal information (i.e. the package name) and then only fetching the rest when it was actually needed.
Also, with ManaMobile, apparently files are only fetched on an individual basis, but I don't know if that ever had any scalability testing.