problem: find the smallest diff between a file on the server and the client's copy, so you only have to send the change. rsync does this!
Decades ago, the approach was to hash a sliding window, moving it forward byte by byte. This was very slow (but it can use many cores!).
The naive approach of comparing fixed 8 KiB chunks fails if someone adds a single new byte at the beginning: every boundary after the insertion shifts, so no later chunk matches. This is called the boundary shift problem.
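To see the shift, here's a tiny Haskell illustration (toy 8-byte chunks standing in for 8 KiB; the data and helper are made up for the example):

#+begin_src haskell
-- Fixed-size chunking: one inserted byte shifts every later boundary.
-- Toy 8-byte chunks stand in for 8 KiB ones.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

old, new :: String
old = "aaaaaaaabbbbbbbbcccccccc"
new = 'x' : old   -- a single byte inserted at the front

-- chunksOf 8 old == ["aaaaaaaa","bbbbbbbb","cccccccc"]
-- chunksOf 8 new == ["xaaaaaaa","abbbbbbb","bccccccc","c"]
-- Not one chunk matches, even though nearly all the content is shared.
#+end_src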
Content-defined chunking (CDC) instead derives chunk boundaries from the content itself, creating a boundary wherever the hash satisfies a condition like “ends with two zeros”.
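A minimal sketch of that idea, with a made-up gear-style rolling hash and a “low 12 bits are zero” cut condition (about 4 KiB average chunks); this isn't any particular tool's algorithm:

#+begin_src haskell
import Data.Bits (shiftL, xor, (.&.))
import Data.Word (Word64, Word8)

-- A stand-in mixing table: any fixed pseudo-random byte -> Word64 map will do.
gear :: Word8 -> Word64
gear b = let x = fromIntegral b * 0x9E3779B97F4A7C15
         in x `xor` (x `shiftL` 31)

-- Roll a hash over the stream; the shift pushes old bytes out, so the
-- hash (and hence each boundary) depends only on the last ~64 bytes.
-- Cut wherever the low 12 bits are zero: ~4 KiB expected chunk size.
boundaries :: [Word8] -> [Int]
boundaries = go 0 0
  where
    mask = 0xFFF :: Word64
    go _ _ [] = []
    go h i (b:bs)
      | h' .&. mask == 0 = (i + 1) : rest   -- boundary after this byte
      | otherwise        = rest
      where
        h'   = (h `shiftL` 1) + gear b
        rest = go h' (i + 1) bs
#+end_src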
CDC still has to re-hash the input bytes, which is expensive! Hash functions with incremental “update” operations reduced the hashing cost. But CDC hashing is not associative, so you can’t spread the work across multiple cores: the boundaries of previous blocks affect the current state of the algorithm [fn:1].
Most of the research on CDC has the goal of producing consistently sized chunks roughly within chosen limits. (CDC relates to Merkle Trees, Merkle Search Trees, and more that doesn’t fit into five minutes)
Enter monoidal hashing! (my fav arxiv intro is [fn:2] )
This library implements a putatively-strong hash function H with the useful property that it is a monoid homomorphism. This means there is a cheap operation * such that, given strings s1 and s2, H(s1 ++ s2) = H(s1) * H(s2).
That means we can choose any blocksize we want, use as many cores/SIMD as we want, and it all works!
Cayley hashing is even bit-level rather than byte-level, so it can detect single-bit insertions!
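To make this concrete, here's a toy monoidal hash in Haskell: each input bit maps to a 2×2 matrix and the hash is their product, so associativity does all the work. The modulus and generator matrices are invented for illustration; this is not the library's H and has no cryptographic strength:

#+begin_src haskell
import Data.Bits (testBit)
import Data.Word (Word8)

-- Toy modulus; a real Cayley hash works over a large finite field.
p :: Integer
p = 1000000007

-- 2x2 matrices over Z_p form a monoid under multiplication.
data M = M !Integer !Integer !Integer !Integer deriving (Eq, Show)

instance Semigroup M where
  M a b c d <> M e f g h =
    M ((a*e + b*g) `mod` p) ((a*f + b*h) `mod` p)
      ((c*e + d*g) `mod` p) ((c*f + d*h) `mod` p)

instance Monoid M where
  mempty = M 1 0 0 1          -- identity matrix

-- One generator matrix per input bit (toy choices).
gen :: Bool -> M
gen False = M 1 1 0 1
gen True  = M 1 0 1 1

-- Bit-level hash: walk the bytes most-significant bit first.
hashBits :: [Word8] -> M
hashBits ws = mconcat [ gen (testBit w i) | w <- ws, i <- [7,6..0] ]

-- The homomorphism: hashBits (xs ++ ys) == hashBits xs <> hashBits ys,
-- so any chunking, hashed on any number of cores, combines to the same value.
hashChunked :: Int -> [Word8] -> M
hashChunked n = mconcat . map hashBits . chunksOf n
  where
    chunksOf _ [] = []
    chunksOf k xs = let (a, b) = splitAt k xs in a : chunksOf k b
#+end_src

In this toy the combine step is a single 2×2 modular matrix multiply; real Cayley constructions work over matrices such as SL2 over GF(2^n), which is where the cost question below gets interesting.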
But exactly how do we use this for deduplication? Merkle Trees are immediately subsumed (assuming no further crypto attacks are found; see the history in [fn:2]).
I think there’s some sensible calculation that combines the bandwidth-delay product between the hosts with the size of the file.
How can you be sure to beat the boundary shift problem?
- current inspiration: randomly choose block sizes and offsets, then pitch the hashes across the wire, starting with the largest. The other end does the same, sending some kind of “agreement” message when common chunks are found. (Probably calculate intervals to do the minimum work.) A rough sketch of the messages follows below.
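A very rough sketch of what those messages might carry; every type and field name here is invented for illustration, not a worked-out protocol:

#+begin_src haskell
-- Hypothetical wire messages for the idea above; names are made up.
data Offer hash = Offer
  { offerOffset :: Integer   -- where the block starts in the sender's file
  , offerLength :: Integer   -- the randomly chosen block size
  , offerHash   :: hash      -- monoidal hash of that block
  } deriving (Eq, Show)

data Reply
  = Agree { agreeOffset :: Integer, agreeLength :: Integer }  -- "I have that block too"
  | Pass                                                      -- keep pitching smaller blocks
  deriving (Eq, Show)
#+end_src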
outstanding questions:
- [ ] exactly how cheap is the mappend operation? What’s the cost tradeoff?
[fn:1] A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

[fn:2] Girth of the Cayley graph and Cayley hash functions