Now that we live in the Big Data, Web 3.14159 era, lots of people want to build databases that are too big to fit on a single machine. But there's a problem in the form of the CAP theorem, which states that if your network ever partitions (a machine goes down, or part of the network loses its connection to the rest) then you can keep consistency (all machines return the same answer to queries) or availability (all machines can be read from or written to), but not both. In simple terms, you can either stop and wait until the partition heals before accepting any writes (so-called CP systems, for "consistent and partition-tolerant") or carry on as best you can, knowing that different parts of the system will drift out of sync with each other (so-called AP systems, for "available and partition tolerant"). You can also go for CA, but then your database is effectively restricted to a single machine: real computers crash, and real networks go down.
One workaround for the CAP theorem, which I believe was pioneered at Amazon but which is now widespread, is to go for "eventual consistency": maintain an AP system, but ensure that every update to a given datum is eventually propagated to every node that needs to know about it. But it turns out that this is hard to do accurately. A rather clever approach to achieving eventual consistency, invented by Baquero and Moura in the late Nineties for use in disconnected mobile systems, is to build the eventual consistency into the datatypes themselves. You define an ordering of possible states, pass around full states between nodes, and merge two states by taking their least upper bound (LUB); the properties of the LUB operation ensure that, provided the network is reliable enough to transmit every update eventually to every node at least once, then eventual consistency will be achieved. These are now known as Convergent Replicated Datatypes, or CvRDTs. Another approach, CmRDTs (for Commutative Replicated Datatypes), is based on passing around update operations and proving that concurrent updates commute; this approach requires less network traffic, since operations are typically smaller than full states, but it places stronger demands on the reliability of the network layer (updates can't be duplicated, and must arrive in an order compatible with causality). CvRDTs and CmRDTs are collectively known as CRDTs. The relevant paper here is this technical report from INRIA, which defines CvRDTs and CmRDTs, proves an equivalence theorem between them (which works in the obvious, inefficient way: the "state" of a datum is represented as an ordered sequence of updates from the initial state, and updates which have been received but which cannot yet be applied are held until their precursors are received). It proves that CvRDTs achieve eventual consistency in the presence of networks that eventually deliver every message at least once and that CmRDTs achieve eventual consistency in the presence of networks that deliver every message exactly once in an order consistent with causality, and gives a bunch of examples.
For about a year now, I've been pointing anyone who'd listen at Robin Houston's excellent post On Editing Text, which sketches an approach to distributed version control based on colimits (see this recent paper for a more detailed treatment). Colimits generalise least-upper-bounds and have the same desirable uniqueness properties, but they are operation-based: in fact, least-upper-bounds are the special case of colimits in which the only allowable operation from state a to state b is "assert that a ≤ b". So, I thought, might there be a third class of CRDTs based on colimits, with the network-traffic needs of CmRDTs and the network-reliability needs of CvRDTs?
As you will have gathered from the title of this post, the answer is "alas,
no". I hereby present a proof. We will model a node in our database as a semiautomaton, ie as a device
that receives a sequence of messages from other nodes and applies each one to
its current state to determine its new state. Let Q be the set of states, q0 be
the initial state, M be the set of legal messages, and ev : List M -> Q be the
evaluation function, ie ev ms
is the state obtained by applying the
sequence of messages ms to q0. I'll use Haskell notation throughout, in which
f x
means "the function f applied to x", square brackets are used for lists
and ++ denotes list concatenation. Since ev is the evaluation function of a
semiautomaton, it must have the following property:
- amnesia: ev (xs ++ ys) = ev (zs ++ ys) if ev xs = ev zs.
Intuitively, the amnesia condition says that behaviour only depends on the current state, not the path taken to reach it.
In order to achieve eventual consistency on a network that can arbitrarily duplicate and reorder messages, the semiautomaton (Q, q0, M, ev) must satisfy the following conditions:
- commutativity: ev [x, y] = ev [y, x] (since the network may reorder messages)
- idempotence: ev [x, x] = ev [x] (since the network may duplicate messages)
We define a monoid structure on Q', the subset of Q reachable from q0, as follows:
- q0 is the identity element
- (ev xs) * (ev ys) = ev (xs ++ ys).
Lemma 1 (Q', *, q0) is a well-defined monoid.
Proof suppose ev xs = ev xs' and ev ys = ev ys'. We must show that (ev xs) * (ev ys) = (ev xs') * (ev ys').
(ev xs') * (ev ys')
= ev (xs' ++ ys')
= ev (xs ++ ys') -- by amnesia
= ev (ys' ++ xs) -- by commutativity
= ev (ys ++ xs) -- by amnesia
= ev (xs ++ ys) -- by commutativity
= (ev xs) * (ev ys)
Hence * is well-defined. It inherits associativity from ++, and q0 is an identity for * since q0 = ev [] and [] is an identity for ++. So (Q', *, q0) is a monoid. QED.
Note that this is not the same as the transition monoid commonly encountered in the theory of semiautomata, though no doubt the construction above is standard and related to the transition monoid in some way I wasn't able to find out in twenty minutes on Wikipedia. Henceforth I'll abuse notation in the conventional way and just call our monoid Q'.
We now need to prove a couple of simple results about Q' to set it up for what is to come.
Lemma 2 Q' is idempotent and commutative.
Proof Let x = ev xs and y = ev ys be elements of Q'. We must show that x*x = x and x*y = y*x.
x*x
= ev (xs ++ xs)
= ev xs -- by idempotence and commutativity of ev; exercise)
= x
x* y
= ev (xs ++ ys)
= ev (ys ++ xs)
= y * x
QED.
Now for the meat of the proof.
Lemma 3 Let (A, +) be an idempotent commutative semigroup. For all a, b in A, say that a ≤ b iff a + b = b. Then (A,≤) is a partially ordered set, and a + b is the least upper bound of {a, b} with respect to ≤.
Proof To show that (A, ≤) is a partially ordered set, we must show that ≤ is reflexive and transitive.
Reflexivity: Let a be an element of A. Then a + a = a by idempotence, so a ≤ a as required.
Transitivity: Let a, b and c be elements of A such that a ≤ b and b ≤ c. We must show that a ≤ c.
a + c
= a + (b + c)
= (a + b) + c
= b + c
= c.
So a ≤ c, as required. Hence (A, ≤) is a poset, as required.
a + b is an upper bound of {a, b}: let c = a + b. Then
a + c
= a + (a + b)
= (a + a) + b
= a + b
= c
so a ≤ c, and similarly
b + c
= b + (a + b)
= b + (b + a)
= (b + b) + a
= b + a
= a + b
= c
so c is an upper bound for {a, b}.
a + b is a least upper bound for {a, b}: Let c = a + b as before, and let d be another upper bound for {a, b}. We must show that c ≤ d.
c + d
= (a + b) + d
= a + (b + d)
= a + d
= d
so c ≤ d, as required. QED.
Corollary 4 Every idempotent commutative semigroup is a semilattice.
Corollary 4 is almost certainly a standard exercise in order theory textbooks, but it was easier for me to prove the result than it was to find a reference.
Putting together Lemma 1 and 2, we see that (Q', q0, *) is an idempotent commutative monoid, and hence (Q', *) is an idempotent commutative semigroup. Hence by Corollary 4, Q' is a semilattice.
So we may conclude that every CRDT that can achieve eventual consistency over a network that can arbitrarily reorder and duplicate messages is a semilattice. This is a partial converse of Proposition 2.1 from the INRIA report.
However, the INRIA definition also requires that messages are full states; using our notation, that M == Q'. We have not proved this; nor is it necessary. For instance, suppose M is the set {0,...,n-1} and Q' is the set of n-bit words, considered as n-element bitsets. The meaning of m ∈ M is "set bit m to 1". This gives us an add-only set. This is a semilattice (with the LUB operation given by set union, or equivalently bitwise "or"), and it will achieve eventual consistency over a network layer that is guaranteed to deliver each message at least once.
What can we say about M? Every element of Q' is expressible as ev [m1, m2, ... m_n] for some sequence of m's; so M is a generating set for Q'. Hopefully there's a well-developed theory of generating sets for semilattices, but I'm currently unaware of it...