We start with an example which results in split brain. Note that this example does not apply to the implementation in etcd/raft because of implementation details which we'll discuss later, but we will then present a way in which the modified example seems to apply to etcd/raft as well. For now, it is enough to imagine an implementation of Raft that fully follows the spec but happens to allow the history presented below.
The anomaly revolves around the fact that log entries may be committed without all peers knowing it (they learn of it via the leader-communicated commit index).
We refer to a log entry on follower X that is committed, but not yet known by that follower to be committed, as "implicitly committed". Implicitly committed entries are common: an entry is committed the moment it is durably appended to the logs of a quorum of replicas, but it is only in the next round of communication from the leader that followers can possibly learn of this fact.
A follower cannot apply an entry that is only implicitly committed. By carrying out two rounds of membership changes that are both merely implicitly committed on some follower, that follower continues to use a configuration lagging two changes behind. This violates the invariant central to the simple membership change protocol, namely that only adjacent membership changes are ever in use concurrently, and it allows that follower to win an election that leads to split brain.
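To make the distinction concrete, here is a minimal Go sketch of the follower-side state involved, using hypothetical types rather than etcd/raft's actual API:

```go
package raftexample

// Entry and StateMachine are hypothetical stand-ins for illustration.
type Entry struct{ Data []byte }

type StateMachine interface{ Apply(Entry) }

// follower holds the per-replica state relevant here. Everything up to
// commitIndex is known-committed; everything up to appliedIndex has been
// applied. An entry that a quorum has durably appended is committed
// cluster-wide even if this follower's commitIndex hasn't caught up yet --
// that is an "implicitly committed" entry.
type follower struct {
	log          []Entry // log[0] is a placeholder; entries start at index 1
	commitIndex  uint64  // highest index this follower *knows* is committed
	appliedIndex uint64  // highest index applied to the state machine
}

// applyReady applies what the follower knows to be committed. Implicitly
// committed entries stay unapplied: the follower cannot tell them apart
// from uncommitted ones.
func (f *follower) applyReady(sm StateMachine) {
	for f.appliedIndex < f.commitIndex {
		f.appliedIndex++
		sm.Apply(f.log[f.appliedIndex])
	}
}
```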
We walk through the example step by step. For the log, bold will denote the **commit index**, and square brackets the [applied index]. For example, in the following log

e1 [e2] e3 **e4** e5

the commit index is the index of the entry e4, but the log has only been applied up to (and including) e2.
Peer 3 is the leader and we start with a log containing a fully replicated empty entry e (this doesn't matter) that all peers know is committed (bold):

ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] |
2 | 1,2,3 | [**e**] |
3 | 1,2,3 | [**e**] |
3 now proposes A1 = ADD_REPLICA(4). It arrives in both 1's and 2's logs, and the leader considers it committed, but the messages informing 1 and 2 of that fact are dropped. We assume 4 comes up and gets caught up all the way.
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 |
3 | 1,2,3,4 | e [**A1**] |
4 | 1,2,3,4 | e [**A1**] |
Peers 3 and 4 commit and apply the first configuration change A1. The leader 3 thus uses it for future replication decisions.
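Sketched below (again with hypothetical types, not actual protocol code) is the apply-time activation scheme this example targets: the configuration only changes when the entry is applied, so a replica that lags in applying keeps making quorum decisions with a stale configuration.

```go
package raftexample

// Hypothetical types for illustration.
type Config map[uint64]bool // set of voter IDs

type LogEntry struct {
	IsConfChange bool
	NewConfig    Config // populated when IsConfChange is set
	Data         []byte
}

type replica struct {
	config Config // the active configuration, used for quorum decisions
}

// applyEntry sketches apply-time activation: the configuration changes
// only when the entry is applied, not when it commits. A replica that is
// slow to apply (or that never learns the commit index) keeps making
// quorum decisions with an arbitrarily stale configuration.
func (r *replica) applyEntry(e LogEntry) {
	if e.IsConfChange {
		r.config = e.NewConfig // activated only now, at apply time
		return
	}
	// ... apply a regular command to the state machine ...
}
```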
Next, Peer 3 wants to carry out A2 = REMOVE_REPLICA(1). It needs three out of four acks for this, and the previous game repeats. Let's say 1 receives neither the entry nor the fact that it commits (once it does); 2 gets the entry but never learns that it commits; and 3 and 4 get the entry and commit it, too.
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 |
3 | 1,2,3,4 | e [A1] **A2** |
4 | 1,2,3,4 | e [A1] **A2** |
A moment later, 3 and 4 apply the config change A2 and begin using it.
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 |
3 | 2,3,4 | e A1 [**A2**] |
4 | 2,3,4 | e A1 [**A2**] |
Now there's a network partition between {1,2} and {3,4}. 2 calls an election and 1 votes for it. Since 2 is using the initial configuration, this is enough for 2 to consider itself the winner, and it steps up as leader.
But 3 also still considers itself leader, and what's more, it is actually able to make progress perfectly well despite there already being a leader at a higher term (2). At this point, all is already lost, but we'll keep going anyway.

Let's say 3 commits and applies some more data records (for example, user writes), which it can do since {3,4} is a quorum of {2,3,4}:
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 |
3 | 2,3,4 | e A1 A2 x y [**z**] |
4 | 2,3,4 | e A1 A2 x y [**z**] |
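The arithmetic behind the split brain is just plain majority quorums over the two configurations in use, which here admit disjoint majorities. A small self-contained check:

```go
package main

import "fmt"

// quorum reports whether acks contains a majority of the voters in cfg.
func quorum(acks map[uint64]bool, cfg []uint64) bool {
	n := 0
	for _, id := range cfg {
		if acks[id] {
			n++
		}
	}
	return n > len(cfg)/2
}

func main() {
	oldCfg := []uint64{1, 2, 3} // still in use by 1 and 2
	newCfg := []uint64{2, 3, 4} // in use by 3 and 4

	// {1,2} is a majority of {1,2,3}: enough for 2 to win its election.
	fmt.Println(quorum(map[uint64]bool{1: true, 2: true}, oldCfg)) // true

	// {3,4} is a majority of {2,3,4}: enough for 3 to keep committing.
	fmt.Println(quorum(map[uint64]bool{3: true, 4: true}, newCfg)) // true

	// The two quorums are disjoint, so neither leader ever hears from
	// the other: split brain.
}
```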
In the meantime, {1,2} also sees some incoming proposals, though they're only queued in the log at the leader 2:
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 a b c |
3 | 2,3,4 | e A1 A2 x y [**z**] |
4 | 2,3,4 | e A1 A2 x y [**z**] |
2 now begins to do the work that has been queued up. It distributes the log to 1 (which is enough to commit it) and lets 1 know:
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [e] A1 A2 a b **c** |
2 | 1,2,3 | [e] A1 A2 a b **c** |
3 | 2,3,4 | e A1 A2 x y [**z**] |
4 | 2,3,4 | e A1 A2 x y [**z**] |
Now 2 applies the newly committed log indexes. First it sees the two configuration changes, which it activates for future quorum decisions (there won't be any in this example), and then it applies a, b, and c to the state machine (which, in particular, tells clients that the commands committed successfully). 1 does the same.
Next, the partition heals. The two leaders get in touch with each other, and one is going to overwrite the other's log, replacing committed records (unless some assertion kills one or both leaders; it doesn't matter -- the damage is done).
There are variations of this argument that use the fact that the commit index known to a peer can regress when peers restart, so moving to commit-time instead of apply-time activation of configuration changes does not prevent this kind of problem. However, apply-time activation is strictly worse. For one, there is no requirement that replicas apply committed changes at all: they can lag behind as much as they want, and can thus use configuration changes many generations old. This alone yields very straightforward counter-examples, though we opted for a more intricate one to show the difficulties in trying to patch the algorithm.
The above examples will not cause problems in the etcd/raft implementation, though this seems largely incidental. In etcd/raft, the commit index is communicated with each new append, which means that appending the second membership change forces the first one to be explicitly committed. This upholds the invariant that only adjacent configurations may be active concurrently, provided the first change is also applied in the same Raft cycle.
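A rough sketch of that mechanism (hypothetical types; in etcd/raft itself, the leader's commit index travels in the Commit field of its append messages):

```go
package raftexample

// Hypothetical stand-ins; the real logic lives in etcd/raft's step
// functions.
type entry struct{ data []byte }

type appendMsg struct {
	entries []entry
	commit  uint64 // the leader's commit index when the message was sent
}

type peer struct {
	log         []entry
	commitIndex uint64
}

// handleAppend sketches why appending the second config change also
// explicitly commits the first: the leader's commit index rides along
// with every append, so a follower that receives A2 simultaneously
// learns that A1 is committed.
func (p *peer) handleAppend(m appendMsg) {
	p.log = append(p.log, m.entries...)
	if c := min(m.commit, uint64(len(p.log))); c > p.commitIndex {
		p.commitIndex = c // A1 stops being merely "implicitly" committed
	}
}
```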
It turns out that an etcd/raft follower does not necessarily apply the config change immediately when it is marked as committed, because etcd/raft limits the size of the total slice of entries handed to userspace for application to the state machine. This mechanism exists because loading too many entries into memory is dangerous, but in effect it allows us to repair the example: by adding regular commands to the log as "fillers" and assuming that 2 runs into this limit, we can arbitrarily increase the number of Raft cycles until 2 actually applies the first configuration change. In the meantime, however, 2 will help commit the second configuration change, so that 3 and 4 can start using it already. This puts us back in the situation needed to end up in split brain: 1 and 2 use the initial config {1,2,3}, while 3 and 4 use the final one, {2,3,4}.
It's also worth pointing out that further decoupling the application of commands to the state machine from their commit is interesting from a performance point of view. As rare as this counterexample is today (never observed, as far as I can tell), it may become less rare in the future.
Note also that there's an "ingredient" we haven't used yet: the commit index may, in principle, regress, because there's no requirement to sync it to disk durably. etcd/raft tends to sync it quite frequently because it asks to have it synced whenever a log entry is appended. This is another area in which future performance improvements may rip open new holes.
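A last sketch (hypothetical types again) of what such a regression looks like:

```go
package raftexample

// Hypothetical stand-ins, as in the earlier sketches.
type record struct{ data []byte }

type node struct {
	log         []record
	commitIndex uint64 // volatile unless the implementation syncs it
}

// crashAndRestart sketches commit-index regression: the log itself was
// durably synced, but the commit index is only as fresh as its last sync.
// Everything between durableCommit and the pre-crash commit index is
// demoted from explicitly to implicitly committed by the restart.
func crashAndRestart(durableLog []record, durableCommit uint64) *node {
	return &node{log: durableLog, commitIndex: durableCommit}
}
```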