We start with an example which results in split brain. Note that this example does not apply to the implementation in etcd/raft because of implementation details which we'll discuss later, but we will then present a way in which the modified example seems to apply to etcd/raft as well. For now, it is enough to imagine an implementation of Raft that fully follows the spec but happens to allow the history presented below.
The anomaly revolves around the fact that log entries may be committed without all peers knowing it (they learn of it via the leader-communicated commit index).
We refer to a log entry on follower X that is committed, but not yet known by that follower to be committed, as "implicitly committed". Implicitly committed entries are common: an entry is committed the moment it is durably appended to the logs of a quorum of replicas, but it is only in the next round of communication from the leader that followers can possibly learn of this fact.
A follower cannot apply an entry that is only implicitly committed. By carrying out two rounds of membership changes that are both merely implicitly committed on some follower, that follower continues to use a configuration lagging two changes behind. This violates the invariant central to the simple membership change protocol, namely that only adjacent membership changes are ever in use concurrently, and it allows that follower to win an election that leads to split brain.
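To make the distinction concrete, here is a minimal Go sketch of the follower-side state involved, using hypothetical types rather than etcd/raft's actual API:

```go
package raftexample

// Entry and StateMachine are hypothetical stand-ins for illustration.
type Entry struct{ Data []byte }

type StateMachine interface{ Apply(Entry) }

// follower holds the per-replica state relevant here. Everything up to
// commitIndex is known-committed; everything up to appliedIndex has been
// applied. An entry that a quorum has durably appended is committed
// cluster-wide even if this follower's commitIndex hasn't caught up yet --
// that is an "implicitly committed" entry.
type follower struct {
	log          []Entry // log[0] is a placeholder; entries start at index 1
	commitIndex  uint64  // highest index this follower *knows* is committed
	appliedIndex uint64  // highest index applied to the state machine
}

// applyReady applies what the follower knows to be committed. Implicitly
// committed entries stay unapplied: the follower cannot tell them apart
// from uncommitted ones.
func (f *follower) applyReady(sm StateMachine) {
	for f.appliedIndex < f.commitIndex {
		f.appliedIndex++
		sm.Apply(f.log[f.appliedIndex])
	}
}
```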
We walk through the example step by step. For the log, bold will denote the **commit index**, and square brackets the [applied index]. For example, in the following log

e1 [e2] e3 **e4** e5

the commit index is the index of the entry e4, but the log has only been applied up to (and including) e2.
Peer 3 is the leader and we start with a log containing a fully replicated empty entry e (this doesn't matter) that all peers know is committed (bold):

ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] |
2 | 1,2,3 | [**e**] |
3 | 1,2,3 | [**e**] |
3 now proposes A1 = ADD_REPLICA(4). It arrives in both 1's and 2's logs, and the leader considers it committed, but the messages informing 1 and 2 of that fact are dropped. We assume 4 comes up and gets caught up all the way.
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 |
3 | 1,2,3,4 | e [**A1**] |
4 | 1,2,3,4 | e [**A1**] |
Peers 3 and 4 commit and apply the first configuration change A1. The leader 3 thus uses it for future replication decisions.
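Sketched below (again with hypothetical types, not actual protocol code) is the apply-time activation scheme this example targets: the configuration only changes when the entry is applied, so a replica that lags in applying keeps making quorum decisions with a stale configuration.

```go
package raftexample

// Hypothetical types for illustration.
type Config map[uint64]bool // set of voter IDs

type LogEntry struct {
	IsConfChange bool
	NewConfig    Config // populated when IsConfChange is set
	Data         []byte
}

type replica struct {
	config Config // the active configuration, used for quorum decisions
}

// applyEntry sketches apply-time activation: the configuration changes
// only when the entry is applied, not when it commits. A replica that is
// slow to apply (or that never learns the commit index) keeps making
// quorum decisions with an arbitrarily stale configuration.
func (r *replica) applyEntry(e LogEntry) {
	if e.IsConfChange {
		r.config = e.NewConfig // activated only now, at apply time
		return
	}
	// ... apply a regular command to the state machine ...
}
```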
Next, Peer 3 wants to carry out A2 = REMOVE_REPLICA(1). It needs three out of four acks for this, and the previous game repeats. Let's say 1 receives neither the entry nor the fact that it commits (once it does); 2 gets the entry but never learns that it commits; and 3 and 4 get the entry and commit it, too.
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 |
3 | 1,2,3,4 | e [A1] **A2** |
4 | 1,2,3,4 | e [A1] **A2** |
A moment later, 3 and 4 apply the config change A2 and begin using it.
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 |
3 | 2,3,4 | e A1 [**A2**] |
4 | 2,3,4 | e A1 [**A2**] |
Now there's a network partition between {1,2} and {3,4}. 2 calls an election and 1 votes for it. Since 2 is using the initial configuration, this is enough for 2 to consider itself the winner, and it steps up as leader.
But 3 also still considers itself leader, and what's more, it is actually able to make progress perfectly well despite there already being a leader at a higher term (2). At this point, all is already lost, but we'll keep going anyway.

Let's say 3 commits and applies some more data records (for example, user writes), which it can do since {3,4} is a quorum of {2,3,4}:
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 |
3 | 2,3,4 | e A1 A2 x y [**z**] |
4 | 2,3,4 | e A1 A2 x y [**z**] |
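The arithmetic behind the split brain is just plain majority quorums over the two configurations in use, which here admit disjoint majorities. A small self-contained check:

```go
package main

import "fmt"

// quorum reports whether acks contains a majority of the voters in cfg.
func quorum(acks map[uint64]bool, cfg []uint64) bool {
	n := 0
	for _, id := range cfg {
		if acks[id] {
			n++
		}
	}
	return n > len(cfg)/2
}

func main() {
	oldCfg := []uint64{1, 2, 3} // still in use by 1 and 2
	newCfg := []uint64{2, 3, 4} // in use by 3 and 4

	// {1,2} is a majority of {1,2,3}: enough for 2 to win its election.
	fmt.Println(quorum(map[uint64]bool{1: true, 2: true}, oldCfg)) // true

	// {3,4} is a majority of {2,3,4}: enough for 3 to keep committing.
	fmt.Println(quorum(map[uint64]bool{3: true, 4: true}, newCfg)) // true

	// The two quorums are disjoint, so neither leader ever hears from
	// the other: split brain.
}
```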
In the meantime, {1,2} also sees some incoming proposals, though they're only queued in the log at the leader 2:
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [**e**] A1 |
2 | 1,2,3 | [**e**] A1 A2 a b c |
3 | 2,3,4 | e A1 A2 x y [**z**] |
4 | 2,3,4 | e A1 A2 x y [**z**] |
2 now begins to do the work that has been queued up. It distributes the log to 1 (which is enough to commit it) and lets 1 know:
ID | Cfg | Log |
---|---|---|
1 | 1,2,3 | [e] A1 A2 a b **c** |
2 | 1,2,3 | [e] A1 A2 a b **c** |
3 | 2,3,4 | e A1 A2 x y [**z**] |
4 | 2,3,4 | e A1 A2 x y [**z**] |
Now 2 applies the newly committed log indexes. First it sees the two configuration changes, which it activates for future quorum decisions (there won't be any in this example), and then it applies a, b, and c to the state machine (which, in particular, tells clients that the commands committed successfully). 1 does the same.
Next, the partition heals. The two leaders get in touch with each other, and one is going to overwrite the other's log, replacing committed records (unless some assertion kills one or both leaders; it doesn't matter -- the damage is done).
There are variations of this argument that use the fact that the commit index known to a peer can regress when peers restart, so moving to commit-time instead of apply-time activation of configuration changes does not prevent this kind of problem. However, apply-time activation is strictly worse. For one, there is no requirement that replicas apply committed changes at all: they can lag behind as much as they want, and can thus use configuration changes many generations old. This alone yields very straightforward counter-examples, though we opted for a more intricate one to show the difficulties in trying to patch the algorithm.
The above examples will not cause problems in the etcd/raft implementation, though this seems largely incidental. In etcd/raft, the commit index is communicated with each new append, which means that appending the second membership change forces the first one to be explicitly committed. This upholds the invariant that only adjacent configurations may be active concurrently, provided the first change is also applied in the same Raft cycle.
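A rough sketch of that mechanism (hypothetical types; in etcd/raft itself, the leader's commit index travels in the Commit field of its append messages):

```go
package raftexample

// Hypothetical stand-ins; the real logic lives in etcd/raft's step
// functions.
type entry struct{ data []byte }

type appendMsg struct {
	entries []entry
	commit  uint64 // the leader's commit index when the message was sent
}

type peer struct {
	log         []entry
	commitIndex uint64
}

// handleAppend sketches why appending the second config change also
// explicitly commits the first: the leader's commit index rides along
// with every append, so a follower that receives A2 simultaneously
// learns that A1 is committed.
func (p *peer) handleAppend(m appendMsg) {
	p.log = append(p.log, m.entries...)
	if c := min(m.commit, uint64(len(p.log))); c > p.commitIndex {
		p.commitIndex = c // A1 stops being merely "implicitly" committed
	}
}
```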
It turns out that an etcd/raft follower does not necessarily apply the config change immediately when it is marked as committed, because etcd/raft limits the size of the total slice of entries handed to userspace for application to the state machine. This mechanism exists because loading too many entries into memory is dangerous, but in effect it allows us to repair the example: by adding regular commands to the log as "fillers" and assuming that 2 runs into this limit, we can arbitrarily increase the number of Raft cycles until 2 actually applies the first configuration change. In the meantime, however, 2 will help commit the second configuration change, so that 3 and 4 can start using it already. This puts us back in the situation needed to end up in split brain: 1 and 2 use the initial config {1,2,3}, while 3 and 4 use the final one, {2,3,4}.
It's also worth pointing out that further decoupling the application of commands to the state machine from their commit is interesting from a performance point of view. As rare as this counterexample is today (never observed, as far as I can tell), it may become less rare in the future.
Note also that there's an "ingredient" we haven't used yet: the commit index may, in principle, regress, because there's no requirement to sync it to disk durably. etcd/raft tends to sync it quite frequently because it asks to have it synced whenever a log entry is appended. This is another area in which future performance improvements may rip open new holes.
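A last sketch (hypothetical types again) of what such a regression looks like:

```go
package raftexample

// Hypothetical stand-ins, as in the earlier sketches.
type record struct{ data []byte }

type node struct {
	log         []record
	commitIndex uint64 // volatile unless the implementation syncs it
}

// crashAndRestart sketches commit-index regression: the log itself was
// durably synced, but the commit index is only as fresh as its last sync.
// Everything between durableCommit and the pre-crash commit index is
// demoted from explicitly to implicitly committed by the restart.
func crashAndRestart(durableLog []record, durableCommit uint64) *node {
	return &node{log: durableLog, commitIndex: durableCommit}
}
```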