Slightly disorganized but reasonably complete notes on the algorithms, strategies and optimizations of the Akka Cluster implementation. Could use a lot more links and context etc., but was just written for my own understanding. Might be expanded later.
Links to papers and talks that have inspired the implementation can be found on the 10 last pages of this presentation.
This is the Gossip state representation:
case class Gossip(
members: immutable.SortedSet[Member],
overview: GossipOverview,
version: VectorClock)
case class GossipOverview(
seen: Set[UniqueAddress],
reachability: Reachability)
- The gossip state is a CRDT (which means that it has a monotonic merge function).
- Gossip includes a
seen
set which represents all nodes that have seen this particular gossip version (versioned by the Vector Clock). - It also includes list of
unreachable
nodes. - Cluster convergence is reached when:
- All nodes are represented in the
seen
set (a more accurate definition would be 'Local proof of a global state in the past is reached when locally a gossip is observed where all nodes are represented in the seen set' but that is a tad more complicated) - We don't have any members that are
unreachable
, or - All
unreachable
members in the set have statusdown
orexiting
.
- All nodes are represented in the
- Using GZIP of gossip state works very well since the gossip state contains a lot of redundant information (strings with URLs etc.).
- Every gossip round is performed by picking a random node with a older or newer version—if it exists—else just any random node.
- The gossiper then sends a gossip message to the selected recipient node in a request/reply conversational fashion.
- The recipient uses the gossip version to determine whether it has a newer, older or conflicting version of the gossip state.
- In the case of conflicting gossip versions the different versions are merged, else
- The gossip with the newest version is kept.
- The recipient adds itself to the
seen
table and sends back the updated gossip state to the gossiper.
- If the gossiper and the recipient have seen the same version of the gossip (i.e the recipient is part of the
seen
table and has the same gossip version as the gossiper) then the gossip state is not sent back to the gossiper. - If less than ½ of nodes resides in the
seen
set (have seen the new state) then the cluster gossips 3 times instead of once every second. - If only 1 node is left until convergence is reached then that node is likely to be flooded with gossip requests. In this case the node enters a protection mode in which it throws gossip requests away.
- If no convergence then it uses a probability of 0.8 (80 %) to gossip to a node not part of the
seen
set—i.e. that has an older version of the state. - For clusters larger than 400 nodes (suggested by empirical evidence) the 0.8 probability is gradually reduced to avoid overwhelming single stragglers with too many concurrent gossip requests.
- If convergence is reached then the cluster only gossips a special
Status(VectorClock)
message, and as soon as there is a change to the cluster (meaning non-convergence) then go back to Biased Gossip again.
- Akka's Failure Detector is an Accrual Failure Detector, but have not seemed to gain much from it. Mainly due to the fact that Garbage Collection in Java ofter causes much longer latency spikes than temporary network drops. It turns out that the configuration option that has the biggest impact on the accuracy is the
acceptable-heartbeat-pause
timeout. The Accrual Failure Detector could perhaps be removed and replaced by a hard timeout. - Every node picks 5 other nodes (configurable) to track heartbeats from. These nodes are picked out of a hashed ordered node ring. This to increase likelihood to monitor across racks and data centers.
- Heartbeats are sent out every second and every heartbeat is performed in a request/reply handshake with the replies used as input to the failure detector.
- A much more advanced but complicated scheme used to be used, where each node only sent a single one-way heartbeat. Each node knew who to expect heartbeat from and could based on this info alone calculate if the sender was
unreachable
or not. This strategy was especially complicated in the start up phase of the cluster and has now been replaced by the simpler request/reply strategy.
- A node that is suspected to be "down" is marked as
unreachable
in the gossip state. - Only one node needs to mark a node
unreachable
to have the rest of the cluster mark that nodeunreachable
. - When one or many nodes are marked as
unreachable
the cluster can no longer reach convergence and the leader can not perform its duties. Then we have reached a so-called 'Split Brain' and the only way forward is to mark theunreachable
node(s) asdown
, or wait for the node to becomereachable
again. This halt only affects the leader in performing its cluster membership management and does not influence the application running on top of the cluster. - A node that is marked as
unreachable
can at any point in time come back toreachable
. This can be seen as a variation of the quarantine model with a suspicion set that is used in the SWIM paper. - A node (or set of nodes) that is
unreachable
can be taken out of the cluster either through adown
operation. When this is done the cluster will reach convergence again and can continue working as normal. - Once a node has been
downed
orremoved
it can not come back toreachable
again but has to be restarted and join the cluster as a new node. It can reuse the samehost:port
but will get a differentuid
. - If system messages cannot be delivered to a node it will be quarantined and then it cannot come back from
unreachable
. This can happen if the there are too many unacknowledged system messages (e.g.watch
,Terminated
, remote actor deployment, failures of actors supervised by remote parent). Then the node needs to be moved to thedown
orremoved
states and the actor system must be restarted before it can join the cluster again.
- After gossip convergence a leader for the cluster can be determined.
- The leader is just a role, any node can be the leader and it can change between convergence rounds.
- There is no leader election process, the leader can always be recognized deterministically by any node whenever there is gossip convergence and is currently the first node in the sorted node ring.
- The role of the leader is to shift members in and out of the cluster, changing
joining
members to theup
state orexiting
members to theremoved
state. Currently leader actions are only triggered by receiving a new cluster state with gossip convergence.
- Delegated heartbeat ala SWIM to find ways around a network split.
- "Real" push/pull gossip with fine grained hashing of state and only shipping deltas, perhaps using Merkle trees.