1.0-cluster-changes
Given that 1.0 prerelease packages are now available, I wanted to briefly mention some changes to Riak's clustering capabilities in 1.0. In particular, there are some subtle semantic differences in the riak-admin commands. More complete docs will be updated in the near future, but I hope a quick email suffices for now.
[nodeB/riak-admin join nodeA] is now strictly one-way. It joins nodeB to the cluster that nodeA is a member of. This is semantically different than pre-1.0 Riak in which join essentially joined clusters together rather than joined a node to a cluster. As part of this change, the joining node (nodeB in this case) must be a singleton (1-node) cluster.
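For example (node names here are placeholders), joining two fresh singleton nodes to the cluster that nodeA belongs to looks like this, with each join issued from the node being added:

  nodeB/riak-admin join nodeA
  nodeC/riak-admin join nodeA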
In pre-1.0, leave and remove were essentially the same operation, with leave just being an alias for 'remove this-node'. This has changed. Leave and remove are now very different operations.
[nodeB/riak-admin leave] is the only safe way to have a node leave the cluster, and it must be executed by the node that you want to remove. In this case, nodeB will begin leaving the cluster, but will not actually leave until it has handed off all its data. Even if nodeB is restarted (crashed/shutdown/whatever), it will remain in the leave state and continue handing off partitions until done. After handoff, it will leave the cluster and eventually shut down.
[nodeA/riak-admin remove nodeB] immediately removes nodeB from the cluster without handing off its data. All replicas held by nodeB are therefore lost and will need to be regenerated through read-repair. Use this command carefully: it is intended for nodes that are permanently unrecoverable, for which handoff therefore makes no sense. By the final 1.0 release, this command may be renamed "force-remove" just to make the distinction clear.
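To make the distinction concrete (again, placeholder node names), compare a graceful retirement of nodeB with a forced removal issued from a surviving node nodeA:

  # Graceful: nodeB hands off all of its data, then leaves and shuts down.
  nodeB/riak-admin leave

  # Forced: nodeB is dropped immediately; its replicas must be rebuilt via read-repair.
  nodeA/riak-admin remove nodeB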
There are now two new commands that provide additional insight into the cluster: [riak-admin member_status] and [riak-admin ring_status].
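Both simply report status and can be run from any node in the cluster, for example:

  nodeA/riak-admin member_status
  nodeA/riak-admin ring_status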
Underneath, the clustering protocol has been largely rewritten. The new approach has the following advantages:
1. It is no longer necessary to wait on [riak-admin ringready] in between adding/removing nodes from the cluster, and adding/removing is also much more sound/graceful. Starting up 16 nodes and issuing [nodeX: riak-admin join node1] for X=1:16 should just work (see the sketch after this list).
2. Data is first transferred to new partition owners before handing over partition ownership. This change fixes numerous bugs, such as 404s/not_founds during ownership changes. The Ring/Pending columns in [riak-admin member_status] visualize this at a high level, and the full transfer status in [riak-admin ring_status] provides additional insight.
3. All partition ownership decisions are now made by a single node in the cluster (the claimant). Any node can be the claimant, and the duty is automatically taken over if the previous claimant is removed from the cluster. [riak-admin member_status] will list the current claimant.
4. Handoff related to ownership changes can now occur under load; hinted handoff still only occurs when a vnode is inactive. This change allows a cluster to scale up/down under load, although this needs to be further benchmarked and tuned before 1.0.
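As a quick sketch of the bring-up mentioned in point 1 (host names are placeholders), after starting Riak on node1 through node16 you can simply issue the joins back to back, with no [riak-admin ringready] checks in between:

  node2/riak-admin join node1
  node3/riak-admin join node1
  ...
  node16/riak-admin join node1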
To support all of the above, a new limitation has been introduced. Cluster changes (member addition/removal, ring rebalance, etc.) can only occur when all nodes are up and reachable. [riak-admin ring_status] will complain when this is not the case. If a node is down, you must issue [riak-admin down <node>] to mark the node as down, and the remaining nodes will then proceed to converge as usual. Once the down node comes back online, it will automatically re-integrate into the cluster. However, there is nothing preventing client requests from being served by a down node before it re-integrates. Before issuing [down <node>], make sure to update your load balancers / connection pools to not include this node. Future releases of Riak may make offlining a node an automatic operation, but it's a user-initiated action in 1.0.
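For example (placeholder node names), if nodeC crashes while cluster changes are pending, a surviving node can mark it down so the rest of the cluster converges:

  # First, make sure load balancers / connection pools no longer point at nodeC.
  # Then mark the unreachable node as down from any reachable node:
  nodeA/riak-admin down nodeC

  # Re-check the ring state afterwards:
  nodeA/riak-admin ring_status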