The specific issue we care about is previously written data silently being corrupted by the disk (e.g., bit rot).
Consider a Paxos+log or Raft-style approach. Once a value is committed from the log to the replicated state/DB, it's assumed to be stable. But there's no guarantee the disk doesn't lose that data later. While the current approach in Riak doesn't use a traditional propose/commit log, the underlying problem is the same.
Example of the problem we want to avoid:
We have 3 replicas, A/B/C, with the following committed state (no other operations in flight):
A: a=100, b=200 (currently offline/partitioned)
B: a=100, b=300
C: a=100, b=300
Disk corruption on node C causes it to silently lose key 'b' (undetected):
A: a=100, b=200 (currently offline/partitioned)
B: a=100, b=300
C: a=100
Node B goes offline/becomes partitioned, node A comes back online/unpartitioned:
A: a=100, b=200
B: a=100, b=300 (currently offline/partitioned)
C: a=100
The consensus system sees a majority of nodes (A+C) and can bring them both current. But regardless of whether it picks A or C as the correct state, neither node has any record of 'b' being 300. So, reading 'b' either returns 'not found' or '200'.
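To make that concrete, here's a toy sketch in Python (my own illustration, not Riak code) that replays the scenario: once B is offline, no majority made up of only A and C can ever produce b=300 again.

```python
# Toy model of the scenario above (hypothetical, not Riak code).
# Each replica is just a dict of committed key/value pairs.
replicas = {
    "A": {"a": 100, "b": 200},  # stale: was offline when b=300 was committed
    "B": {"a": 100, "b": 300},  # up to date, but currently offline
    "C": {"a": 100},            # silently lost 'b' to disk corruption
}

online = ["A", "C"]  # a majority of the 3-node group

def possible_reads(key):
    """Values a majority-only consensus over the online nodes could return."""
    return {replicas[node].get(key, "not found") for node in online}

print(possible_reads("b"))  # 200 and 'not found' -- the committed 300 is unrecoverable
```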
This is similar to a byzantine paxos scenario. In byzantine paxos, peers must wait to hear from m+1 peers before proceeding, where m is the number of potentially faulty peers. For this case where m=1, each peer would need to hear from 2 peers other than itself, and thus the consensus group is unavailable unless all three nodes A/B/C are online -- which would be correct here, since once node B is online, we have a node with the correct most recent state.
For Riak, the approach is similar, except the plan is to not check for faults on every request. So, we're trading some risk for performance.
When a peer reboots, it is initially not trusted. It only becomes trusted after syncing with a majority of other peers. This syncing uses Riak's built-in active anti-entropy (AAE). The AAE approach allows for very fast syncing in the common case, since in Riak AAE trees are kept up to date in real time. However, AAE trees themselves could also become corrupted, or could hold valid hashes for K/V data that has since become corrupted on disk. Riak handles this by discarding AAE trees every week and rebuilding them by scanning over all on-disk K/V data. Thus, any corruption that occurs is detected within a week.
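As a rough illustration of why hash-tree syncing is fast in the common case, here is a hedged sketch assuming a simple two-level hash exchange (real AAE trees use their own segmentation and exchange protocol): peers compare root hashes, and only drill down into buckets that disagree.

```python
import hashlib

def bucket_of(key, n_buckets=64):
    """Assign a key to a hash-tree bucket (simplified stand-in for AAE segments)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n_buckets

def build_tree(kv, n_buckets=64):
    """Build per-bucket hashes plus a root hash over the whole keyspace."""
    buckets = {}
    for key, value in sorted(kv.items()):
        h = hashlib.sha256(f"{key}={value}".encode()).hexdigest()
        buckets.setdefault(bucket_of(key, n_buckets), []).append(h)
    bucket_hashes = {b: hashlib.sha256("".join(hs).encode()).hexdigest()
                     for b, hs in buckets.items()}
    root = hashlib.sha256(
        "".join(bucket_hashes[b] for b in sorted(bucket_hashes)).encode()
    ).hexdigest()
    return root, bucket_hashes

def divergent_buckets(local_kv, remote_kv):
    """Buckets whose hashes differ; only these need key-level exchange."""
    local_root, local_b = build_tree(local_kv)
    remote_root, remote_b = build_tree(remote_kv)
    if local_root == remote_root:
        return []  # common case: trees match, nothing to sync
    all_buckets = set(local_b) | set(remote_b)
    return [b for b in all_buckets if local_b.get(b) != remote_b.get(b)]
```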
Once a peer syncs with other peers, it becomes trusted and can take part in consensus.
There are a few issues here though. Since corruption is silent, we can't distinguish faulty nodes from non-faulty ones. So, we always assume a node is faulty after a restart until it first syncs up.
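A minimal sketch of that trust rule (hypothetical names; the real peer logic is more involved): a peer starts untrusted after any restart and only rejoins consensus once it has synced with a majority of the other peers in its ensemble.

```python
class Peer:
    """Toy model of the 'untrusted until synced' rule (hypothetical, not Riak code)."""

    def __init__(self, name, ensemble_size):
        self.name = name
        self.other_peers = ensemble_size - 1
        self.trusted = False  # every restart begins untrusted

    def record_sync(self, peers_synced_with):
        # Trust is only restored after an AAE-style sync with a majority
        # of the *other* peers in the ensemble.
        needed = self.other_peers // 2 + 1
        if len(peers_synced_with) >= needed:
            self.trusted = True

    def can_participate(self):
        """Only trusted peers take part in consensus rounds."""
        return self.trusted
```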
Requiring this syncing reduces liveness. Normal paxos only needs f+1 nodes online to reach quorum; now we need f+m+1. Normally, a 3-node cluster can tolerate 1 node being offline and still be available. But we can't do that if we assume one of the 2 online nodes could be faulty; we need all 3 nodes online. Similarly, for 5 nodes, we need 4 out of 5 nodes to be online at all times to tolerate 1 online-but-corrupted node. It's not until we have a 7-node cluster that we can tolerate 2 offline nodes and 1 faulty node. If we assume more than 1 corrupted node, things are even worse.
So, things become more expensive: we need larger clusters. It's rather annoying.
But I'd rather be safe than sorry. We've had users who hit silent data corruption. We've had users with self-inflicted corruption (a partial overwrite of some data from a backup, but not all data).
The current plan is to ship with configurable "paranoia": trust disks (no syncing needed, always available with a simple majority); don't trust K/V data but trust AAE trees (requires syncing); or don't trust anything (requires discarding AAE trees, scanning over the entire K/V data to rebuild them, and then syncing).
The default would probably be the middle option: don't trust K/V but trust AAE.
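As a rough sketch of what those levels imply at peer startup (hypothetical names only, not the actual Riak configuration):

```python
from enum import Enum

class Paranoia(Enum):
    """Hypothetical names for the three trust levels described above."""
    TRUST_DISKS = 1     # no syncing; a simple majority is enough
    TRUST_AAE = 2       # likely default: sync using existing AAE trees
    TRUST_NOTHING = 3   # rebuild AAE trees from K/V data, then sync

def startup_steps(level):
    """What a rebooted peer would have to do before rejoining consensus."""
    if level is Paranoia.TRUST_DISKS:
        return ["join quorum immediately"]
    if level is Paranoia.TRUST_AAE:
        return ["sync with a majority of peers using AAE trees", "join quorum"]
    return ["discard AAE trees",
            "rescan all on-disk K/V data to rebuild trees",
            "sync with a majority of peers",
            "join quorum"]
```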
I'm not sure users will be thrilled to hear they need 5 or 7 replicas, and I imagine many will just move to the "trusted" option. But, we'll see.
In the problem description (nodes A, B, C), with the only healthy and up-to-date node B down, you say reading 'b' returns either 200 or 'not found'. But if you're using a consensus protocol, then since A is not as up-to-date as C, A cannot be elected leader, so the read of 'b' should always be serviced by C. In that case, if you have checksummed data you won't return 'not found'; you'll error out and either give up leadership (in which case the cluster becomes unavailable) or maybe just fail all accesses to that key. What am I missing here?