The specific issue we care about is previously written data silently being corrupted by the disk (e.g., bit rot).
Consider a Paxos+log or Raft-style approach. Once a value is committed from the log to the replicated state/DB, it's assumed to be stable. But there's no guarantee the disk doesn't lose that data later. While the current approach in Riak doesn't use a traditional propose/commit log, the underlying problem is the same.
Example of the problem we want to avoid:
We have 3 replicas, A/B/C, with the following committed state (no other operations in flight):
A: a=100, b=200 (currently offline/partitioned)
B: a=100, b=300
C: a=100, b=300
Disk corruption on node C causes it to silently lose key 'b' (undetected):
A: a=100, b=200 (currently offline/partitioned)
B: a=100, b=300
C: a=100
Node B goes offline/becomes partitioned, node A comes back online/unpartitioned:
A: a=100, b=200
B: a=100, b=300 (currently offline/partitioned)
C: a=100
The consensus system sees a majority of nodes (A+C) and can bring them both current. But regardless of whether it picks A or C as the correct state, neither node has any record of 'b' being 300. So, reading 'b' either returns 'not found' or '200'.
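To make that concrete, here's a toy sketch in Python (my own illustration, not Riak code) that replays the scenario: once B is offline, no majority made up of only A and C can ever produce b=300 again.

```python
# Toy model of the scenario above (hypothetical, not Riak code).
# Each replica is just a dict of committed key/value pairs.
replicas = {
    "A": {"a": 100, "b": 200},  # stale: was offline when b=300 was committed
    "B": {"a": 100, "b": 300},  # up to date, but currently offline
    "C": {"a": 100},            # silently lost 'b' to disk corruption
}

online = ["A", "C"]  # a majority of the 3-node group

def possible_reads(key):
    """Values a majority-only consensus over the online nodes could return."""
    return {replicas[node].get(key, "not found") for node in online}

print(possible_reads("b"))  # 200 and 'not found' -- the committed 300 is unrecoverable
```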
This is similar to a byzantine paxos scenario. In byzantine paxos, peers must wait to hear from m+1 peers before proceeding, where m is the number of potentially faulty peers. For this case where m=1, each peer would need to hear from 2 peers other than itself, and thus the consensus group is unavailable unless all three nodes A/B/C are online -- which would be correct here, since once node B is online, we have a node with the correct most recent state.
For Riak, the approach is similar, except the plan is to not check for faults on every request. So, we're trading some risk for performance.
When a peer reboots, it is initially not trusted. It only becomes trusted after syncing with a majority of other peers. This syncing uses Riak's built-in active anti-entropy (AAE). The AAE approach allows for very fast syncing in the common case, since in Riak AAE trees are kept up to date in real time. However, AAE trees themselves could also become corrupted, or could hold valid hashes for K/V data that has since become corrupted on disk. Riak handles this by discarding AAE trees every week and rebuilding them by scanning over all on-disk K/V data. Thus, any corruption that occurs is detected within a week.
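As a rough illustration of why hash-tree syncing is fast in the common case, here is a hedged sketch assuming a simple two-level hash exchange (real AAE trees use their own segmentation and exchange protocol): peers compare root hashes, and only drill down into buckets that disagree.

```python
import hashlib

def bucket_of(key, n_buckets=64):
    """Assign a key to a hash-tree bucket (simplified stand-in for AAE segments)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n_buckets

def build_tree(kv, n_buckets=64):
    """Build per-bucket hashes plus a root hash over the whole keyspace."""
    buckets = {}
    for key, value in sorted(kv.items()):
        h = hashlib.sha256(f"{key}={value}".encode()).hexdigest()
        buckets.setdefault(bucket_of(key, n_buckets), []).append(h)
    bucket_hashes = {b: hashlib.sha256("".join(hs).encode()).hexdigest()
                     for b, hs in buckets.items()}
    root = hashlib.sha256(
        "".join(bucket_hashes[b] for b in sorted(bucket_hashes)).encode()
    ).hexdigest()
    return root, bucket_hashes

def divergent_buckets(local_kv, remote_kv):
    """Buckets whose hashes differ; only these need key-level exchange."""
    local_root, local_b = build_tree(local_kv)
    remote_root, remote_b = build_tree(remote_kv)
    if local_root == remote_root:
        return []  # common case: trees match, nothing to sync
    all_buckets = set(local_b) | set(remote_b)
    return [b for b in all_buckets if local_b.get(b) != remote_b.get(b)]
```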
Once a peer syncs with other peers, it becomes trusted and can take part in consensus.
There are a few issues here though. Since corruption is silent, we can't distinguish faulty nodes from non-faulty ones. So, we always assume a node is faulty after a restart until it first syncs up.
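A minimal sketch of that trust rule (hypothetical names; the real peer logic is more involved): a peer starts untrusted after any restart and only rejoins consensus once it has synced with a majority of the other peers in its ensemble.

```python
class Peer:
    """Toy model of the 'untrusted until synced' rule (hypothetical, not Riak code)."""

    def __init__(self, name, ensemble_size):
        self.name = name
        self.other_peers = ensemble_size - 1
        self.trusted = False  # every restart begins untrusted

    def record_sync(self, peers_synced_with):
        # Trust is only restored after an AAE-style sync with a majority
        # of the *other* peers in the ensemble.
        needed = self.other_peers // 2 + 1
        if len(peers_synced_with) >= needed:
            self.trusted = True

    def can_participate(self):
        """Only trusted peers take part in consensus rounds."""
        return self.trusted
```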
Requiring this syncing reduces liveness. Normal paxos only needs f+1 nodes online to reach quorum; now we need f+m+1. Normally, a 3-node cluster can tolerate 1 node being offline and still be available. But we can't do that if we assume one of the 2 online nodes could be faulty; we need all 3 nodes online. Similarly, for 5 nodes, we need 4 out of 5 nodes to be online at all times to tolerate 1 online-but-corrupted node. It's not until we have a 7-node cluster that we can tolerate 2 offline nodes and 1 faulty node. If we assume more than 1 corrupted node, things are even worse.
So, things become more expensive: we need larger clusters. It's rather annoying.
But I'd rather be safe than sorry. We've had users who hit silent data corruption. We've had users with self-inflicted corruption (a partial overwrite of some data from a backup, but not all data).
The current plan is to ship with configurable "paranoia": trust disks (no syncing needed, always available with a simple majority); don't trust K/V data but trust AAE trees (requires syncing); or don't trust anything (requires discarding AAE trees, scanning over the entire K/V data to rebuild them, and then syncing).
The default would probably be the middle option: don't trust K/V but trust AAE.
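As a rough sketch of what those levels imply at peer startup (hypothetical names only, not the actual Riak configuration):

```python
from enum import Enum

class Paranoia(Enum):
    """Hypothetical names for the three trust levels described above."""
    TRUST_DISKS = 1     # no syncing; a simple majority is enough
    TRUST_AAE = 2       # likely default: sync using existing AAE trees
    TRUST_NOTHING = 3   # rebuild AAE trees from K/V data, then sync

def startup_steps(level):
    """What a rebooted peer would have to do before rejoining consensus."""
    if level is Paranoia.TRUST_DISKS:
        return ["join quorum immediately"]
    if level is Paranoia.TRUST_AAE:
        return ["sync with a majority of peers using AAE trees", "join quorum"]
    return ["discard AAE trees",
            "rescan all on-disk K/V data to rebuild trees",
            "sync with a majority of peers",
            "join quorum"]
```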
I'm not sure users will be thrilled to hear they need 5 or 7 replicas, and I imagine many will just move to the "trusted" option. But, we'll see.
In the problem description (nodes A, B, C), with the only healthy and up-to-date node B down, you say reading 'b' returns either 200 or 'not found'. But if you're using a consensus protocol, then since A is not as up-to-date as C, A cannot be elected leader, so the read of 'b' should always be serviced by C. In that case, if you have checksummed data you won't return 'not found'; you'll error out and either give up leadership (in which case the cluster becomes unavailable) or maybe just fail all accesses to that key. What am I missing here?