Created
June 5, 2015 18:02

The handling of the time signal in Cockroach is best understood in contrast to
Spanner. In Spanner, the target consistency is linearizability, so unrelated
transactions must commit with (database) timestamps whose ordering reflects the
order of their commits in absolute (i.e. your wrist watch) time. For that to
happen, Spanner basically just "waits out" the clock uncertainty when committing
a transaction (prior to returning to the client). Because of that wait, they go
to great lengths to synchronize their clocks and to keep a tight bound on the
maximum possible offset.
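
To make the "waiting out" concrete, here is a minimal Python sketch of a commit
wait. The names (`commit_wait`, `max_offset`, and the injectable `now` and
`sleep` callables) are made up for illustration and are not Spanner's API; the
sketch only shows the idea that a commit is acknowledged once its timestamp is
certain to lie in the past.

```python
import time


def commit_wait(commit_ts, max_offset, now=time.time, sleep=time.sleep):
    """Block until commit_ts is guaranteed to be in the past.

    Assumes the local clock is within max_offset seconds of true time, so
    once the local clock reads at least commit_ts + max_offset, true time
    has reached commit_ts. Returns the number of seconds spent waiting.
    """
    waited = 0.0
    while True:
        remaining = commit_ts + max_offset - now()
        if remaining <= 0:
            return waited
        sleep(remaining)
        waited += remaining
```

The cost is plain to see: every commit pays up to `max_offset` of extra latency,
which is why a small, well-known bound matters so much in that design.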

Cockroach only shoots for serializability (though we do offer linearizability,
if you're prepared to wait and get the offset down). So while transactions run
with high isolation, you might run T1 and then T2 against different parts of
the cluster and get back timestamps that suggest that T2 committed before T1
when really it was the other way around. I have yet to see a use case where
this matters (if you're running both transactions in a causally related way,
there are simple ways to prevent this "anomaly"), and it means that the time
signal is vastly less important than it is in Spanner.
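
One such simple way, sketched here under assumptions, is to hand T1's commit
timestamp to T2 and make T2 start no earlier than that. The function and its
`causal_token` parameter are hypothetical names for illustration; the text
above does not say which mechanism Cockroach actually uses.

```python
def start_timestamp(local_clock_ts, causal_token=None):
    """Pick a start timestamp (integer, e.g. nanoseconds) for a transaction.

    causal_token is the commit timestamp of an earlier transaction that this
    one must be ordered after (hypothetical parameter). Without a token, the
    local clock alone decides.
    """
    if causal_token is None:
        return local_clock_ts
    # Never start at or before the transaction we causally depend on.
    return max(local_clock_ts, causal_token + 1)
```

With this, a T2 that runs on a node whose clock lags behind still starts after
T1's commit timestamp, so the two timestamps come out in causal order.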

Now, to address your concerns:

I don't see the time signal as having a large influence on network partitions.
If you partition, only the majority side remains functional anyway. If you have
a specific concern, I'm happy to discuss it.

The time signal matters mostly if you're trying to read a key at a certain
database time and there's data with a timestamp in the near future of where
you want to read. Then you can't be sure whether that write happened in your
absolute past or not, but you need to know in order to be serializable.
In Cockroach, your read will retry (increasing its timestamp so that the write
is in its past), with some optimizations in place that are explained in our
design doc (to keep you from restarting over and over on busy keys).
The length of that period of uncertainty comes from configuration (the epsilon,
MaxOffset).
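
The read rule just described can be sketched as follows. This is a simplified
illustration, not Cockroach's code: the `read` function, the list-of-pairs
storage and the integer millisecond timestamps are assumptions, and the
optimizations from the design doc are left out.

```python
def read(versions, read_ts, max_offset):
    """Read one key at read_ts from its list of (timestamp, value) versions.

    Returns ("ok", value) if the read is safe, or ("restart", new_read_ts)
    if a write sits in the uncertainty window (read_ts, read_ts + max_offset].
    """
    uncertain = [ts for ts, _ in versions
                 if read_ts < ts <= read_ts + max_offset]
    if uncertain:
        # This write may really have happened before the read started, so
        # retry at a timestamp that puts it in the read's past.
        return ("restart", max(uncertain))
    visible = [(ts, value) for ts, value in versions if ts <= read_ts]
    if not visible:
        return ("ok", None)
    # Newest version at or below the read timestamp wins.
    return ("ok", max(visible, key=lambda pair: pair[0])[1])
```

For a key with versions at 100, 250 and 900 and a MaxOffset of 200, a read at
200 restarts at 250 and then returns the value written at 250; the write at 900
is far enough in the future to be ignored either way.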

This is going to be a little in-depth, but the basic message is: Cockroach
trusts the MaxOffset, and if your clocks don't live up to the promise, you
might get some stale reads. By the way, Spanner would break in the same way if
its clock bounds (via the TrueTime API) failed. But Spanner has to wait out its
equivalent of MaxOffset on every commit and we don't, so we get away with
setting it high enough for off-the-shelf clock synchronization and save you the
atomic clocks, at similar guarantees. That's a very good deal.

MaxOffset is high by default (200ms), and you can also set it manually.
Either way, Cockroach takes that value as authoritative, meaning that at the
end of the day the user needs to make sure it holds (with 200ms, that's
hopefully not a real burden). Each node continuously measures the cluster
offset, and a node that finds itself outside of the safe interval will stop
participating in the action. That leaves a short interval of time in which a
node may be exceeding MaxOffset without being aware of it.
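
A rough sketch of such a self-check, with everything in milliseconds. Both
functions are hypothetical: the round-trip estimate is the usual NTP-style
midpoint guess, and the "majority of peers" rule is an assumption made for the
example, not a description of how Cockroach decides.

```python
def estimate_offset(sent_at, remote_time, received_at):
    """Estimate a peer's clock offset from one round trip.

    Assumes the peer read its clock halfway through the round trip.
    """
    return remote_time - (sent_at + received_at) / 2


def should_keep_serving(offsets, max_offset):
    """Return True while most peers appear within max_offset of this node."""
    if not offsets:
        return True
    within = sum(1 for off in offsets if abs(off) <= max_offset)
    return within * 2 > len(offsets)
```

Because the measurements are only taken every so often, a node whose clock has
just jumped keeps passing this check until the next round, which is exactly the
short window of unawareness mentioned above.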

So what if it is? Well, then there's a chance you might be serving some stale
reads from that replica on busy keys (if you're behind and the future writes
seem at a "safe distance" from the local point of view). That presupposes,
though, that you're leading the replica group, in which case you are also
proposing the writes (and you won't let them through if their timestamps seem
to disprove what you assume about the MaxOffset), so you'd likely only be able
to see this during a leadership change.

Now assume you don't really trust your clocks. What can you do? Well, you can
bump up the MaxOffset as far as you like. What you pay for are more restarts on
busy keys when unsynchronized nodes work on them concurrently, but that may be
perfectly acceptable for your workload (if you have really bad clocks, then
chances are you're not running in production).
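
The leader-side check mentioned above (not letting writes through whose
timestamps contradict the MaxOffset assumption) could look roughly like this.
The function name and the exact rule are assumptions for illustration, with
timestamps in milliseconds.

```python
def accept_write(write_ts, local_now, max_offset):
    """Return True if a proposed write's timestamp is plausible.

    If all clocks are within max_offset of each other, no write can carry a
    timestamp more than max_offset ahead of the local clock. One that does
    means somebody's clock is off, so the write is refused.
    """
    return write_ts <= local_now + max_offset
```

It also shows the trade-off of a larger MaxOffset: the check becomes more
forgiving, while the uncertainty window for reads (and with it the number of
restarts) grows.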