tbg · June 5, 2015 18:02
diff --git a/gistfile1.txt b/gistfile1.txt
 The handling of the time signal in Cockroach is motivated by the fact that in
 Spanner, the target consistency is linearizability, so unrelated transactions must
 commit with (database) timestamps whose ordering reflects the order in which
 they committed in absolute (i.e. your wristwatch) time. For that to happen,
 Spanner basically just "waits" out the clock skew when committing a transaction
 (prior to returning to the client). Since they do that, they go to great
 lengths to synchronize their clocks and have a good grip on the actual maximal
 possible offset.

 Cockroach only shoots for serializability (though we do offer linearizability, if
 you're prepared to wait and get the offset down), so while transactions run
 with high isolation, you might be running T1 and then T2 hitting different
 parts of the cluster, and you might get back timestamps that suggest that T2
 committed before T1 when really it was the other way around. I have yet to see
 a use case where this matters (if you're running both transactions in a
 causally related way, there are simple ways to prevent this "anomaly"), and it
 means that the time signal is vastly less important than it is in Spanner.

 Now, to address your concerns:

 I don't see the time signal as having a large influence on network partitions. If
 you partition, only the majority remains functional anyway. If you have a
 specific concern, I'm happy to discuss it.

 The time signal matters mostly if you're trying to access a key at a certain
 database time, and there's data with a timestamp in the near future of where
 you want to read. Then you can't be sure whether that write happened in your
 absolute past or not, but you need to know if you want to be serializable.
 In Cockroach, your read will retry (increasing its timestamp so that the write
 is in its past), with some optimizations in place that are explained in our
 design doc (to keep you from restarting over and over on busy keys).
 That period of uncertainty is supplied by configuration (the epsilon, MaxOffset).
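
 To make the restart rule concrete, here is a minimal sketch. It is not
 Cockroach's code: readWithRestart and the integer timestamps are
 hypothetical, and keeping the upper bound of the uncertainty window fixed
 across restarts is my simplified reading of the optimization mentioned above.

```go
package main

import "fmt"

// readWithRestart simulates a read at readTS against the timestamps of
// existing writes on a key. A write in (readTS, limit] is "uncertain": it may
// have happened in our absolute past, so the read restarts at that write's
// timestamp. Hypothetical helper for illustration only.
func readWithRestart(readTS int64, writes []int64, maxOffset int64) (int64, int) {
	// Assumption: the upper bound is fixed when the read starts, so restarts
	// never widen the uncertainty window.
	limit := readTS + maxOffset
	restarts := 0
	for {
		bumped := false
		for _, w := range writes {
			if w > readTS && w <= limit {
				readTS = w // move the read past the uncertain write
				restarts++
				bumped = true
				break
			}
		}
		if !bumped {
			return readTS, restarts
		}
	}
}

func main() {
	writes := []int64{90, 150, 250, 400} // write timestamps on one key
	ts, restarts := readWithRestart(100, writes, 200)
	fmt.Println("final read timestamp:", ts, "restarts:", restarts)
}
```

 Writes above the fixed limit (400 here) are treated as being safely in the
 future, so the number of restarts is bounded by the writes inside the window.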

 This is going to be a little in-depth, but the basic message is: Cockroach
 trusts the MaxOffset, and if your clocks don't live up to the promise, you
 might get some stale reads. By the way, Spanner breaks in the same way if
 their clock offset bound (via their TrueTime API) fails them. But Spanner has
 to wait out the MaxOffset on every commit and we don't, so we get away with
 having it high enough for off-the-shelf clock synchronization and save you the
 atomic clocks, at similar guarantees. That's a very good deal.

 MaxOffset is high by default (200ms), but you can also set it manually.
 Either way, Cockroach takes that value as authoritative, meaning that at the end
 of the day the user needs to make sure it holds (with 200ms, that's hopefully
 not a real burden). Each node continuously measures the cluster offset, and a
 node that finds itself outside of the safe interval will stop
 participating in the action. That leaves a short interval of time in which a
 node may be exceeding MaxOffset without being aware of it.
 So what if it is? Well, then there's a chance you might serve some stale
 reads from that replica on busy keys (if you're behind and the future writes
 seem to be at a "safe distance" from the local point of view). That presupposes
 that you're leading the replica group, though, in which case you are proposing
 the writes (and you won't let them through if the timestamps within seem to
 disprove what you assume about the MaxOffset). So you'd likely only be able to
 see this during a leadership change. Now assume you don't really trust your
 clocks. What can you do? Well, you can bump up the MaxOffset as far as you
 like. What you pay is more restarts on busy keys when unsynchronized nodes
 work on them concurrently, but that may be perfectly acceptable for your
 workload (if you have really bad clocks, then chances are you're not running in
 production).