@rbranson · Created August 15, 2024 00:44
Some notes about MySQL's binlog_group_commit_sync_delay

I think binlog_group_commit_sync_delay=1000 is not what we want, and it's why sync_binlog=0 has such a big impact on replica throughput.

From the MySQL docs:

Controls how many microseconds the binary log commit waits before synchronizing the binary log file to disk. By default binlog_group_commit_sync_delay is set to 0, meaning that there is no delay. Setting binlog_group_commit_sync_delay to a microsecond delay enables more transactions to be synchronized together to disk at once, reducing the overall time to commit a group of transactions because the larger groups require fewer time units per group.

Setting binlog_group_commit_sync_delay can increase the number of parallel committing transactions on any server that has (or might have after a failover) a replica, and therefore can increase parallel execution on the replicas.
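For reference, this is a dynamic global variable, so it can be inspected and adjusted at runtime (standard MySQL 8.0 statements; the 1000 below mirrors the setting under discussion, not a recommendation):

```sql
-- Inspect the current delay (in microseconds) and its companion group-size cap.
SHOW GLOBAL VARIABLES LIKE 'binlog_group_commit_sync%';

-- Wait up to 1000 µs (1 ms) before each binlog fsync so commits can group up.
SET GLOBAL binlog_group_commit_sync_delay = 1000;

-- Back to the default: no artificial delay.
SET GLOBAL binlog_group_commit_sync_delay = 0;
```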

I set this to 1000 (1 ms) a few weeks ago in an attempt to increase parallel execution on the replicas. It didn't have a negative impact on query duration, and we did not observe replication lag at the time.

The group commit delay can increase the total transaction throughput on a database with a highly concurrent workload because it caps the number of I/O flush calls for the binlogs (at most one every binlog_group_commit_sync_delay microseconds). More importantly, it allows the sequencing metadata for transactions to be organized for more parallelism. Without inducing an artificial delay window, only transactions which happen to be waiting for sync at the same time will be taken into account.
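To put a number on that cap: a delay of d microseconds bounds the binlog fsync rate at 1,000,000 / d flushes per second. A quick sanity check using the 1000 µs setting (back-of-envelope arithmetic, nothing MySQL-specific):

```sql
-- Upper bound on binlog fsync calls per second imposed by a 1000 µs delay window.
SELECT 1000000 / 1000 AS max_binlog_fsyncs_per_sec;  -- 1000
```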

However, based on experimenting with this variable on the replica independently of the primary, I believe one of the problems is that this setting induces unwanted artificial stalls in the commits taking place on the replicas as well.

Multi-threaded replication assigns transactions in the relay logs to applier threads, partitioned by writeset-based conflict metadata. The relay log data for each transaction includes sequencing information indicating the transaction that precedes it in the commit-order timeline. Even though an applier thread might be able to begin executing a transaction's writes, it cannot actually commit until the previous transaction commits. So while many transactions can be executing concurrently, their final commit step must wait until the transactions "before" them commit first. While this ordering constraint does sound a bit like single-threaded replication, in reality, being able to exploit InnoDB's ability to efficiently execute many transactions in parallel is very worthwhile.
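For context, the knobs that drive this scheduling look roughly like the following (MySQL 8.0 variable names; older releases use the slave_* spellings, and the worker count here is purely illustrative):

```sql
-- Primary: tag binlog transactions with writeset-based dependency metadata.
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';

-- Replica: schedule appliers from that metadata while preserving commit order.
-- (Changing the parallel type requires the replication applier to be stopped.)
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 16;        -- illustrative worker count
SET GLOBAL replica_preserve_commit_order = ON;
```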

Generally speaking, the primary is doing more work, so it can be initially confusing why a replica can so easily fall behind it, but the replica is disadvantaged when it comes to parallel transaction execution. It can't begin executing a transaction until that transaction has committed on the primary. On the primary it's perfectly reasonable to have a transaction write thousands or millions of rows of data over many minutes while millions of other transactions overlap with it, and InnoDB provides an atomic and nearly instant commit of all that data at once.

However, once that big transaction hits a replica, any transactions "after" it must wait for it to complete and commit entirely, to preserve commit ordering. The sequencing of that big transaction isn't established until it commits on the primary, where it instantly appears at the correct point in the timeline. The replica then has to play catch-up -- it needs to make millions of rows appear atomically. It has no choice but to hold back other transactions to maintain the "illusion" that the transaction appeared at that point in time, while it applies all of the writes involved from start to finish.

That is an extreme example, but it illustrates the point that ordering constraints create stalls in multi-threaded replication. But what about binlog_group_commit_sync_delay? Well, it introduces a delay into the commit process, which also impacts replica commits, which trail the primary's. Every commit by the appliers must now wait for the delay window to pass too.

At least in theory, the benefit of group commit is that more transactions can be ordered together, which allows more concurrency, as the dependency chain of transactions becomes shorter and wider (more parallel!). However, a delay of 1000 micros might be one of the worst settings to achieve that in this particular case. With a write workload of about 2000 TPS, it's not really enough to allow transactions to accumulate and commit together in a beneficial way. The storage has plenty of I/O capacity, and writes to the binlogs can proceed while the fsync is pending. The group commit delay thwarts this by forcing each group into a fixed-size time window and slowing its progress, which in practice appears to outweigh the benefit of the small-scale group commit. Without the delay, the commit groups end up about the same size, but without the unnecessary artificial waits.
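The arithmetic bears this out: at ~2000 TPS, commits arrive roughly every 500 µs on average, so a 1000 µs window only catches about two of them (back-of-envelope only, using the workload numbers above):

```sql
SELECT
  1000000 / 2000        AS avg_us_between_commits,  -- ~500 µs at 2000 TPS
  2000 * 1000 / 1000000 AS expected_tx_per_window;  -- ~2 transactions per 1000 µs delay
```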

If binlog_group_commit_sync_delay is raised higher, it increases the size of the commit groups, but it induces perceptible delays to transaction commit at the client level. A delay of 10 millis is very noticeable! This may have a benefit for some workloads, but it's too workload-specific and a bit precarious to put in place generally speaking. I did notice that if it's increased only on the replica, the result is even worse: the commit groups are still small, but the artificial delay in the commit process on the replica is that much higher. Decrease this number to below the typical interval between commits (<= 500 micros) and the group sizes drop to 1 most of the time. That's probably fine in this case, however, as disk I/Os are not a limiting factor in the current configuration.
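Worth noting for anyone experimenting here: the companion variable binlog_group_commit_sync_no_delay_count cuts the wait short once enough transactions have queued, which can soften the client-visible latency of a larger delay (values below are illustrative, not a recommendation):

```sql
-- Wait up to 10 ms for a group to form, but flush early once 20
-- transactions are already queued for the sync.
SET GLOBAL binlog_group_commit_sync_delay = 10000;
SET GLOBAL binlog_group_commit_sync_no_delay_count = 20;
```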

Takeaways? If I/O is a bottleneck or there isn't enough parallelism being achieved in the multi-threaded replication pipeline, binlog_group_commit_sync_delay can be a useful tool, but it needs to be either low enough not to significantly block the replica's applier threads or high enough to group together enough transactions at once to make up for it. In the middle, it will just artificially limit the throughput of an applier, which is particularly noticeable during log catch-up on a restore or backup pod. The inverse Goldilocks. Given that, I actually struggle to come up with scenarios where it is a real improvement. Anecdotally, there are reports of it helping real-world use cases, but it's unclear what the calculus is behind them. These might even be a relic of pre-writeset-based dependency tracking.
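One way to gauge whether the applier pipeline is actually achieving parallelism is to watch the worker threads directly; performance_schema exposes per-worker state in MySQL 8.0:

```sql
-- Which applier workers are busy, and what each one is currently applying.
SELECT WORKER_ID, SERVICE_STATE, APPLYING_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;
```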
