This is the third and final article in a series about group commit in MySQL. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. The second article explained the part of the InnoDB code that is responsible for the problem.
So how do we fix group commit in MySQL? As we saw in the second article of this series, we can just eliminate the
prepare_commit_mutex from InnoDB, extend the binary logging to do group commit by itself, and that would solve the problem.
However, we might be able to do even better. As explained in the first article, with binary logging enabled we need XA to ensure consistency after a crash, and that requires to do three fsyncs for a commit. Even if each of those can be shared with other transactions using group commit, it is still expensive. During a discussion on the maria-developers@ mailing list, an idea came up for how to do this with only a single (shared)
fsync() for a commit.
The basic idea is to only do
fsync() for the binary log, not for the engine, corresponding to running with
innodb_flush_log_at_trx_commit set to 2 or even 0.
If we do this, we can end up in the following situation: some transaction A is written into the binary log, and
fsync() makes sure that is stored durably on disk. Then transaction A is committed in InnoDB. And before the operating system and hardware gets around to store the InnoDB part of A durably on disk, we get a crash.
Now on crash recovery, we will have A in the binary log, but in the engine A may be lost, causing an inconsistency. But this inconsistency can be resolved simply by re-playing the transaction A against InnoDB, using the data for A stored in the binary log. Just like it would normally be applied on a replication slave. After re-playing the transaction, we will again be in a consistent state.
In order to do this, we need two things:
- For each transaction, we need to store in the InnoDB engine information about which is the corresponding position in the binary log, so that at crash recovery we will know from which position in the binary log to start re-playing transactions from.
- We also need to ensure that the order of commits in the binary log and in InnoDB is the same! Otherwise, after a crash we could find ourselves in the situation that the binary log has transaction A followed by transaction B, while the InnoDB storage engine contains only transaction B committed, not transaction A. This would leave us with no reliable place in the binary log to start re-playing transactions from.
Now, for ensuring same commit order, we do not want to re-introduce the (by now) infamous
prepare_commit_mutex, as that would make it impossible to have group commit for the binary log. Instead we should use another way to ensure such order. There are several ways to do this. Mark Callaghan explained one such way to do this at the MySQL conference, described further in this article.
The basic idea is that when writing transactions into the binary log, we remember their ordering. We can do this by putting the transactions into a queue, by assigning them a global transaction id in monotonic sequence, or by assigning them some kind of ticket as Mark suggests. Then inside
innobase_commit(), transactions can coordinate with each other to make sure they go into the engine in the order dictated by the queue, global transaction id, or ticket.
I think I have a working idea for how to extend the storage engine API to be able to do this in a clean way for any transactional engine. We can Introduce an optional handler call
commit_fast() that is guaranteed to be called in the same order as transactions are written to the binary log, prior to the normal commit handler call. Basically it would be called under a binary log mutex. The idea is that
commit_fast() will contain the “fast” part of
innobase_commit(), as explained in the previous article. Then in
commit_fast(), the engine can do the assignment of a ticket or insertion into a queue, as needed.
I think possibly for symmetry we would want to also add a similar
xa_prepare_fast() handler call that would be invoked after the normal
xa_prepare() and similarly be guaranteed to be in the same order as binary log commit, though I need to consider this a bit more to fully make up my mind.
I believe such an addition to the storage engine API would allow to implement in a clean way for all engines the method of re-playing the binary log at crash recovery to avoid more than a single
fsync() at commit.
So this concludes the series. Using these ideas, I hope we will soon see patches for MySQL and MariaDB that greatly enhances the performance for durable and crash-safe commits, so that we can finally declare Peter’s original Bug#13669 for fixed!