This is the third and final article in a series about group commit in MySQL. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. The second article explained the part of the InnoDB code that is responsible for the problem.
So how do we fix group commit in MySQL? As we saw in the second article of this series, we could just eliminate the prepare_commit_mutex from InnoDB and extend the binary logging to do group commit by itself, and that would solve the problem.
However, we might be able to do even better. As explained in the first article, with binary logging enabled we need XA to ensure consistency after a crash, and that requires three fsync() calls for a commit. Even if each of those can be shared with other transactions using group commit, it is still expensive. During a discussion on the maria-developers@ mailing list, an idea came up for how to do this with only a single (shared) fsync() for a commit.
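To see where the three fsync() calls come from, here is a rough C++ sketch of the XA commit path with the binary log acting as transaction coordinator. All names are illustrative stand-ins for the real server code, not actual MySQL symbols:

```cpp
#include <cstdio>

// Illustrative stand-ins for the server's transaction object and the
// engine/binlog operations (hypothetical names, not real MySQL symbols).
struct Transaction { int id; };

static void engine_prepare(Transaction *) { std::puts("engine prepare"); }
static void engine_prepare_fsync()        { std::puts("fsync #1: engine log"); }
static void binlog_write(Transaction *)   { std::puts("binlog write"); }
static void binlog_fsync()                { std::puts("fsync #2: binary log"); }
static void engine_commit(Transaction *)  { std::puts("engine commit"); }
static void engine_commit_fsync()         { std::puts("fsync #3: engine log"); }

// XA commit with the binary log as coordinator: three separate syncs,
// each of which must be durable before the next step may proceed.
void xa_commit(Transaction *trx) {
  engine_prepare(trx);
  engine_prepare_fsync();   // the prepared state must survive a crash
  binlog_write(trx);
  binlog_fsync();           // the binlog record is the commit decision
  engine_commit(trx);
  engine_commit_fsync();    // finally make the engine commit durable
}

int main() { Transaction t{42}; xa_commit(&t); }
```

Each of the three fsync() calls can be shared among a group of transactions, but a commit group still pays three synchronous waits on the disk.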
The basic idea is to only do fsync() for the binary log, not for the engine, corresponding to running with innodb_flush_log_at_trx_commit set to 2 or even 0.
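In configuration terms this corresponds roughly to the following sketch; sync_binlog=1 makes the binary log durable at each (group) commit, while innodb_flush_log_at_trx_commit=2 writes the InnoDB log at commit but leaves the actual flush to the once-per-second background sync:

```ini
[mysqld]
log-bin = master-bin
sync_binlog = 1                     # fsync() the binary log at each (group) commit
innodb_flush_log_at_trx_commit = 2  # write the InnoDB log at commit, flush lazily
```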
If we do this, we can end up in the following situation: some transaction A is written into the binary log, and fsync() makes sure it is stored durably on disk. Then transaction A is committed in InnoDB. And before the operating system and hardware get around to storing the InnoDB part of A durably on disk, we get a crash.
Now on crash recovery, we will have A in the binary log, but in the engine A may be lost, causing an inconsistency. This inconsistency can be resolved simply by re-playing transaction A against InnoDB, using the data for A stored in the binary log, just as it would normally be applied on a replication slave. After re-playing the transaction, we are again in a consistent state.
In order to do this, we need two things:
- For each transaction, we need to store in the InnoDB engine the corresponding position in the binary log, so that at crash recovery we know from which point in the binary log to start re-playing transactions.
- We also need to ensure that the order of commits in the binary log and in InnoDB is the same! Otherwise, after a crash we could find ourselves in the situation that the binary log has transaction A followed by transaction B, while the InnoDB storage engine contains only transaction B committed, not transaction A. This would leave us with no reliable place in the binary log from which to start re-playing transactions (see the recovery sketch just after this list).
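To make the recovery idea concrete, here is a minimal C++ sketch of the replay logic, assuming the engine can report the binary log position of its last durable commit (in InnoDB this position could be persisted through the redo log, for example in the transaction system header). All names and types are hypothetical, for illustration only:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical types: a binlog position and a replayable binlog event.
struct BinlogPos   { std::string file; uint64_t offset; };
struct BinlogEvent { BinlogPos pos; std::string data; };

// In-memory stand-ins for the binary log and for the position the engine
// stored with its last durable commit. Here transaction A (offset 100) is
// durable in the engine, while B (offset 200) was lost in the crash.
static std::vector<BinlogEvent> g_binlog = {
  {{"master-bin.000001", 100}, "trx A"},
  {{"master-bin.000001", 200}, "trx B"},
};
static BinlogPos g_engine_pos = {"master-bin.000001", 100};

static void engine_replay(const BinlogEvent &ev) {
  std::printf("re-playing %s at %s:%llu\n", ev.data.c_str(),
              ev.pos.file.c_str(), (unsigned long long)ev.pos.offset);
}

// Crash recovery: everything in the binlog *after* the engine's stored
// position is re-applied, just as a replication slave would apply it.
// This is only correct if engine commit order matches binlog order; for
// simplicity the sketch assumes a single binlog file.
void recover() {
  for (const BinlogEvent &ev : g_binlog)
    if (ev.pos.offset > g_engine_pos.offset) engine_replay(ev);
}

int main() { recover(); }
```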
Now, for ensuring the same commit order, we do not want to re-introduce the (by now) infamous prepare_commit_mutex, as that would make it impossible to have group commit for the binary log. Instead we should ensure the ordering in some other way. There are several ways to do this; Mark Callaghan explained one of them at the MySQL conference, described further in this article.
The basic idea is that when writing transactions into the binary log, we remember their ordering. We can do this by putting the transactions into a queue, by assigning them a global transaction id in monotonic sequence, or by assigning them some kind of ticket, as Mark suggests. Then inside innobase_commit(), transactions can coordinate with each other to make sure they go into the engine in the order dictated by the queue, global transaction id, or ticket.
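As an illustration, here is a minimal C++ sketch of the ticket variant (my reading of the idea, with hypothetical names): the ticket is taken under the binary log mutex, so tickets follow binlog write order, and the engine commit then waits for its turn:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal ticket-based commit ordering (hypothetical, for illustration).
class CommitOrderer {
  std::mutex mtx_;
  std::condition_variable cv_;
  uint64_t next_ticket_ = 0;   // handed out in binlog write order
  uint64_t now_serving_ = 0;   // next ticket allowed into the engine

 public:
  // Called while holding the binlog mutex, so tickets follow binlog order.
  uint64_t take_ticket() {
    std::lock_guard<std::mutex> lk(mtx_);
    return next_ticket_++;
  }

  // Called from the engine commit: block until it is this ticket's turn.
  void wait_for_turn(uint64_t ticket) {
    std::unique_lock<std::mutex> lk(mtx_);
    cv_.wait(lk, [&] { return now_serving_ == ticket; });
  }

  // Called once the ordered part of the engine commit is done.
  void done(uint64_t ticket) {
    std::lock_guard<std::mutex> lk(mtx_);
    now_serving_ = ticket + 1;
    cv_.notify_all();
  }
};
```

Unlike holding a single mutex across both the binlog write and the engine commit, this only serializes the cheap, order-sensitive step; the binlog group commit itself is never blocked waiting on the engine.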
I think I have a working idea for how to extend the storage engine API to be able to do this in a clean way for any transactional engine. We can introduce an optional handler call commit_fast() that is guaranteed to be called in the same order as transactions are written to the binary log, prior to the normal commit handler call. Basically, it would be called under a binary log mutex. The idea is that commit_fast() will contain the “fast” part of innobase_commit(), as explained in the previous article. Then in commit_fast(), the engine can do the assignment of a ticket or insertion into a queue, as needed.
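In code, the extension might look something like this sketch; the shape loosely mimics the server's handlerton structure, but commit_fast is the new, proposed call and everything here is illustrative rather than an existing API:

```cpp
// Sketch of the proposed storage engine API extension (illustrative only;
// simplified signatures, not the real handlerton definition).
struct THD;  // connection/transaction context, as in the server

struct handlerton_ext {
  // Existing calls (simplified).
  int (*prepare)(handlerton_ext *hton, THD *thd, bool all);
  int (*commit)(handlerton_ext *hton, THD *thd, bool all);

  // New optional call, invoked under the binary log mutex immediately
  // after the transaction is written to the binlog, and therefore in
  // binlog commit order. The engine does only the cheap, order-sensitive
  // work here (take a ticket, enter a queue); the expensive part stays
  // in commit(), which is then free to run in any order, in parallel.
  int (*commit_fast)(handlerton_ext *hton, THD *thd, bool all);
};
```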
I think possibly for symmetry we would want to also add a similar xa_prepare_fast() handler call that would be invoked after the normal xa_prepare() and similarly be guaranteed to be called in the same order as binary log commit, though I need to consider this a bit more to fully make up my mind.
I believe such an addition to the storage engine API would make it possible to implement, in a clean way for all engines, the method of re-playing the binary log at crash recovery, avoiding more than a single fsync() at commit.
So this concludes the series. Using these ideas, I hope we will soon see patches for MySQL and MariaDB that greatly enhance the performance of durable and crash-safe commits, so that we can finally declare Peter’s original Bug#13669 fixed!
Probably not 100% trustworthy.
I’m actually a bit scared of this approach. I might be over-reacting, but I distrust binary logs as a way to reliably replay what happened on a system. I’ve observed so many edge cases in production that cause different results.
I would be more in favor of something like this: if a transaction isn’t in InnoDB’s transaction log, discard it and roll back the state of the whole system to the last committed transaction inside InnoDB. Forget about what’s in the binary log (or just truncate it).
I know this is not the right solution, and the binary log replay idea is conceptually the right approach for group commit across all engines. But the binary log is not ACID, results aren’t always repeatable, and I trust InnoDB a lot more than I trust the binary log. The binary log, whether row-based or statement-based, is logical. InnoDB’s redo logs are physical. I trust that.
Maybe the user can be offered the choice in a configuration option.
Re: Probably not 100% trustworthy.
It is true that at least statement-based binlogging is not 100% reliable.
But, if one is using replication, one is already relying on the binary log. In that case, truncating the binary log to the latest InnoDB snapshot is not safe, as a slave may already have applied the transactions being truncated. It is then necessary to apply the transactions on the master as well, to ensure consistency between master and slave.
If not using replication, then truncating the binary log instead of re-playing transactions against the engine should work. But … I ask myself, if one does not trust the binary log, why would one care about ensuring consistency between binary log and engine, or even enable the binary log at all?