Fixing MySQL group commit (part 2)

This is the second in a series of three articles about ideas for implementing full support for group commit in MariaDB. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. See also the third article.

Internally, InnoDB (and hence XtraDB) do support group commit. The way this works is seen in the innobase_commit() function. The work in this function is split into two parts. First, a “fast” part, which registers the commit in memory:

    trx->flush_log_later = TRUE;
    innobase_commit_low(trx);
    trx->flush_log_later = FALSE;

Second, a “slow” part, which writes and fsync’s the commit to disk to make it durable:

    trx_commit_complete_for_mysql(trx)

While one transaction is busy executing the “slow” part, any number of later transactions can complete their “fast” part, and queue up waiting for the running fsync() to finish. Once it does finish, a single fsync() of the log is now sufficient to complete the slow part for all of the queued-up transactions. This is how group commit works in InnoDB when the binary log is disabled.

When the binary log is enabled, MySQL uses XA/2-phase commit to ensure consistency between the binary log and the storage engine. This means that a commit now takes three parts:

    innobase_xa_prepare()
    write() and fsync() binary log
    innobase_commit()

Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB locks the prepare_commit_mutex in innobase_xa_prepare(), and does not release it until after the “fast” part of innobase_commit() has completed. This means that while one transaction is executing innobase_commit(), all subsequent transactions will be blocked inside innobase_xa_prepare() waiting for the mutex. As a result, no transactions can queue up to share an fsync(), and group commit is broken with the binary log enabled.

So, why does InnoDB hold the problematic prepare_commit_mutex across the binary logging? That turns out to be a really good question. After extensive research into the issue, it appears that in fact there is no good reason at all for the mutex to be held.

Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that taking the mutex is necessary to ensure that commits happen in the same order in InnoDB and in the binary log. This is certainly true; without taking the mutex we can have transaction A committed in InnoDB before transaction B, but B written to the binary log before transaction A.

But this just raises the next question: why is it necessary to ensure the same commit order in InnoDB and in the binary log? The only reason that I could find stated is that this is needed for InnoDB hot backup and XtraBackup to be able to extract the correct binary log position corresponding to the state of the engine contained in the backup.

Sergei Golubchik investigated this issue during the 2010 MySQL conference, inspired by the many discussions of group commit that took place there. It turns out that XtraDB does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position. This statement completely blocks the processing of commits until released, removing any possibility of different commit order in engine and binary log (InnoDB hot backup is closed source, so difficult to check, but presumably works in the same way). So there certainly is no need for holding the prepare_commit_mutex to ensure consistent binary log position for backups!

There is another popular way to do hot backups without using FLUSH TABLES WITH READ LOCK: LVM snapshots. But an LVM snapshot essentially runs the recovery algorithm at restore time. In this case, XA is used to ensure that engine and binary log are consistent at server start, eliminating any need to enforce same ordering of commits.

So it really seems that there just is no good reason for the prepare_commit_mutex mutex to exist in the first place. Unless someone can come up with a good explanation for why it should be needed, I am forced to conclude that we have lived with 5 years of broken group commit in MySQL solely because of incorrect hearsay about how things should work. Which is kind of sad, and suggest that no-one at MySQL or InnoDB ever cared sufficiently to take a serious look at this important issue.

(In order to get full group commit in MySQL there is another issue that needs to be solved. The current binary log code does not include implementation of group commit, so this also needs to be implemented. Such an implementation should be possible to do using standard techniques, and is independent of fixing of group commit in InnoDB).

This concludes the second part of the series, showing that group commit can be restored simply by removing the offending prepare_commit_mutex from InnoDB. The third and final article in the series will discuss some deeper issues that arise from looking into this part of the server code, and some interesting ideas for further improving things related to group commit.

8 comments

ext_232474 says:

April 23, 2010 at 13:23

prepare_commit_mutex is needed for xtrabackup/innodb hot backup in many cases

The prepare_commit_mutex is used to ensure that the binary log position inside of InnoDB is correct. This is used for xtrabackup/innodb hot backup if you do not use the extra perl script (which most people don’t if they don’t have MyISAM). As you mention the perl script sets a global lock and hence isn’t really a hot backup anymore, so people try to avoid it unless they have active MyISAM tables.

On recovery InnoDB outputs the position corresponding to the binary log and hence you can use that for setting up replication/point in time recovery, etc… Just removing the mutex breaks this capability.

LVM and other snapshots have a similar issue unless you also snapshot the binary logs at the same time as the InnoDB parts. This requires you to have the binary logs on the same volume, which isn’t always the case currently.

Reply
1. knielsen says:
  
  April 24, 2010 at 06:09
  
  Re: prepare_commit_mutex is needed for xtrabackup/innodb hot backup in many cases
  
  Thanks for your explanations.
  
  In any case, it seems there are numerous motivations for ensuring same commit order in engine and binlog, without breaking group commit, as discussed in the third article
  
  Reply
(Anonymous) says:

April 23, 2010 at 14:00

(Deleted comment)

Reply
1. knielsen says:
  
  April 24, 2010 at 06:10
  
  Re: prepare_commit_mutex or FLUSH TABLES WITH READ LOCK
  
  Agree, FLUSH TABLES WITH READ LOCK is to be avoided. I’m hoping something like I describe in the third article can solve the issue properly.
  
  Reply
xaprb says:

April 24, 2010 at 18:17

XtraDB doesn’t do FLUSH TABLES WITH READ LOCK

I think you should edit this part a little bit, as there is a bit of a mixup:

“It turns out that XtraDB does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position.”

Not so. XtraDB doesn’t do anything at all like this. If you meant XtraBackup, it also doesn’t… as Harrison pointed out, it relies on InnoDB’s recovery routines to find out the binary log position corresponding to the state of the backup.

What I think you mean is “innobackup[ex] does a …”

innobackup is the Perl script that wraps around InnoDB hot backup. innobackupex is a modified version of that, which wraps around XtraBackup.

The FLUSH TABLES WITH READ LOCK is used only when there’s a need to backup some MyISAM tables too (on by default, can be disabled with –no-lock when using innobackupex). You can see this in the code — innobackup and innobackupex are both open-source perl scripts.

So I would suggest editing it to

“It turns out that innobackup (and XtraDB’s version, innobackupex) does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position.”

Reply
knielsen says:

April 25, 2010 at 12:56

prepare_commit_mutex only needed briefly at backup snapshot time

BTW, even given that innobackup/xtrabackup itself does not do FLUSH TABLES WITH READ LOCK, and the mutex locking is needed for correct binary log position. This locking is only needed for a few seconds when the backup takes the final snapshot. It is not needed outside the backup or even for the whole duration of the backup.

So it would be sufficient to enable the mutex locking with a simple system variable just around the time where the snapshot is taken, and disable it again after, with only minimal impact on server performance.

Reply
(Anonymous) says:

November 5, 2012 at 08:59

I have one question about the question mentioned above:
why is it necessary to ensure the same commit order in InnoDB and in the binary log?
consider the following two sql:
A: update t1 set c1=c1+1 where id=1;
B: update t1 set c1=5 where id=1;
If the commit order is not sure, when A commit in innodb before B, but commit in binlog after B‘s commit.
then may cause inconsistence : master.t1.c1=5; slave.t1.c1=6; (assuming replication using statement-based replication)
Am I right?

Reply
1. knielsen says:
  
  November 5, 2012 at 09:36
  
  In your example, two transactions update the same row. Then a stronger requirement exists: the second transaction must wait until the first one has finished commit before it can even finish the statement itself, let alone start to commit. So you are right that this requires consistent commit order, but this is always true due to InnoDB row locking.
  
  The problem for group commit is more subtle, and it concerns when two transactions are completely independent and commit at the same time:
  
  A: UPDATE t1 SET b=b+1 WHERE id=1;
  B: UPDATE t1 SET c=c+1 WHERE id=2;
  
  The motivation is that we want to take a backup of the master and use it as a new slave, without blocking running transactions. XtraBackup can do this. When we restore the backup to the slave, it will restore all transactions up to some arbitrary point in time, but it will remember the position in the master binlog of the last transaction restored.
  
  If the commit order in InnoDB is the same as the binlog order, then we can configure the new slave simply with CHANGE MASTER TO and specify the position remembered in the backup.
  
  But if the commit order is not the same, then this does not work. Suppose A comes before B in binlog, but after in InnoDB commit order. Then perhaps our backup contains B but not A. Then we cannot start the slave replication from anywhere in the master binlog – we will either miss transaction A, or do transaction B twice.
  
  Reply

8 comments

Leave a Reply to knielsen Cancel reply