New binlog implementation in MariaDB 12.3

I have recently completed a large project to implement a new improved binlog format for MariaDB. The result will be available shortly in the upcoming MariaDB 12.3.1 release.

In this article, I will give a short overview of the new binlog implementation. For more details, see the documentation, which is in the source tree as the file Docs/replication/binlog.md, or here: https://github.com/MariaDB/server/blob/knielsen_binlog_in_engine/Docs/replication/binlog.md

Using the new binlog

To enable the new binlog, configure the MariaDB server with binlog_storage_engine=innodb.

Additionally, the binlog itself must be enabled as usual with the log_bin option. Note that no argument can be given to log_bin: with the old binlog format the argument set the name to use for the binlog files, but the new binlog file names are fixed, so accepting an argument would only cause confusion.

    binlog_storage_engine=innodb
    log_bin

Once the new binlog is enabled and the server is restarted, any old binlog files are no longer available. See the above-referenced documentation for options on how to migrate old binlogs of an existing server.

Benefits of the new binlog

For the user, the new binlog format brings two main benefits.

First, for users running with --innodb-flush-log-at-trx-commit set to 2 or 0 for performance reasons, the new binlog makes the binlog crash-safe (when used with InnoDB tables). This means that if the server crashes or the machine loses power, the restarted server will recover into a consistent state, including the state of replication and consistency between the binlog and the InnoDB table contents. With the old binlog format, such a crash could easily leave the binlog in a different state than the InnoDB table data, which then causes replication slaves to diverge from the master. Making the old binlog crash-safe required setting both --sync-binlog=1 and --innodb-flush-log-at-trx-commit=1.

Second, for users running with --innodb-flush-log-at-trx-commit set to 1 because they need durable commits, the new binlog greatly reduces the time taken to commit. Because the new binlog is integrated with InnoDB, only half as many buffer flushes to disk are needed per commit as with the old binlog.

Thus, the primary user-visible benefit of the new binlog is much faster transaction commits.

The speedup that will be obtained will be completely dependent on the actual workload of the application and on the hardware used for running the database. The speedup will be greater when transactions are small; when the transaction parallelism is modest; and when disk writes have a higher latency (like consumer-grade SSDs or network-attached storage). This is because the new binlog particularly reduces the amount of disk writes that have to happen during commit of a batch of parallel transactions. So if there are many small individual transactions and writes are expensive, the speedup can be huge. If there are few individual transactions, most transactions run in parallel and batch up in a single group commit, and/or disk writes are fast, the speedup will be smaller (but can still be significant).
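
To get a feel for how these factors interact, here is a deliberately crude model in Python. It is not a benchmark and not taken from the MariaDB source. The function names, the latency figures and the per-transaction work cost are all made-up assumptions, and the model only counts synchronous flushes (two per group commit for the old binlog, one for the new), so it cannot show a ratio above 2 and ignores every other saving.

```python
import math


def commit_wait_ms(transactions, group_size, flush_latency_ms, flushes_per_group):
    """Time spent waiting on synchronous flushes in this toy model."""
    groups = math.ceil(transactions / group_size)  # number of group commits
    return groups * flushes_per_group * flush_latency_ms


def speedup(transactions, group_size, flush_latency_ms, work_ms_per_trx):
    """Old/new total time, assuming 2 flushes vs 1 flush per group commit."""
    work = transactions * work_ms_per_trx  # non-flush work, equal in both cases
    old = work + commit_wait_ms(transactions, group_size, flush_latency_ms, 2)
    new = work + commit_wait_ms(transactions, group_size, flush_latency_ms, 1)
    return old / new


# Small commit groups on slow storage: flushes dominate the total time.
slow_small = speedup(1000, group_size=2, flush_latency_ms=5.0, work_ms_per_trx=0.1)
# Large commit groups on fast storage: flushes are a small part of the total.
fast_large = speedup(1000, group_size=100, flush_latency_ms=0.05, work_ms_per_trx=0.1)
print(round(slow_small, 2), round(fast_large, 2))
```

With these invented numbers the first case comes out close to the model's ceiling of 2, while the second is barely above 1. Real results depend on far more than this, so measure with your own workload and hardware.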

Technical background

The core of a transactional system like a database – but also for example a file-system – is its transactional log, also referred to as the write-ahead log or redo log, amongst others:

https://en.wikipedia.org/wiki/Transaction_log

This log is the core of how the database achieves a high throughput of updates to data stored on its disks, while simultaneously being able to gracefully recover into a consistent state if the system crashes during operation.

Unfortunately, MariaDB does not have a central implementation of its transaction log. The main storage engine, InnoDB, has its own implementation, which is separate from the logs used by other parts of the server; in particular the (old) binlog is a separate “transaction log”, and there are other logs used by the Aria storage engine, by DDL operations, etc. Some parts do not even have any transaction log backing them, and are thus not crash-safe. Arguably, this lack of a central transaction log is the biggest architectural limitation of MariaDB currently.

For the binlog in particular, having the binlog separate from the InnoDB write-ahead log causes not just a lot of code complexity, but also a huge performance cost. Because of the two separate logs, it is necessary to use a two-phase commit protocol between the two. This requires two separate synchronous disk writes per (group) commit; otherwise a crash would leave the data in one log inconsistent with the other, and replication would break. The need for these two disk flushes is a huge overhead.

The new binlog implementation fixes this, by re-implementing the binlog data format inside of InnoDB. Similar to InnoDB tablespace files, the new binlog files are now being handled through the InnoDB write-ahead log. This means that when a transaction commit happens, both the table data and the binlog data get written through the InnoDB write-ahead log. The write of data to binlog files can happen later, asynchronously and in an efficient manner. The InnoDB write-ahead log will be used to recover both table data and binlog data into a consistent state, and the overhead of being able to do so is being re-used for the binlog part. Thus, the overhead of two-phase commit and binlog disk flushes is gone, which is a major contribution to the performance improvements of the new binlog.
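
The difference can be sketched with a toy model in Python. This is only an illustration of the idea, not the InnoDB implementation: the Log class, the record layout and the function names are all invented here, and flush() merely counts calls where a real system would issue a synchronous disk write.

```python
class Log:
    """Append-only log that counts synchronous flushes (stand-ins for fsync)."""

    def __init__(self):
        self.records = []
        self.flushes = 0

    def append(self, *record):
        self.records.append(record)

    def flush(self):
        self.flushes += 1


def commit_two_logs(engine_log, binlog, trx_id, change, event):
    """Two-phase commit across separate logs: two flushes per commit."""
    engine_log.append("prepare", trx_id, change)
    engine_log.flush()                   # prepare must be durable first
    binlog.append("event", trx_id, event)
    binlog.flush()                       # then the binlog event must be durable
    engine_log.append("commit", trx_id)  # this record can reach disk later


def commit_one_log(wal, trx_id, change, event):
    """Table change and binlog event share one log: one flush per commit."""
    wal.append("table", trx_id, change)
    wal.append("binlog", trx_id, event)
    wal.append("commit", trx_id)
    wal.flush()


def recover(wal):
    """Replay only committed transactions, into table data and binlog alike."""
    committed = {rec[1] for rec in wal.records if rec[0] == "commit"}
    table, binlog = [], []
    for kind, trx_id, *payload in wal.records:
        if trx_id in committed and kind == "table":
            table.append(payload[0])
        elif trx_id in committed and kind == "binlog":
            binlog.append(payload[0])
    return table, binlog


engine_log, old_binlog, wal = Log(), Log(), Log()
commit_two_logs(engine_log, old_binlog, 1, "row A", "event A")
commit_one_log(wal, 1, "row A", "event A")
wal.append("table", 2, "row B")  # transaction 2 had not committed at the "crash"
print(engine_log.flushes + old_binlog.flushes, wal.flushes)
print(recover(wal))
```

Because recovery reads a single log, the table data and the binlog cannot disagree about which transactions committed. With two logs, that agreement has to be enforced by the extra flush.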

More subtle, but at least as important, are the improvements under the hood in the code implementing the new binlog.

The new binlog is implemented in InnoDB through an extension of the storage engine API. This means that another storage engine could in principle implement its own version of the binlog, which would be beneficial for users that mainly use that storage engine for their data. But perhaps more importantly, it means that there is now a well-defined API for how the binlog writes work and what operations are possible on it. This gives a much cleaner separation between two things: on one side, the file format and operations used to store the binlog on disk and read it back; on the other, the actual contents of the binlog in the form of replication events used by slaves to replicate the master’s data.

And the actual file format of the new binlog is also greatly improved.

The old binlog is a very naive implementation: it is just a flat file, with each individual binlog event written as a raw sequence of bytes, one after the other. This is inefficient for the underlying file system, as each write has to update two places on disk: the actual data written to the end of the file, and the metadata recording the increase in file length. It also makes it impossible to start reading the binlog file from an arbitrary place, since the start of a new event cannot be distinguished from arbitrary data contained inside an event.

The new binlog has a proper page-based file format. The files can be pre-allocated efficiently on the file system using e.g. posix_fallocate(), and written efficiently page-by-page to the disk. And the binlog data records have proper framing within pages, so that it is possible to look at an arbitrary page in the file and understand what kind of data is there and where one record ends and the next one begins. Having a good page-based file format for the binlog is a great improvement, and something that I have desired for many years.
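
To illustrate why page framing allows reading from an arbitrary place, here is a minimal sketch in Python. It is emphatically not the on-disk layout of the new binlog (see Docs/replication/binlog.md for that). The tiny page size, the 2-byte length prefix, and the rule that a record never crosses a page boundary are simplifying assumptions made up for this example.

```python
import struct

PAGE_SIZE = 64             # tiny page for demonstration purposes only
LEN = struct.Struct("<H")  # hypothetical 2-byte little-endian length prefix


def build_pages(records):
    """Pack records into fixed-size pages; no record crosses a page boundary."""
    pages, current = [], b""
    for rec in records:
        framed = LEN.pack(len(rec)) + rec
        if not rec or len(framed) > PAGE_SIZE:
            raise ValueError("record must be non-empty and fit in one page")
        if len(current) + len(framed) > PAGE_SIZE:
            pages.append(current.ljust(PAGE_SIZE, b"\0"))  # zero-pad the page
            current = b""
        current += framed
    if current:
        pages.append(current.ljust(PAGE_SIZE, b"\0"))
    return b"".join(pages)


def read_page(data, page_no):
    """Decode the records of one page without looking at any earlier page."""
    page = data[page_no * PAGE_SIZE:(page_no + 1) * PAGE_SIZE]
    records, pos = [], 0
    while pos + LEN.size <= len(page):
        (length,) = LEN.unpack_from(page, pos)
        if length == 0:  # zero length marks the padding at the end of the page
            break
        pos += LEN.size
        records.append(page[pos:pos + length])
        pos += length
    return records


records = [f"event-{i:02d}".encode() for i in range(20)]
data = build_pages(records)
# Jump straight to the second page; every page starts on a record boundary.
print(read_page(data, 1))
```

In a flat stream of length-prefixed events, the same jump would land at an unknown position inside some event. Here every page begins at a known file offset and at a known record boundary, so a reader can start from any page.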

In many ways, the main benefit to me of the new binlog format is not so much the immediate performance gains, though these are quite substantial already. The really important benefits are the possibilities that are now open for future development and improvement of the binlog and replication: many things that were previously impossible to achieve due to the limitations and convoluted code and design.

For example, with the new binlog, large transactions are now no longer constrained to be written into the binlog as a single block at commit time; they can be written in pieces spread out over the binlog files as the transaction executes. This opens the possibility for having the slaves replicate these pieces optimistically in parallel with the transaction running on the master. This has the potential to greatly reduce the replication lag caused by long-running transactions.

Another example is if and when InnoDB is extended with an option for log archiving, so that the InnoDB write-ahead log is not overwritten cyclically, but written as a sequence of files containing the complete redo data. Then the new binlog API could be used to implement the binlog data completely inside the InnoDB write-ahead log, so that replication could simply read the binlog data out of the archived log files, and the overhead of having separate binlog files could be eliminated completely.

And there are many other improvements, small and large, that will now be possible to do going forward, based on the improvements done in this project.

Final words

Thanks for reading this far! I encourage you to try out the new binlog and see how it works. Any questions or reports of problems are welcome; please direct all queries to the developers@ or discuss@ mailing lists.
