Dissecting the MySQL replication binlog events

For the replication project that I am currently working on in MariaDB, I wanted to understand exactly what information is needed to do full replication of all MySQL/MariaDB statements on the level of completeness that existing replication does. So I went through the code, and this is what I found.

What I am after here is a complete list of what the execution engine needs to provide to have everything that a replication system needs to be able to completely replicate all changes made on a master server. But not anything specific to the particular implementation of replication used, like binlog positions or replication event disk formats, etc.

The basic information needed is of course the query (for statement-based replication), or the column values (for row-based replication). But there are lots of extra details needed, especially for statement-based replication. I need to make sure that the replication API we are designing will be able to provide all needed information, and it was always nagging in the back of my head that there would be lots and lots of small bits in various corners that would be missing and cause problems. So it was good to get this overview. Turns out that there are a lot of details, but not that many, and it should be manageable.

All of the events that are used in replication are listed in enum Log_event_type in sql/log_event.h. So anything needed for complete replication can be found here, but mixed up with lots of other details about the MySQL binlog implementation, backwards compatibility, etc. So what follows is an extract from log_event.cc of the actual change information contained in those events.

Statement-based replication

QUERY_EVENT

The main event for statement-based replication is QUERY_EVENT. It contains the query to be executed (as a string) and some information to provide the context for correct execution. Here is the list of information:

  • SQL query.
  • Default database for the query (eg. from USE statement).
  • The setting of some server variables in effect at the time the query was run:
    • sql_mode.
    • autocommit (whether autocommit is enabled).
    • Character set and collation at various levels (see the section 9.1.4. Connection Character Sets and Collations in the MySQL manual for background on these):
      • Client (character_set_client).
      • Connection (character_set_connection).
      • Server (character_set_server).
      • Current default database (character_set_database; note that there are few statements that rely on this, comments in the code say it is only LOAD DATA).
    • foreign_key_checks (whether foreign keys are checked).
    • unique_checks (whether unique constraint checks are enforced).
    • auto_increment_offset and auto_increment_increment.
    • Time zone of the master database server.
    • Names to use for days and months; this is identified by a code that is mapped to a table of names to use in sql/sql_locale.cc.
    • sql_auto_is_null (whether SELECT ... WHERE autoinc IS NULL returns last insert id for ODBC compatibility).
  • Error code from executing the query on the master (for non-transactional statements that may still make permanent changes even though they fail mid-way; on the slave the query should fail with the same error).
  • Connection ID (this is used to correctly distinguish TEMPORARY TABLEs with same name used in different connections on the master simultaneously).

Note that not all of this information is replicated in all query events, as not all of it is needed for a given query. But a replication API must make the information available for the queries where it is needed.

INTVAR_EVENT, RAND_EVENT, and USER_VAR_EVENT

These events provide additional context for executing a query on the slave:

  • Value of LAST_INSERT_ID (for queries that reference it).
  • Value of INSERT_ID (to get same auto_increment numbers for inserts on the slave as on the master).
  • The random seed (so RAND() can return same values in queries on slaves as on the master).
  • The values for any @user_variables referenced in a query

BEGIN_LOAD_QUERY_EVENT, APPEND_BLOCK_EVENT, EXECUTE_LOAD_QUERY_EVENT, and DELETE_FILE_EVENT

These four events are used to do statement replication of LOAD DATA INFILE. The contents of the file to be loaded is sent in blocks in BEGIN_LOAD_QUERY_EVENT followed by zero or more APPEND_BLOCK_EVENT. Then the actual query is sent in EXECUTE_LOAD_QUERY_EVENT, which is a variant of QUERY_EVENT that replaces the original filename with the name of a temporary file on the slave and deletes the temporary file afterwards (DELETE_FILE_EVENT is used in certain error cases).

This is the complete story of exactly how much information needs to be provided on the master to make statement replication work as it does currently in MySQL. If you get the thought that this is a little bit scary in terms of complexity I tend to agree with you ;-). There is a lot to be said for the comparative simplicity of row-based replication (and it is also interesting to see the history of bug fixes in MySQL 5.1 that gradually have moved more and more statements to be replicated row-based (in mixed-mode binlogging) due to corner cases where statement-based replication can fail).

Still, once we have the list of information, it is not that hard to provide the information in a pluggable replication API for any implementations that want to try their luck with statement-based replication. And of course, row-based replication only handles INSERT/UPDATE/DELETE! We also need to support CREATE TABLE and similar statements, for which it is still useful to know the above exhaustive list of information that may be needed in one form or another.

Row-based replication

In row-based replication, each DML statement is binlogged in two parts. First the tables modified in the query are described with TABLE_MAP_EVENT, and second the row values changed are logged with WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, or DELETE_ROWS_EVENT.

TABLE_MAP_EVENT

The information describing modified tables in row-based replication is as follows:

  • Database name.
  • Table name.
  • List of columns in the table. For each column, the following information is included:
    • Column type (this is field->type()).
    • Column metadata (this is what is returned by field->save_field_metadata(); this is for example the maximum length of a VARCHAR, the precision and number of decimals in DECIMAL, etc.)
    • Whether the column is NULL-able.
  • Table map id; this is just an internally generated uniqie number for subsequent events to refer to the table described.

Note in particular that column names are not used/needed in current MySQL/MariaDB row-based replication. I personally think this is a good way to do it. However, in a generic API, it will make sense to make the full table definition available to implementations, each of which can choose what and how to log in terms of table metadata.

WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, and DELETE_ROWS_EVENT

These events handle replication of respectively INSERT, UPDATE, and DELETE (and similar statements like REPLACE etc.) They contain the following information:

  • Table map id, referencing a table previously described with TABLE_MAP_EVENT.
  • Value of foreign_key_checks and unique_checks, similar to statement-based binlogging (but for row-based, those two are all the context storedm though see remarks below).
  • List (bitmap really) of columns updated. This is essentially the write_set that is used in the storage engine API (but see below for explanation). For UPDATE, there are two bitmaps, one for the before image and one for the after image.
  • List of records containing the values of each column modified. There is one such record for every row update logged. For UPDATE there are two records for each row update, one for the before image (values before the update was done) and one for the after image (values after the update was done).

I must say, investigating how these row-based events are implemented in MySQL really makes the feature seem rather half-baked. There are several issues:

  • The lists/bitmaps of column updated sound useful, but in reality they are set unconditionally to include all columns! Except for NDB).
  • This also means all columns in a row are always sent, even for DELETE and UPDATE. Except for NDB, which only logs needed columns.
  • Some of the “extra” flags in the storage engine API are not included, such as HA_EXTRA_WRITE_CAN_REPLACE. This is actually a bug, as it means that a storage engine using such flags to optimise its operation will not replicate correctly. In the existing MySQL source, only NDB uses this flag, but NDB does special tricks for binlogging and slave replication which avoids this particular issue in most cases.

I strongly suspect that some of this half-baking was done in a quick-and-dirty attempt to squeeze NDB replication in. At least, there are several “this is only used by NDB” type comments in the vicinity of these things in the source code.

In any case, for the replication API, it is probably a good idea to re-think this part and make sure that the information logged for row updates is complete and sane for all reasonable use cases.

Thoughts

My idea is to have a replication API that provides for generation and consumption of events completely separate from any details of the actual format of events in the binlog or any other method used to store or process the events. This will allow replication plugins that use a completely different binlog implementation, or even has no binlog at all.

So such an API needs to provide all of the above information (to allow re-implementing the existing binlog/replication as a plugin, if for no other reason), but need not provide such information in any particular event format. In fact, I am trying to make the API so that such information need not be materialised in structures or memory buffers at all; instead relying on providing accessor methods, so that an implementation can request just the information it needs, and materialise it as or if needed.

On top of this I still think it makes sense to define a standard (but optional) materialised event format, so that more light-weight plugins can be written that can do interesting things with replication without having to implement a full new event format each time. I am still considering whether to extend the existing binlog format (which is not all that attractive, as it is not very easily extensible), or whether to define a new more flexible format (for example based on the Google protobuffer library).

More on the existing binlog format

Just for completeness, here is some additional description of the existing MySQL/MariaDB 5.1 binlog format. These are things that I believe are not required in a new API, as they are mostly internal implementation details. However, as I had to go through them anyway while finding the stuff that does need to be in the API, I will include a brief description here.

Additional query information

Some additional information, which is mostly redundant, is included with query events for statement-based binlogging:

  • Bitmap of tables affected by multi-table update (this allows to know which tables will be updated without parsing the query, eg. for filtering events based on database/table name.)
  • Time spent in query on master.
  • Catalog (I believe this is old unused stuff. Idea is that each database belongs to a catalog, but I have never seen this actually used anywhere).
  • A flag LOG_EVENT_THREAD_SPECIFIC_F which is set if the query uses TEMPORARY table (allows to get this information without parsing the query).
  • A flag LOG_EVENT_SUPPRESS_USE_F set in a few cases when the master knows that the query is independent of what the current database is (so that a possible USE statement can be optimised away).

Binlog specific events

These are events that are specific to the binlog implementation:

XID_EVENT

This is used to record a transaction ID for each transaction written to the binlog in 2-phase commit. This recorded ID is needed during crash recovery on the master to know which prepared transactions in transactional engines need to be recovered to get consistency with what is in the binlog. It is not used on the slave in replication (though this events implies a COMMIT, which _does_ have effect on the slave, of course.)

FORMAT_DESCRIPTION_EVENT

This event is written at the start of every binlog file. It provides to slaves reading the binlog the master server version and the event size of all following events, thereby providing some facilities for extending event formats while maintaining backwards compatibility.

STOP_EVENT

This is logged when the master shuts down gracefully (though I do not think this is used much, if any)

ROTATE_EVENT

This is logged at the end of a binlog file when the master starts a new binlog file. It is needed by the slave to reset it’s master binlog position so that the IO thread can proceed correctly from the next binlog file (incidentally, it is a clear weakness in the binlog implementation that the slaves need knowledge about binlog file names and data offsets on the master server, and is a cause of much complexity when switching masters in advanced replication topologies. Something that really needs improvements in the near future).

INCIDENT_EVENT

This is logged by the master when something bad happens that may cause replication to fail/diverge, so that the slave can be notified of the problem and stop, informing the DBA/sysadm to resolve the issue.

Obsolete events

Finally there are a number of events that are no longer generated (but which are still important for the slave replication code to handle to be able to work with masters of older versions):

LOAD_EVENT, NEW_LOAD_EVENT, CREATE_FILE_EVENT, and EXEC_LOAD_EVENT

Various old events for handling LOAD DATA INFILE (as can be seen, LOAD DATA INFILE has had some changes in replication over the years :-).

START_EVENT_V3

Old version of FORMAT_DESCRIPTION_EVENT.

PRE_GA_WRITE_ROWS_EVENT, PRE_GA_UPDATE_ROWS_EVENT, and PRE_GA_DELETE_ROWS_EVENT

Old versions of the row-based replication binlog events.

SLAVE_EVENT

Not used, I think it may have been related to some feature that was never completed.

2 comments

  1. Another good entry

    Hi Kristian, a high quality post as usual. Great to see this information managing to escape the code! I have started a write-up on Ndb replication that may see the light of day eventually.
    Frazer

  2. provides access

    Hello, I discovered your blog in a new directory of blogs. I dont know how your blog came up. Your weblog looks good. Have a good day.

Leave a comment

Your email address will not be published. Required fields are marked *