MySQL/MariaDB replication: applying events on the slave side

Working on a new set of replication APIs in MariaDB, I have given some thought to the generation of replication events on the master server.

But there is another side of the equation: to apply the generated events on a slave server. This is something that most replication setups will need (unless they replicate to non-MySQL/MariaDB slaves). So it will be good to provide a generic interface for this, otherwise every binlog-like plugin implementation will have to re-invent this themselves.

A central idea in the current design for generating events is that we do not enforce a specific content of events. Instead, the API provides accessors for a lot of different information related to each event, allowing the plugin flexibility in choosing what to include in a particular event format. For example, one plugin may request column names for a row-based UPDATE event; another plugin may not need them and can avoid any overhead related to column names simply by not requesting them.

To get the same flexibility on the slave side, the roles of plugin and API are reversed. Here, the plugin will have a certain pre-determined (by the particular event format implemented) set of information related to the event available. And the API must make do with whatever information it is provided (or fail gracefully if essential information is missing).

My idea is that the event application API will provide corresponding events to the events in the generation API. Each application event will have “provider” methods corresponding to the accessor methods of the generator API. So the plugin that wants to apply an event can obtain an event generator object, call the appropriate provider methods for all the information available, and finally ask the API to execute the event with the provided context information.

This is only an abstract idea at this point; there are lots of details to take care of to make this idea into a concrete design proposal. And I have not fully decided if such an API will be part of the replication project or not. But I like the idea so far.

Understanding how MySQL binlog events are applied on the slave

I wanted to get a better understanding of what is involved in an event application API like the one described above. So I did a similar exercise to the one I wrote about in my last post, where I went through in detail all the information that the existing MySQL binlog format includes. This time I went through the details in the code that applies MySQL binlog events on a slave.

Again, I concentrate on the actual events that change the database, ignoring (most) details that relate only to the particular binlog format used by MySQL (and there are quite a few :-).

At the top level, the slave SQL thread (in exec_relay_log_event()) reads events (next_event()) from the relay logs and executes (apply_event_and_update_pos()) them.

There are a number of intricate details here relating to switching to a new relay log (and purging old ones), and re-winding in the relay log to re-try a failed transaction (eg. in case of deadlock or the like). This is mostly specific to the particular binlog implementation.

The actual data changes are mostly done in the do_apply_event() methods of each event class in sql/log_event.cc. I will go briefly through this method for the events that are used to change actual data in current MySQL replication. It is relatively easy to read a particular detail out of the code, since it is all located in these do_apply_event() methods.

(One problem however is that there are special cases sprinkled across the code where special action is taken (or not taken) when running in the slave SQL thread. I have not so far tried to determine the full list of such special cases, or access how many there are).

Query_log_event

The main task done here is to set up various context in the THD, execute the query, and then perform necessary cleanup. If the query throws an error, there is also some fairly complex logic to handle this error correctly; for example to ignore certain errors, to require same error as on master (if the query failed on the master in some particular way), and to re-try the query/transaction for certain errors (like deadlock).

There are also some hard-coded special cases for NDB (this seems to be a common theme in the replication code).

The main think that to my eyes make this part of the code complex is the set of actions taken to prepare context before executing the query, and to clean up after execution. Each individual step in the code is in fact relatively easy to follow (and often the commenting is quite good). The problem is that there are so many individual steps. It is very hard to feel sure that exactly this set of actions is sufficient (and that none are redundant for that matter).

For example, code like this (not complete, just a random part of the setup):

    thd->set_time((time_t)when);
    thd->set_query((char*)query_arg, q_len_arg);
    VOID(pthread_mutex_lock(&LOCK_thread_count));
    thd->query_id = next_query_id();
    VOID(pthread_mutex_unlock(&LOCK_thread_count));
    thd->variables.pseudo_thread_id= thread_id;

It seems very easy to forget to assign query_id or whatever if one was to write this from scratch. That is something I would really like to improve in an API for replication plugins: it should be possible to understand completely exactly what setup and cleanup is needed around event execution, and there should be appropriate methods to achieve such setup/cleanup.

Another thing that is interesting in the code for Query_log_event::do_apply_event() is that the code does strcmp() on the query against keywords like COMMIT, SAVEPOINT, and ROLLBACK, and follows a fairly different execution path for these. This seems to bypass the SQL parser! But in reality, these particular queries are generated on the master with special code in the server that carefully restricts the possible format (eg. no whitespace or comments etc). So in effect, this is just a way to hack in special event types for these special queries without actually adding new binlog format events.

Rand_log_event, Intvar_log_event, and User_var_log_event

The action taken for these events is essentially to update the THD with the information in the event: value of random seed, LAST_INSERT_ID/INSERT_ID, or @user_variable.

Xid_log_event

On the slave side, this event is essentially a COMMIT. One thing that springs to mind is how different the code to handle this event is from the special code in Query_log_event that handles a query “COMMIT”. Again, it is very hard to tell from the code if this is a bug, or if the different-looking code is in fact equivalent.

Begin_load_query_log_event, Append_block_log_event, Delete_file_log_event, and Execute_load_query_log_event

This implements the LOAD DATA INFILE query, which needs special handling as it originally references a file on the master server (or on the client machine). The actual execution of the query is handled the same way as for normal queries (Execute_load_query_log_event is a sub-class of Query_log_event), but some preparation is needed first to write the data from the events in the relay log into a temporary file on the slave.

The main thing one notices about this code is how it handles re-writing the LOAD DATA query to use a new temporary file name on the slave. Take for example this query:

    LOAD DATA CONCURRENT /**/ LOCAL INFILE 'foobar.dat' REPLACE INTO /**/ TABLE ...

The event includes offsets into the query string of the two places marked /**/ in the example. This part of the query is then re-written in the slave code (so it is not just replacing the filename). Again, the code by-passes the SQL parser, it just so happens that the SQL syntax in this case is sufficiently simple that this is not too hackish to do. If one were to check, one would probably see that any user comments in this particular part of the query string disappear in the slave binlog if --log-slave-updates is enabled.

Table_map_log_event

This just links a struct RPL_TABLE_LIST into a list, containing information about the tables described by the event. The actual opening of the table is done when executing the first row event (WRITE/UPDATE/DELETE).

Write_rows_log_event, Update_rows_log_event, and Delete_rows_log_event

These are what handle the application of row-based replication events on a slave.

The first of the row events causes some setup to be done, a partial extract is this:

    lex_start(thd);
    mysql_reset_thd_for_next_command(thd, 0);
    thd->transaction.stmt.modified_non_trans_table= FALSE;
    query_cache.invalidate_locked_for_write(rli->tables_to_lock);

(As remarked for Query_log_event, while row event application setup is somewhat simpler, it still appears a bit magic that exactly these setups are sufficient and necessary).

The code also switches to row-based binlogging for the following row operations (this is for --log-slave-updates, as it is not possible to binlog the application of row-based events as statement-based events in the slave binlog). This is by the way an interesting challenge for a generic replication API: how does one handle binlogging of events applied on a slave, for daisy-chaining replication servers? This of course gets more interesting if one were to use a different binlogging plugin on the slave than on the master. I need to think more about it, but this seems to be a pretty strong argument that a generic event application API is needed, which hooks into event generators to properly generate all needed events for the updates done on slaves. Another important aspect is the support of a global transaction ID, that will identify a transaction uniquely across an entire replication setup to make migrating slaves to a new master easier. Such a global transaction ID also needs to be preserved in a slave binlog when replicating events from a master.

During execution of the first row event, the code also sets the flags for foreign key checks and unique checks that is included in the event from the master. And it checks if the event should be skipped (for --replicate-ignore-db and friends).

A check is made to ensure that the table(s) on the slave are compatible with the tables on the master (as described in the table map event(s) received just before the row event(s)). The MySQL row-based replication has a fair bit of flexibility in terms of allowing differences in tables between master and slave, such as allowing different column types, different storage engine, or even extra columns on the slave table. In particular allowing extra columns raises some issues about default values etc. for these columns, though I did not really go into details about this.

Again, there are a number of hard-coded differences for NDB.

For Write_rows_log_event, flags need to be set to ensure that column values for AUTO_INCREMENT and TIMESTAMP columns are taken from the supplied values, not auto-generated. If slave_exec_mode=IDEMPOTENT, an INSERT that fails due to already existing row does not cause replication to fail; instead an UPDATE is tried, or in some cases (like if there are foreign keys) a DELETE + re-tried INSERT. There is also code to hint storage engines about the approximate number of rows that are part of a bulk insert.

For Update_rows_log_event and Delete_rows_log_event, the code needs to locate the row to update/delete. This is done by primary key if one exists on the slave table (the current binlogging always includes every column in the before image of the row, so the primary key value of the row to modify is always available). But there is also support for tables with no primary key, in which the first index on the table is used to locate the row (if any), or failing that a full table scan. This btw. is a good reminder to not use row-based replication with primary-key-less tables with non-trivial amount of rows: every row operation applied will need a full table scan!

Finally, the values from the event are unpacked into a row buffer in the format used by MySQL storage engines, and the read_set and write_set are set up (current replication always includes all columns in row operations), before the actual ha_write_row(), ha_update_row(), or ha_delete_row() call into the storage engine handler is made to perform the actual update. Note that a single row event can include multiple rows, which are applied one after the other.

Final words

And that is it! Quite a bit of detail, but again I found it very useful to create this complete overview; it will make things easier when re-implementing this in a new replication API.