For the replication project that I am currently working on in MariaDB, I wanted to understand exactly what information is needed to do full replication of all MySQL/MariaDB statements on the level of completeness that existing replication does. So I went through the code, and this is what I found.
What I am after here is a complete list of what the execution engine needs to provide to have everything that a replication system needs to be able to completely replicate all changes made on a master server. But not anything specific to the particular implementation of replication used, like binlog positions or replication event disk formats, etc.
The basic information needed is of course the query (for statement-based replication), or the column values (for row-based replication). But there are lots of extra details needed, especially for statement-based replication. I need to make sure that the replication API we are designing will be able to provide all needed information, and it was always nagging in the back of my head that there would be lots and lots of small bits in various corners that would be missing and cause problems. So it was good to get this overview. Turns out that there are a lot of details, but not that many, and it should be manageable.
All of the events that are used in replication are listed in
enum Log_event_type in
sql/log_event.h. So anything needed for complete replication can be found here, but mixed up with lots of other details about the MySQL binlog implementation, backwards compatibility, etc. So what follows is an extract from
log_event.cc of the actual change information contained in those events.
The main event for statement-based replication is QUERY_EVENT. It contains the query to be executed (as a string) and some information to provide the context for correct execution. Here is the list of information:
- SQL query.
- Default database for the query (eg. from
- The setting of some server variables in effect at the time the query was run:
autocommit(whether autocommit is enabled).
- Character set and collation at various levels (see the section 9.1.4. Connection Character Sets and Collations in the MySQL manual for background on these):
- Client (
- Connection (
- Server (
- Current default database (
character_set_database; note that there are few statements that rely on this, comments in the code say it is only
- Client (
foreign_key_checks(whether foreign keys are checked).
unique_checks(whether unique constraint checks are enforced).
- Time zone of the master database server.
- Names to use for days and months; this is identified by a code that is mapped to a table of names to use in
SELECT ... WHERE autoinc IS NULLreturns last insert id for ODBC compatibility).
- Error code from executing the query on the master (for non-transactional statements that may still make permanent changes even though they fail mid-way; on the slave the query should fail with the same error).
- Connection ID (this is used to correctly distinguish
TEMPORARY TABLEs with same name used in different connections on the master simultaneously).
Note that not all of this information is replicated in all query events, as not all of it is needed for a given query. But a replication API must make the information available for the queries where it is needed.
These events provide additional context for executing a query on the slave:
- Value of
LAST_INSERT_ID(for queries that reference it).
- Value of
INSERT_ID(to get same auto_increment numbers for inserts on the slave as on the master).
- The random seed (so RAND() can return same values in queries on slaves as on the master).
- The values for any
@user_variablesreferenced in a query
These four events are used to do statement replication of
LOAD DATA INFILE. The contents of the file to be loaded is sent in blocks in
BEGIN_LOAD_QUERY_EVENT followed by zero or more
APPEND_BLOCK_EVENT. Then the actual query is sent in
EXECUTE_LOAD_QUERY_EVENT, which is a variant of
QUERY_EVENT that replaces the original filename with the name of a temporary file on the slave and deletes the temporary file afterwards (
DELETE_FILE_EVENT is used in certain error cases).
This is the complete story of exactly how much information needs to be provided on the master to make statement replication work as it does currently in MySQL. If you get the thought that this is a little bit scary in terms of complexity I tend to agree with you ;-). There is a lot to be said for the comparative simplicity of row-based replication (and it is also interesting to see the history of bug fixes in MySQL 5.1 that gradually have moved more and more statements to be replicated row-based (in mixed-mode binlogging) due to corner cases where statement-based replication can fail).
Still, once we have the list of information, it is not that hard to provide the information in a pluggable replication API for any implementations that want to try their luck with statement-based replication. And of course, row-based replication only handles
DELETE! We also need to support
CREATE TABLE and similar statements, for which it is still useful to know the above exhaustive list of information that may be needed in one form or another.
In row-based replication, each DML statement is binlogged in two parts. First the tables modified in the query are described with
TABLE_MAP_EVENT, and second the row values changed are logged with
The information describing modified tables in row-based replication is as follows:
- Database name.
- Table name.
- List of columns in the table. For each column, the following information is included:
- Column type (this is
- Column metadata (this is what is returned by
field->save_field_metadata(); this is for example the maximum length of a
VARCHAR, the precision and number of decimals in
- Whether the column is NULL-able.
- Column type (this is
- Table map id; this is just an internally generated uniqie number for subsequent events to refer to the table described.
Note in particular that column names are not used/needed in current MySQL/MariaDB row-based replication. I personally think this is a good way to do it. However, in a generic API, it will make sense to make the full table definition available to implementations, each of which can choose what and how to log in terms of table metadata.
These events handle replication of respectively
DELETE (and similar statements like
REPLACE etc.) They contain the following information:
- Table map id, referencing a table previously described with
- Value of
unique_checks, similar to statement-based binlogging (but for row-based, those two are all the context storedm though see remarks below).
- List (bitmap really) of columns updated. This is essentially the
write_setthat is used in the storage engine API (but see below for explanation). For
UPDATE, there are two bitmaps, one for the before image and one for the after image.
- List of records containing the values of each column modified. There is one such record for every row update logged. For
UPDATEthere are two records for each row update, one for the before image (values before the update was done) and one for the after image (values after the update was done).
I must say, investigating how these row-based events are implemented in MySQL really makes the feature seem rather half-baked. There are several issues:
- The lists/bitmaps of column updated sound useful, but in reality they are set unconditionally to include all columns! Except for NDB).
- This also means all columns in a row are always sent, even for
UPDATE. Except for NDB, which only logs needed columns.
- Some of the “extra” flags in the storage engine API are not included, such as
HA_EXTRA_WRITE_CAN_REPLACE. This is actually a bug, as it means that a storage engine using such flags to optimise its operation will not replicate correctly. In the existing MySQL source, only NDB uses this flag, but NDB does special tricks for binlogging and slave replication which avoids this particular issue in most cases.
I strongly suspect that some of this half-baking was done in a quick-and-dirty attempt to squeeze NDB replication in. At least, there are several “this is only used by NDB” type comments in the vicinity of these things in the source code.
In any case, for the replication API, it is probably a good idea to re-think this part and make sure that the information logged for row updates is complete and sane for all reasonable use cases.
My idea is to have a replication API that provides for generation and consumption of events completely separate from any details of the actual format of events in the binlog or any other method used to store or process the events. This will allow replication plugins that use a completely different binlog implementation, or even has no binlog at all.
So such an API needs to provide all of the above information (to allow re-implementing the existing binlog/replication as a plugin, if for no other reason), but need not provide such information in any particular event format. In fact, I am trying to make the API so that such information need not be materialised in structures or memory buffers at all; instead relying on providing accessor methods, so that an implementation can request just the information it needs, and materialise it as or if needed.
On top of this I still think it makes sense to define a standard (but optional) materialised event format, so that more light-weight plugins can be written that can do interesting things with replication without having to implement a full new event format each time. I am still considering whether to extend the existing binlog format (which is not all that attractive, as it is not very easily extensible), or whether to define a new more flexible format (for example based on the Google protobuffer library).
More on the existing binlog format
Just for completeness, here is some additional description of the existing MySQL/MariaDB 5.1 binlog format. These are things that I believe are not required in a new API, as they are mostly internal implementation details. However, as I had to go through them anyway while finding the stuff that does need to be in the API, I will include a brief description here.
Additional query information
Some additional information, which is mostly redundant, is included with query events for statement-based binlogging:
- Bitmap of tables affected by multi-table update (this allows to know which tables will be updated without parsing the query, eg. for filtering events based on database/table name.)
- Time spent in query on master.
- Catalog (I believe this is old unused stuff. Idea is that each database belongs to a catalog, but I have never seen this actually used anywhere).
- A flag
LOG_EVENT_THREAD_SPECIFIC_Fwhich is set if the query uses
TEMPORARYtable (allows to get this information without parsing the query).
- A flag
LOG_EVENT_SUPPRESS_USE_Fset in a few cases when the master knows that the query is independent of what the current database is (so that a possible
USEstatement can be optimised away).
Binlog specific events
These are events that are specific to the binlog implementation:
This is used to record a transaction ID for each transaction written to the binlog in 2-phase commit. This recorded ID is needed during crash recovery on the master to know which prepared transactions in transactional engines need to be recovered to get consistency with what is in the binlog. It is not used on the slave in replication (though this events implies a
COMMIT, which _does_ have effect on the slave, of course.)
This event is written at the start of every binlog file. It provides to slaves reading the binlog the master server version and the event size of all following events, thereby providing some facilities for extending event formats while maintaining backwards compatibility.
This is logged when the master shuts down gracefully (though I do not think this is used much, if any)
This is logged at the end of a binlog file when the master starts a new binlog file. It is needed by the slave to reset it’s master binlog position so that the IO thread can proceed correctly from the next binlog file (incidentally, it is a clear weakness in the binlog implementation that the slaves need knowledge about binlog file names and data offsets on the master server, and is a cause of much complexity when switching masters in advanced replication topologies. Something that really needs improvements in the near future).
This is logged by the master when something bad happens that may cause replication to fail/diverge, so that the slave can be notified of the problem and stop, informing the DBA/sysadm to resolve the issue.
Finally there are a number of events that are no longer generated (but which are still important for the slave replication code to handle to be able to work with masters of older versions):
Various old events for handling
LOAD DATA INFILE (as can be seen,
LOAD DATA INFILE has had some changes in replication over the years :-).
Old version of
PRE_GA_WRITE_ROWS_EVENT, PRE_GA_UPDATE_ROWS_EVENT, and PRE_GA_DELETE_ROWS_EVENT
Old versions of the row-based replication binlog events.
Not used, I think it may have been related to some feature that was never completed.