At the start of this week, we suffered a corruption of our main 5.1 source code repository at MySQL. No data was lost, but I spent most of four working days cleaning up the corruption, Monty spent one day, and many other people had to spend time on this or were stalled in their work while the problem was being resolved. That included the usual story of fetching off-site backup tapes only to find them broken, and so on.
Our source code repository is the center that all our work in Engineering revolves around, and it simply has to be stable. Confidence in the revision control software we use suffers greatly from such an experience, and the lost confidence can never really be restored.
But there is a good lesson in this for MySQL, I think.
Like revision control software, MySQL is used by our users to store their valuable data. The database is the center around which applications revolve, and it must be stable. If our customers suffer loss or corruption of data due to bugs in MySQL, the consequent loss of confidence will be impossible to restore.
There are a number of tools and procedures used to keep tight control of
the quality of the server code. For example:
- New code undergoes code review by two other developers before being accepted into the main repositories.
- We have an open bug database. Anyone can report a bug, and everyone can see which bugs are open and which were fixed in the past.
- The server is available for community testing right from the early alpha versions. Users can test new versions early (and are rewarded for doing so; MySQL has a vow to fix every repeatable bug reported in the bug DB, so by testing early releases users can make sure that the server will work in their applications once it reaches GA).
- We have a very comprehensive automatic test suite. For every bug fixed, we add a test to the test suite so that the same bug will never sneak in again unnoticed.
- We have a tool ‘autopush’ that runs the entire test suite before code is pushed to the main repository, and rejects new code if even a single test fails.
- The test suite runs with a custom debugging memory allocator, my_malloc, that tests for memory leaks. Of course, a single missed my_free during the test suite is considered a test suite failure. (Since a database server must be able to run uninterrupted for indefinite periods of time, any memory leak is a serious error.) A minimal sketch of the idea appears after this list.
- We use Valgrind to catch memory leaks in third-party libraries (which do not use my_malloc) and to catch pointer errors and other memory-related errors; the second sketch below shows the kind of error it catches.
- We use the GCov program to check that newly added code includes sufficient test cases to cover all aspects of new functionality (GCov is a tool for GCC that reports how many times each line of code is executed during a test program run); the third sketch below shows the workflow.
- We have a tool ‘Pushbuild’ that builds and tests the server source every time new code is pushed. Builds in Pushbuild cover multiple processor architectures and operating systems (Pentium, Opteron, Sparc, PowerPC, …; Linux, Windows, Solaris, HPUX, QNX, …); building with the full feature set or with just a few features enabled; debug and optimized builds; Valgrind tests; GCov tests; and others. (There has been talk about making reports from Pushbuild available externally; if you think this is a good idea, drop a comment, and it may happen sooner.)
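To make the allocator idea concrete, here is a minimal sketch of a leak-tracking allocator, assuming a simple linked-list design. The names dbg_malloc, dbg_free, and dbg_report_leaks are hypothetical; this is not MySQL's actual my_malloc implementation, just the same technique: record every allocation, and treat anything still recorded at shutdown as a test failure.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of a leak-tracking allocator in the spirit of
 * my_malloc/my_free (not MySQL's actual implementation).  Every
 * allocation is prefixed with a header and kept on a linked list;
 * anything still on the list at shutdown is reported as a leak. */

struct alloc_header {
    struct alloc_header *next;
    size_t size;
    const char *file;
    int line;
};

static struct alloc_header *live_allocs = NULL;

void *dbg_malloc(size_t size, const char *file, int line)
{
    struct alloc_header *h = malloc(sizeof(*h) + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    h->file = file;
    h->line = line;
    h->next = live_allocs;
    live_allocs = h;
    return h + 1;               /* caller's memory starts after the header */
}

void dbg_free(void *ptr)
{
    struct alloc_header *h = (struct alloc_header *) ptr - 1;
    struct alloc_header **p = &live_allocs;
    while (*p != NULL && *p != h)
        p = &(*p)->next;
    if (*p == NULL) {           /* freeing something we never handed out */
        fprintf(stderr, "dbg_free: bad pointer %p\n", ptr);
        abort();
    }
    *p = h->next;               /* unlink, then release for real */
    free(h);
}

/* Called at test-suite shutdown; a non-zero result fails the run. */
int dbg_report_leaks(void)
{
    int leaks = 0;
    for (struct alloc_header *h = live_allocs; h != NULL; h = h->next) {
        fprintf(stderr, "leak: %zu bytes allocated at %s:%d\n",
                h->size, h->file, h->line);
        leaks++;
    }
    return leaks;
}

/* Capture the allocation site so the leak report can point at it. */
#define DBG_MALLOC(sz) dbg_malloc((sz), __FILE__, __LINE__)
```

The __FILE__/__LINE__ macro trick is what lets the shutdown report name the exact allocation site that was never freed.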
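As a hypothetical illustration of why Valgrind is still needed: allocations made with plain malloc inside a third-party library are invisible to a wrapper like the one above, but Valgrind instruments all of them. Running a deliberately leaky program like this under valgrind --leak-check=full reports the lost bytes along with the stack trace of the allocation:

```c
#include <stdlib.h>
#include <string.h>

/* leak_demo.c: a deliberately leaky program (invented example).
 * Build:  gcc -g leak_demo.c -o leak_demo
 * Check:  valgrind --leak-check=full ./leak_demo
 * Valgrind reports the 16 bytes as definitely lost and points at the
 * malloc call below, with no custom allocator involved at all. */
int main(void)
{
    char *buf = malloc(16);     /* plain malloc: my_malloc never sees it */
    strcpy(buf, "leaked");
    return 0;                   /* buf is never freed */
}
```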
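And a minimal illustration of the GCov workflow, using standard GCC flags (the example program is invented; MySQL's actual build integration is not shown):

```c
/* coverage_demo.c: a tiny program to demonstrate GCov.
 * Build:  gcc --coverage -O0 coverage_demo.c -o coverage_demo
 * Run:    ./coverage_demo
 * Report: gcov coverage_demo.c
 * This produces coverage_demo.c.gcov, an annotated source listing with
 * per-line execution counts; lines marked "#####" were never executed
 * and therefore need more test cases. */
int classify(int x)
{
    if (x < 0)
        return -1;              /* never runs here: flagged as untested */
    return 1;
}

int main(void)
{
    return classify(5) == 1 ? 0 : 1;
}
```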
So overall, MySQL is in very good shape, quality-wise. But it is still good to remind ourselves from time to time why we do this, and why it is important.
Insurance pays off
Backups are like insurance: you pay into the system so that there is a payout when you need one. With backups, what you put in comes in the form of computer resources like CPU time and disk space, as well as the administrative overhead of managing the backups. And of course actual money, if you pay for an offsite service or for software.
And much like insurance, when you need a backup you are relieved and extremely grateful it is there.
So it’s nice to have a reminder every once in a while WHY you put all that work and time and money into backups. Just not too often. 🙂
I’m glad MySQL 1) is prepared to deal with problems and 2) is responsible enough to admit when there was a problem.
I hope I’ll never have to face this situation; it’s definitely overwhelming. We maintain security at our data center, and I hope that’s all it takes to avoid data corruption. You are right: backup is like insurance.