Debugging memory leaks in plugins with Valgrind

I had an interesting IRC discussion the other day with Monty Taylor about what turned out to be a limitation in Valgrind with respect to debugging memory leaks in dynamically loaded plugins.

Monty Taylor’s original problem was with Drizzle, but as it turns out, it is common to all of the MySQL-derived code bases. When there is a memory leak from an allocation in a dynamically loaded plugin, Valgrind will detect the leak, but the part of the stack trace that is within the plugin shows up as an unhelpful three question marks “???”:

==1287== 400 bytes in 4 blocks are definitely lost in loss record 5 of 8
==1287==    at 0x4C22FAB: malloc (vg_replace_malloc.c:207)
==1287==    by 0x126A2186: ???
==1287==    by 0x7C8E01: ha_initialize_handlerton(st_plugin_int*) (handler.cc:429)
==1287==    by 0x88ADD6: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1033)

This tells you little more than that one of your plugins is leaking memory.

After trying a couple of things, we found that this is a known limitation in Valgrind in relation to code that is loaded with dlopen() and later unloaded with dlclose(): http://bugs.kde.org/show_bug.cgi?id=79362

The basic problem is that Valgrind records the location of the malloc() call as just a memory address. When the memory leak check is performed after the end of program execution, the plugin has already been unloaded with dlclose(), so the recorded address no longer points into any loaded code and cannot be resolved to a function name and source line.

The problem is specific to memory leak checks, which are done only after the code has been unloaded. Other checks (like use of uninitialised values and use-after-free) work fine with full information in the stack traces, as such checks are done while the plugin code is still loaded into memory. But the memory leak checks are arguably among the most useful checks Valgrind does, as Valgrind is often the only way to find and fix critical memory leaks efficiently.

Fortunately, once the issue was understood, we had an easy work-around: disable the dlclose() call in the server plugin code, and the leak is then detected with full information in the stack trace. Unfortunately this introduces a leak of its own, since now the memory allocated in dlopen() is never freed, so we get another spurious Valgrind memory leak warning.
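As a rough illustration of this work-around, here is a minimal sketch of a close wrapper that can skip dlclose() during a leak-checking run. The function name plugin_close_for_leak_check and the keep_loaded parameter are made up for this example; this is not the actual server plugin code.

```c
#include <dlfcn.h>

/*
 * Hypothetical wrapper around dlclose() (names are assumptions for this
 * sketch, not taken from the MySQL/MariaDB/Drizzle sources).
 *
 * When keep_loaded is non-zero, the plugin is deliberately left mapped so
 * that Valgrind can still resolve addresses inside it when it prints the
 * leak report at exit. Returns 0 on success, like dlclose().
 */
int plugin_close_for_leak_check(void *handle, int keep_loaded)
{
    if (keep_loaded) {
        /* Skip the unload; the handle is intentionally leaked. */
        (void) handle;
        return 0;
    }
    return dlclose(handle);
}
```

The flag would only be set for Valgrind runs; a normal build should keep calling dlclose() as before. As noted above, leaving the plugin loaded means the memory allocated by dlopen() shows up as an extra leak in the report.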

Another possible way to get the same effect is to pass the RTLD_NODELETE flag to dlopen(), though I did not try this yet.
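A minimal sketch of what the RTLD_NODELETE variant might look like, assuming a platform whose dlopen() supports that flag (glibc does). The helper name plugin_open_resident is invented for this example, and like the idea itself it is untested against a real server.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper (name is an assumption for this sketch): open a plugin
 * so that a later dlclose() does not unmap it from the process.
 *
 * RTLD_NOW resolves all symbols at load time; RTLD_NODELETE asks the dynamic
 * loader to keep the object in memory even after dlclose().
 * Returns the handle, or NULL if loading failed.
 */
void *plugin_open_resident(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_NODELETE);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
    }
    return handle;
}
```

With this approach the existing dlclose() calls can stay where they are, which makes it less intrusive than commenting them out. On older glibc versions the program needs to be linked with -ldl.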

A possibly better work-around (which I also did not try yet) is one suggested in the above referenced Valgrind feature request. By adding the offending plugin(s) to LD_PRELOAD when starting the server, the plugin code will not actually be unloaded in dlclose(), so stack traces should be available without any spurious leak warnings from Valgrind. However, according to the suggestion in the feature request, this will not work well if some of the dynamic plugins need a particular load order. I also need to check if this actually works for plugins (like storage engines) that have link dependencies on symbols in the main program. But it might be a good option if it can be made to work.

(At first I was surprised to learn that this was a problem in MySQL and MariaDB, as I had never seen it before. But I suppose the reason is that we have so far built most plugins as built-in, rather than as dynamically loaded .so files. The problem is likely to occur more frequently as we move to do more and more with plugins in MariaDB, so it is nice to know a work-around. Thanks, Monty!)