MySQL/MariaDB single-threaded performance regressions, and a lesson in thread synchronisation primitives

I took a quick look at MariaDB 10.0 single-threaded performance (simple read-only sysbench). One thing immediately leaps to the eye, and I thought it worthy of mention. It contains an important lesson about the use of synchronisation primitives and in particular “atomic operations” in MariaDB (and MySQL).

I am using the Linux perf tool on this sysbench command:

  sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx

Look at the top offender in the output from perf report:

  1,54%  mysqld  mysqld               [.] set_thread_state_v1

The only thing this does is set a string for SHOW PROCESSLIST (and the like) about what the thread is doing. And we are spending a whopping 1.5% of the total time doing this.

And why? That becomes clear when looking at the disassembly (extract):

  2d:   mov    0x7b0(%rbx),%eax
  33:   mov    %eax,%edx
        lock   cmpxchg %eax,0x7b0(%rbx)
        jne    33
        and    $0xfffffffc,%edx
        add    $0x1,%edx
        xchg   %edx,0x7b0(%rbx)
        mov    %rbp,0x7b8(%rbx)
        mov    %ecx,0x7c0(%rbx)
        mov    0x7b0(%rbx),%eax
  5e:   mov    %eax,%edx
        lock   cmpxchg %eax,0x7b0(%rbx)
        jne    5e
        and    $0xfffffffc,%edx
        add    $0x6,%edx
        xchg   %edx,0x7b0(%rbx)

No less than two locked cmpxchg instructions. Each of these implies a full memory barrier, which is really expensive.

To achieve their very high performance, modern CPUs process many instructions at the same time, a hundred or more is possible. But for a full memory barrier, the CPU needs to stop and ensure that all pending store instructions have fully reached the last-level cache. Only then can it start processing any following instructions involving a load. This is not something we should be doing every time a thread moves to a new stage of query processing.
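
To make the difference concrete, here is a tiny illustration (my own sketch of what GCC __sync builtins typically generate on x86-64; this is not code from the server):

  #include <stdint.h>

  /* Illustration only, not server code. */
  static volatile uint32_t state;

  /* Full-barrier store: on x86-64 this compiles to an implicitly locked
     xchg instruction, which forces the store buffer to drain before any
     later load can complete. */
  void store_full_barrier(uint32_t v)
  {
    __sync_lock_test_and_set(&state, v);
  }

  /* Plain store: a single mov. On x86, ordinary stores are never
     re-ordered with other stores, so store ordering comes for free. */
  void store_plain(uint32_t v)
  {
    state= v;
  }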

I am not really familiar with the performance schema code (nor am I sure I want to be). But let us make some guess at what the code is trying to do here. Here is the code from which the first locked instruction originates:

  /**
    Execute an allocated to dirty transition.
    This transition should be executed by the writer that owns the record,
    before the record is modified.
  */
  void allocated_to_dirty(void)
  {
    uint32 copy= PFS_atomic::load_u32(&m_version_state);
    uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
    /* We own the record, no need to use compare and swap. */
    PFS_atomic::store_u32(&m_version_state, new_val);
  }

So perhaps the idea is that only the thread itself can modify its state, but other threads can read it. And we want the update to be fast (I suppose making the update fast matters more than making the inspection fast). So the developers wanted to avoid a lock that could block the writer. Instead, the writer sets a flag to mark that the data is being modified before the update, and clears the flag after. A reader can then check the flag; if the flag was set or changed during the read, the data is potentially inconsistent, and the read can be re-tried.
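
The reader side is not visible in the profile, but presumably it works along these lines. This is only my sketch of the idea, with made-up variable names and flag layout, not the actual performance schema code:

  #include <stdint.h>

  /* Hypothetical data protected by a "dirty flag + version" word. */
  static volatile uint32_t version_state;   /* low bit = dirty, rest = version */
  static const char *state_ptr;
  static int state_len;

  bool try_read_state(const char **ptr, int *len)
  {
    uint32_t before= version_state;
    if (before & 1)                          /* writer is in the middle of an update */
      return false;
    /* (a read barrier would be needed here, see below) */
    *ptr= state_ptr;
    *len= state_len;
    /* (and a second read barrier here) */
    uint32_t after= version_state;
    return before == after;                  /* flag or version changed: caller retries */
  }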

Now, this really is poorly designed. PFS_atomic::load_u32() and PFS_atomic::store_u32() are implemented using my_atomic_load32() and my_atomic_store32(), which is inappropriate here. The my_atomic() facility implies a full memory barrier in all of its operations, which makes them somewhat less useful for performance-critical code.

All we need here is that the setting of the update flag becomes visible before changes to the state, and that the clearing of the flag becomes visible after. Thus it should be sufficient with a pair of write memory barriers in the writer (and corresponding read memory barriers in the reader).

The write and read memory barriers are much less expensive than a full barrier. All they do is prevent stores (respectively loads) from being re-ordered across the barrier: a store (load) issued after the barrier cannot become visible (be satisfied) before a store (load) issued before it. There is no need for the CPU to stop execution. In fact, on x86, there is usually nothing special to do, because stores are never re-ordered with other stores, nor loads with other loads, unless the special non-temporal memory instructions are used.
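
Continuing the sketch from above, a writer needing only write barriers could look roughly like this (hypothetical code; wmb() is just a stand-in for whatever write-barrier macro one has available):

  #include <stdint.h>

  /* Same hypothetical variables as in the reader sketch above. */
  static volatile uint32_t version_state;    /* low bit = dirty, rest = version */
  static const char *state_ptr;
  static int state_len;

  /* On x86 a compiler barrier is enough as a write barrier; other
     architectures would need a real store-store barrier instruction. */
  #define wmb() __asm__ __volatile__("" ::: "memory")

  void set_state(const char *s, int len)
  {
    version_state= version_state + 1;        /* set the dirty bit */
    wmb();                                   /* flag must be visible before the data changes */
    state_ptr= s;
    state_len= len;
    wmb();                                   /* data must be visible before the flag is cleared */
    version_state= version_state + 1;        /* clear dirty, bump the version */
  }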

But look closer at the code:

static void set_thread_state_v1(const char* state)
  ...
    int state_len= state ? strlen(state) : 0;
    pfs->m_processlist_lock.allocated_to_dirty();
    pfs->m_processlist_state_ptr= state;
    pfs->m_processlist_state_length= state_len;
    pfs->m_processlist_lock.dirty_to_allocated();

So it is storing a pointer to a string along with the length, and it needs to do synchronisation to ensure that the two are consistent with each other. But suppose instead that we store only the pointer. This can be done atomically without _any_ need for memory barriers or other synchronisation (as long as the memory location is naturally aligned). That would seem to be even better than using the (cheaper) write- and read-barriers.
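
Something along these lines, as a rough sketch; it assumes the state strings remain valid and unchanged once set (which appears to be the case for the fixed stage-name strings), so the reader can simply recompute the length:

  #include <string.h>

  /* Hypothetical variant: store only the pointer. A naturally aligned
     pointer store/load is atomic by itself, so a reader can never see a
     half-written value, and no flag, version, or barrier is needed. */
  static const char * volatile processlist_state;

  void set_state(const char *s)
  {
    processlist_state= s;                    /* single atomic store */
  }

  void read_state(const char **ptr, int *len)
  {
    const char *p= processlist_state;        /* single atomic load */
    *ptr= p;
    *len= p ? (int)strlen(p) : 0;            /* recompute the length on the (rare) read */
  }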

To get a better idea of how bad the use of full memory barriers is for performance, I wrote a quick and dirty test program. It has code similar to what is in set_thread_state_v1(), as well as three other variants of it: one using just a write memory barrier, one using no memory barrier (not SMP safe, but useful for comparison), and one using normal pthread mutexes rather than a lock-free algorithm. Here is the time taken for 100,000,000 iterations of the function:

  Variant          Wall clock (s)   Cycles/iteration   Overhead (cycles)
  Original                   3.70                 81                  71
  Write barrier              0.53                 12                   2
  No barrier                 0.45                 10                   0
  Pthread mutex              2.26                 50                  40
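
For reference, the heart of such a test program is just a tight loop around the function under test, roughly like this sketch (my reconstruction of the idea, with only the full-barrier variant shown; this is not the exact program used for the numbers above):

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  /* volatile so the stores are not optimised away inside the loop. */
  static volatile uint32_t version_state;
  static const char * volatile state_ptr;
  static volatile int state_len;

  /* Full-barrier variant, standing in for the original code; the other
     variants differ only in what surrounds the two data stores. */
  static void set_state_full_barrier(const char *s, int len)
  {
    __sync_lock_test_and_set(&version_state, version_state + 1);
    state_ptr= s;
    state_len= len;
    __sync_lock_test_and_set(&version_state, version_state + 1);
  }

  int main(void)
  {
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (long i= 0; i < 100000000L; i++)
      set_state_full_barrier("Sending data", 12);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    double secs= (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) * 1e-9;
    printf("%.2f seconds, %.1f ns/iteration\n", secs, secs * 1e9 / 1e8);
    return 0;
  }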

Note how expensive the full memory barrier version is. The overhead (compared to no barrier) is 35 times larger than the lesser write-barrier. And I suspect that the real overhead is even bigger in real-life than in this synthetic micro-benchmark (for example imagine if we get a cache miss on one of those locked instructions).

Note the sad thing here that the “clever” lock-free algorithm in this case actually makes things slower than just using a simple pthread mutex. The overhead of the original code is about 75% higher than that of the mutex version. This is not really surprising: the main cost of a mutex operation is also just a single full memory barrier, but pthread mutexes have probably been very carefully tuned and optimised by some very knowledgeable people, unlike the ad-hoc performance schema code. Lock-free algorithms require a lot of knowledge and care to get right; this is not simple stuff.

I think there is an important lesson here about our use of synchronisation primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention issues, and various parts are starting to use lock-free algorithms, there does not seem to be a general understanding of the issues around memory barriers among the developers. In fact, to my knowledge there are no general facilities for explicit memory barriers. We have my_atomic.h, which confuses the issues of “atomicity” and “memory barriers” and defines a handful of “atomic” operations that are really locked instructions implying a full memory barrier, as in the example above. As the test program shows, it is really important to distinguish between the different kinds of memory barriers and to understand the performance implications of using each. I have always been really unhappy about my_atomic.h, and I think this is an area in which we need to improve significantly.
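
Even something as simple as the following would be a step up from my_atomic.h (a sketch for GCC on x86-64 only, with hypothetical macro names; real code would need per-platform and per-compiler variants):

  /* Hypothetical explicit-barrier macros, GCC on x86-64 only as an
     illustration. On x86, stores are not re-ordered with other stores and
     loads not with other loads, so the write and read barriers only need
     to restrain the compiler; only the full barrier needs a real mfence
     (or a locked instruction). */
  #define my_write_barrier()  __asm__ __volatile__("" ::: "memory")
  #define my_read_barrier()   __asm__ __volatile__("" ::: "memory")
  #define my_full_barrier()   __asm__ __volatile__("mfence" ::: "memory")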

And who knows, maybe the performance schema developers can take a hint or two from this.

Comments

  1. Thanks a lot for sharing! I believe the locked instructions are created by gcc __sync_ builtins, which, really, are just atomic ops according to the gcc manual.
    Did you look at the latest C++ standard atomic primitives? Maybe they produce better x86_64 assembly?

    1. “atomic op” – to me, this means something is indivisible. So the operation can never be seen from other threads as being half-done. Nor can one part of the operation see a different state of other threads than another part of the operation.

      On most database-grade CPUs, any aligned load or store is atomic automatically. The compare_and_swap is special in that it combines a load and a store in one indivisible operation; I suppose that is the reason for the use of the term “atomic”.

      The __sync_ builtins are in addition _synchronised_ – they do a full memory barrier. This is a different concept from atomicity.

      I did not look at the C++ atomics before, but I took a quick look now. I am not familiar with the acquire/release terminology, but it appears to me that an “acquire” means a read barrier, and a “release” means a write barrier, in which case they look equivalent.

      But they still seem to mix up the operation with the barrier. The reality is that we need the ability to output an “sfence”, “lfence”, or “mfence” instruction. Lock-free stuff is plenty complicated already without adding unnecessary abstractions on top…

  2. The Audit plugins API was originally designed to be relatively lightweight on synchronisation, even though it permits installing and removing plugins at run time; otherwise it would have measurably and negatively impacted performance.
    It is unfortunate that many green CS graduates today are still being taught (by implication) the falsehood that atomic operations have no additional cost.

  3. Hello.

    I might have missed something (I don’t c), but… is your benchmark indeed testing the performance of thread synchronization primitives, running a single thread on a single CPU?

    1. I haven’t noticed any mention of “replication”, so I still believe that the tested topic is the single-thread access to the regular instance of MySQL/MariaDB (at least, that’s what the sysbench --num-threads=1 --test=oltp suggests).

      Well, it seems normal that a regular modern RDBMS is most likely optimized to increase throughput for multi-client access, at the loss of sheer single-threaded performance. Scalability overhead, you know.

      But what catches the eye is the following sentence: Note the sad thing here that the “clever” lock-free algorithm in this case actually makes things slower than just using a simple pthread mutex. From the “quick and dirty test program”, it is hard to tell: is it just the single-thread performance where the “clever lock-free algorithm” fails miserably (so barely anyone should care about it), or does it fail for its target purpose as well, during the high-concurrency access to the shared data?

      1. > is your benchmark indeed testing the performance of thread synchronization
        > primitives, running a single thread on a single CPU?

        Yes. I wanted to get a rough idea how expensive a full memory barrier is, compared to a lesser write barrier, in terms of CPU cycles spent.

        > or does it fail for its target purpose as well, during the high-concurrency
        > access to the shared data?

        Well, what makes you think the target purpose is “high-concurrency access to the shared data”?

        As I wrote, I am not intimately familiar with the performance schema code, but it looks to me as if this code is anything but intended for high-concurrency access. It looks to me as if this is data that can only be written by a single thread; is written very often (once at the beginning of each stage during every query execution); and only read rarely (presumably if the DBA wants to inspect what threads are doing at some point in time).

        Thus, my guess is that the “clever” algorithm here was exactly intended to optimise the single-threaded performance in the common case where there is no concurrent data access, while still preserving the correctness of such concurrent access where it does occur. And I was pointing out a couple of ways that this could be done much better.

        Also note that single-threaded performance is also multi-threaded performance! With multiple threads, each thread is still limited by the speed of CPU execution. And the overhead of synchronisation primitives becomes more important in the multi-threaded case, not less, because of the need to synchronise CPU caches over the shared inter-CPU memory bus.

  4. Hi Kristian,

    disclaimer: i haven’t checked code, only read blog post.

    From what I can see, it looks like an attempted seqlock (http://en.wikipedia.org/wiki/Seqlock).
    I wrote one for ndb, that you can find as NdbSeqLock.hpp (if you’re interested).
    Using that as a base, and falling back to e.g. pthread_rwlock_t should be easy…i think…

    Have you come to the same conclusion ?

    /Jonas

    ps. this is attempt 2 of this comment, the first got rejected as spam (i just discovered)

    1. Jonas, I think you are right that this is similar to a seqlock. And checking the code, indeed one sees that it uses just rmb() / wmb(), not a full barrier. So yes, I agree, the NDB code is good, the performance schema code not so much…

      (And I should move my blog to a better place, but haven’t gotten around to it yet.)

          1. Hi again Kristian,

            I finally took the time to write a small quilt-series for the pfs_lock stuff.
            It can be found on jonasoreland.blogspot.se (i can’t insert a link, or livejournal will consider this post spam).

            You’re most welcome to test and see if it makes any noticeable improvements (i haven’t)…

            /Jonas

  5. Add to this the fact that my_atomic_load32() does a memory write, which may also contribute quite a few cycles.

    IIUC the load can be optimized away if we used atomic_add instead. E.g. allocated_to_dirty() in fact does “m_version_state--” and dirty_to_allocated() something like “m_version_state+= 5”.

    And even that seems to be excessive: all it needs is a single atomic store of the state (either a null-terminated string or a pointer to a str+len structure) + a write barrier.
