Dynamic linking costs two cycles

It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU cycles per cross-module call. I usually take forever to get to the point in my writing, so I thought I would change this for once 🙂

In MySQL, there has been a historical tendency to favour static linking, in
part because to avoid the overhead (in execution efficiency) associated with
dynamic linking. However, on modern systems there are also very serious
drawbacks when using static linking.

The particular issue that inspired this article is that I was working on MWL#74, building a proper shared libmysqld.so library for the MariaDB embedded server. The lack of a proper libmysqld.so in MySQL and MariaDB has caused no end of grief for packaging Amarok for the various Linux distributions. My patch increases the amount of dynamic linking (in a default build), so I did a quick test to get an idea of the overhead of this.

ELF dynamic linking

The overhead comes from the way dynamic linking works in ELF-based systems like Linux (and many other POSIX-like operating systems). Code in shared libraries must be compiled to be position-independent, achieved with the -fPIC compiler switch. This allows the loader to simply mmap() the image of a shared library into the process address space at whatever free memory space is available, and the code can run without any need for the loader to do any kind of relocations of the code. For a much more detailed explanation see for example this article.

When generating position-independent code for a function call into another shared object, the compiler cannot generate a simple absolute call instruction, as the destination address is not known until run-time. Instead, the call goes via an indirect jump is generated, fetching the destination address from a table called the PLT, short for Procedure Linkage Table. For example:

                       callq  0x400680 <mylib_myfunc@plt>)
...
<mylib_myfunc@plt>:    jmpq   *0x200582(%rip)

The indirect jump resolves at runtime into the address of the real function to be called, so that is the overhead of the call when using dynamic linking: one indirect jump instruction.

Micro-benchmarking

To measure this one-instruction overhead in terms of execution time, I used
the following code:

    for (i= 0; i < count; i++)
      v= mylib_myfunc(v);

The function mylib_myfunc() is placed in a library, with the
following code:

    int mylib_myfunc(int v) {return v+1;}

I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz machine running Linux amd64. Here are the results from running the loop for 1,000,000,000 (one billion) operations:

 total time (sec.)CPU cycles/iteration
Static linking2.546
Dynamic linking3.388

So that is the two CPU cycles of overhead per call that I referred to at the start of this post.

Incidentally, if you try stepping through the call with a debugger, you will see a much larger overhead for the very first call. Do not be fooled by this, this is just because the loader fills in the PLT lazily, computing the correct address of the destination only on the first time the call is made (so addresses of functions that are never called by a process need never be calculated). See above-referenced article for more details.

(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar, but the actual overhead may be somewhat larger, since that architecture lacks program-counter-relative addressing and so must reserve one register %ebx (out of its already quite limited register bank) for this purpose. I did not measure the 32-bit case, I think it is of little interest nowadays for high-performance MySQL or MariaDB deployments (and the overhead of function calls on x86 32-bit is significantly higher anyway, dynamic linking or not, due to the need to push and pop all arguments to/from the stack)).

Conclusion

Two cycles per call is, in my opinion, a very modest overhead. It is hard to imagine high-performance code where this will have a real-life noticeable effect. Modern systems rely heavily on dynamic linking, and static linking is nowadays causing much more problems that it solves. And I think it is also time to put the efficiency argument for static linking to rest.

3 comments

  1. You never got around to elaborating on this part:

    “However, on modern systems there are also very serious drawbacks when using static linking”

    P.S. please don’t talk about packaging advantages — it’s bullshit.

    1. It is mostly mixing of static and dynamic linking that is a problem.

      One big issue is when you link library A statically, and also link library B
      dynamically, and B itself uses A. Then you end up with parts of A from the
      static linking of the program and other parts of A from the dynamic linking of
      B. This will cause a lot of grief unless the two versions of A are exactly the
      same.

      If the program is _completely_ statically linked then this problem does not
      occur, but that is becoming increasingly difficult to ensure.

      Just doing a hostname lookup with glibc will cause dynamic loading of
      different resolver libraries, which will crash if the static version of libc
      is different from the dynamic one. MySQL/MariaDB, like many other larger
      applications, support dynamic loading of plugins, which involves dynamic
      linking.

Leave a Reply to (Anonymous) Cancel reply

Your email address will not be published. Required fields are marked *