{"id":157,"date":"2010-10-03T18:55:00","date_gmt":"2010-10-03T18:55:00","guid":{"rendered":"https:\/\/knielsen-hq.org\/w\/?p=157"},"modified":"2021-09-01T14:40:43","modified_gmt":"2021-09-01T14:40:43","slug":"dynamic-linking-costs-two-cycles","status":"publish","type":"post","link":"https:\/\/knielsen-hq.org\/w\/dynamic-linking-costs-two-cycles\/","title":{"rendered":"Dynamic linking costs two cycles"},"content":{"rendered":"\n<p>It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU cycles per cross-module call. I usually take forever to get to the point in my writing, so I thought I would change this for once \ud83d\ude42<\/p>\n\n\n\n<p>In MySQL, there has been a historical tendency to favour static linking, in<br>part because to avoid the overhead (in execution efficiency) associated with<br>dynamic linking. However, on modern systems there are also very serious<br>drawbacks when using static linking.<\/p>\n\n\n\n<p>The particular issue that inspired this article is that I was working on <a href=\"http:\/\/askmonty.org\/worklog\/Server-Sprint\/?tid=74\">MWL#74<\/a>, building a proper shared <code>libmysqld.so<\/code> library for the MariaDB embedded server. The lack of a proper <code>libmysqld.so<\/code> in MySQL and MariaDB has caused no end of grief for packaging <a href=\"http:\/\/amarok.kde.org\/\">Amarok<\/a> for the various Linux distributions. My patch increases the amount of dynamic linking (in a default build), so I did a quick test to get an idea of the overhead of this.<\/p>\n\n\n\n<h2>ELF dynamic linking<\/h2>\n\n\n\n<p>The overhead comes from the way dynamic linking works in ELF-based systems like Linux (and many other POSIX-like operating systems). Code in shared libraries must be compiled to be position-independent, achieved with the <code>-fPIC<\/code> compiler switch. This allows the loader to simply <code>mmap()<\/code> the image of a shared library into the process address space at whatever free memory space is available, and the code can run without any need for the loader to do any kind of relocations of the code. For a much more detailed explanation see for example <a href=\"http:\/\/www.symantec.com\/connect\/es\/articles\/dynamic-linking-linux-and-windows-part-one\">this article<\/a>.<\/p>\n\n\n\n<p>When generating position-independent code for a function call into another shared object, the compiler cannot generate a simple absolute <code>call<\/code> instruction, as the destination address is not known until run-time. Instead, the call goes via an indirect jump is generated, fetching the destination address from a table called the PLT, short for Procedure Linkage Table. For example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">                       callq  0x400680 &lt;mylib_myfunc@plt&gt;)\n...\n&lt;mylib_myfunc@plt&gt;:    jmpq   *0x200582(%rip)\n<\/pre>\n\n\n\n<p>The indirect jump resolves at runtime into the address of the real function to be called, so that is the overhead of the call when using dynamic linking: one indirect jump instruction.<\/p>\n\n\n\n<h2>Micro-benchmarking<\/h2>\n\n\n\n<p>To measure this one-instruction overhead in terms of execution time, I used<br>the following code:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">    for (i= 0; i &lt; count; i++)\n      v= mylib_myfunc(v);<\/pre>\n\n\n\n<p>The function <code>mylib_myfunc()<\/code> is placed in a library, with the<br>following code:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">    int mylib_myfunc(int v) {return v+1;}<\/pre>\n\n\n\n<p>I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz machine running Linux amd64. Here are the results from running the loop for 1,000,000,000 (one billion) operations:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><th>&nbsp;<\/th><th>total time (sec.)<\/th><th>CPU cycles\/iteration<\/th><\/tr><tr><th>Static linking<\/th><td>2.54<\/td><td>6<\/td><\/tr><tr><th>Dynamic linking<\/th><td>3.38<\/td><td>8<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>So that is the two CPU cycles of overhead per call that I referred to at the start of this post.<\/p>\n\n\n\n<p>Incidentally, if you try stepping through the call with a debugger, you will see a much larger overhead for the very first call. Do not be fooled by this, this is just because the loader fills in the PLT lazily, computing the correct address of the destination only on the first time the call is made (so addresses of functions that are never called by a process need never be calculated). See above-referenced article for more details.<\/p>\n\n\n\n<p>(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar, but the actual overhead may be somewhat larger, since that architecture lacks program-counter-relative addressing and so must reserve one register <code>%ebx<\/code> (out of its already quite limited register bank) for this purpose. I did not measure the 32-bit case, I think it is of little interest nowadays for high-performance MySQL or MariaDB deployments (and the overhead of function calls on x86 32-bit is significantly higher anyway, dynamic linking or not, due to the need to push and pop all arguments to\/from the stack)).<\/p>\n\n\n\n<h2>Conclusion<\/h2>\n\n\n\n<p>Two cycles per call is, in my opinion, a very modest overhead. It is hard to imagine high-performance code where this will have a real-life noticeable effect. Modern systems rely heavily on dynamic linking, and static linking is nowadays causing much more problems that it solves. And I think it is also time to put the efficiency argument for static linking to rest.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU cycles per cross-module call. I usually take forever to get to the point in my writing, so I thought I would change this for once \ud83d\ude42 In MySQL, there has been a historical tendency to favour static linking, inpart because&hellip; <a class=\"more-link\" href=\"https:\/\/knielsen-hq.org\/w\/dynamic-linking-costs-two-cycles\/\">Continue reading <span class=\"screen-reader-text\">Dynamic linking costs two cycles<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[3,35,4,46,6],"_links":{"self":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts\/157"}],"collection":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/comments?post=157"}],"version-history":[{"count":3,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts\/157\/revisions"}],"predecessor-version":[{"id":160,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts\/157\/revisions\/160"}],"wp:attachment":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/media?parent=157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/categories?post=157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/tags?post=157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}