{"id":58,"date":"2008-06-27T12:47:00","date_gmt":"2008-06-27T12:47:00","guid":{"rendered":"https:\/\/knielsen-hq.org\/w\/?p=58"},"modified":"2021-09-01T07:14:48","modified_gmt":"2021-09-01T07:14:48","slug":"bit-aligned-storage","status":"publish","type":"post","link":"https:\/\/knielsen-hq.org\/w\/bit-aligned-storage\/","title":{"rendered":"BIT-aligned storage"},"content":{"rendered":"\n<p>I have been working on a set of efficient C++ classes for storing N-bit (N&lt;=64) values at arbitrary <em>bit<\/em> offsets into a buffer. Essentially a way to address memory bit-wise, rather than the usual byte-wise or word-wise. The classes support either fixed-sized bitfield storage (eg. say 27-bit values), or compressed values where small numbers are stored in fewer bits than large numbers (for example 0-15 are stored in 6 bits, 16-2**24 are stored in 26 bits, and so on).<\/p>\n\n\n\n<p>The basic idea is of course to save space when storing large amounts of data eg. for a relational database. Today most systems tend to pad much of their numbers to 32 or 64 or whatever bits, wasting space. This is done to allow efficient access, as it is much faster to access data when it is sized and aligned to one of the word sizes supported by the machine.<\/p>\n\n\n\n<p>Or is it really?<\/p>\n\n\n\n<p>Todays CPUs are so fast compared to I\/O (disk, network) and even to main memory that this may be about to change. In fact my experiments show that already today, there are cases where working with large data sets in-memory is as fast or faster when stored at arbitrary bit offset rather than padded to machine word size. A very interesting, and surprising, result.<\/p>\n\n\n\n<p>For the test, I created a list containing the sizes (in bytes) of each file on my harddisk. So I consider this &#8220;real world&#8221; data. 
The biggest file is &gt;4GB, so this would normally be considered a 64-bit data set.<\/p>\n\n\n\n<p>The reference code loads the data into an array of 2,000,000 64-bit words (the list is repeated as necessary to reach a total of 16MB, which exceeds the CPU L2 cache of 2MB). It measures how long it takes to execute this loop:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">  uint64_t sum= 0;\n  for (uint j= 0; j &lt; 1000; j++)\n  {\n    for (uint i= 0; i &lt; 2000000; i++)\n      sum+= b[i];\n  }\n<\/pre>\n\n\n\n<p>We just load and add up all the values, so the actual data processing here is minimal (just an add instruction), since we want to focus on the cost of loading the data in different ways.<\/p>\n\n\n\n<p>I implemented, optimised, and tested a number of different approaches to reading arbitrary-sized bitfields and compressed numbers. This one consistently performed the best:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">    class pack_ptr {\n      const uint64_t * const base;\n      uint64_t v;\n      uint64_t i;\n      uint64_t bitsleft;\n     public:\n      pack_ptr(const uint64_t *_base) :\n          base(_base), v(0), i(0), bitsleft(0) { }\n      uint64_t getbits(uint64_t numbits)\n      {\n        if (bitsleft &gt;= numbits)\n        {\n          uint64_t r= v &amp; (~(uint64_t)0 &gt;&gt; (64 - numbits));\n          v&gt;&gt;= numbits;\n          bitsleft-= numbits;\n          return r;\n        } else {\n          uint64_t r= v;\n          v= base[i++];\n          r= (r | (v &lt;&lt; bitsleft)) &amp; (~(uint64_t)0 &gt;&gt; (64 - numbits));\n          if (likely(!(numbits == 64 &amp;&amp; bitsleft == 0)))\n            v&gt;&gt;= numbits - bitsleft;\n          else\n            v= 0;\n          bitsleft= bitsleft + (64 - numbits);\n          return r;\n        }\n      }\n      uint64_t unpack1()\n      {\n        uint64_t n= getbits(3);\n        return getbits(n*9+1);\n      }\n    };\n<\/pre>\n\n\n\n<p>The code is pretty straightforward. 
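For orientation, here is a sketch of a hypothetical writer-side counterpart. The real `putbits()` and `pack1()` are only covered in a later posting, so `bit_writer` and all names below are my own illustrative inventions, not part of the prototype; the sketch merely shows one way a buffer that `getbits()`/`unpack1()` can decode might be produced (least-significant bits of each word filled first).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical writer counterpart to pack_ptr (illustrative only).
// Appends values to a stream of 64-bit words, low-order bits first,
// in the layout that getbits()/unpack1() above read back.
struct bit_writer {
  std::vector<uint64_t> buf;
  uint64_t cur;        // partially filled word
  unsigned used;       // number of bits of cur already filled
  bit_writer() : cur(0), used(0) { }
  void putbits(uint64_t val, unsigned numbits) {
    if (numbits < 64)
      val &= ((uint64_t)1 << numbits) - 1;   // keep only the low numbits
    cur |= val << used;                      // used < 64, so the shift is defined
    if (used + numbits >= 64) {
      buf.push_back(cur);
      // Carry the bits that did not fit into the next word; avoid the
      // undefined shift by 64 when used == 0.
      cur = used ? val >> (64 - used) : 0;
    }
    used = (used + numbits) & 63;
  }
  // Compressed format matching unpack1(): 3 bits of SIZE, then SIZE*9+1
  // bits of the value.
  void pack1(uint64_t val) {
    unsigned n = 0;
    while (n < 7 && (val >> (9 * n + 1)) != 0)  // smallest n with val < 2^(9n+1)
      n++;
    putbits(n, 3);
    putbits(val, 9 * n + 1);
  }
  void flush() {                // emit the last partial word, zero-padded
    if (used) { buf.push_back(cur); cur = 0; used = 0; }
  }
};
```

A `pack_ptr` constructed on `buf.data()` would then return the same values, in order, via `getbits()`/`unpack1()`.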
The main tricky thing is to ensure that we never execute a shift by 64 bits, as that is undefined behaviour in C\/C++. The condition <code>numbits == 64<\/code> is not strictly necessary, but it makes it possible for the compiler to eliminate the conditional completely when the value of <code>numbits<\/code> is fixed, and reduces the risk of CPU misprediction of the conditional when it cannot be eliminated. The compressed values here are stored as an initial 3 bits of SIZE, followed by SIZE*9 + 1 bits of the real value. So 0\/1 will be stored in 4 bits, 2-1023 will be stored in 13 bits, and so on. The test number list is compressed down to 29% of the original 16MB (but still does not fit in the CPU cache).<\/p>\n\n\n\n<p>Here then are the loops for fixed-width bitfields and compressed values:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">  uint64_t sum= 0;\n  for (uint j= 1000; j &gt; 0; j--)\n  {\n    pack_ptr p(base);\n    for (uint i= 2000000; i &gt; 0; i--)\n      sum+= p.getbits(33);\n  }\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">  uint64_t sum= 0;\n  for (uint j= 1000; j &gt; 0; j--)\n  {\n    pack_ptr p(base);\n    for (uint i= 2000000; i &gt; 0; i--)\n      sum+= p.unpack1();\n  }\n<\/pre>\n\n\n\n<p>And here are the results, times in seconds for 1,000 loops of adding 2,000,000 numbers (a total of 2 billion values processed):<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><th>&nbsp;<\/th><th>Aligned 64-bit words<\/th><th>Fixed-size bitfield<\/th><th>Packed 
values<\/th><\/tr><tr><th>Laptop&nbsp;(Core&nbsp;2&nbsp;Duo@2.4GHz)<\/th><td>4.32s<\/td><td>4.51s<\/td><td>10.38s<\/td><\/tr><tr><th>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8211;&nbsp;with&nbsp;two&nbsp;threads<\/th><td>7.93s<\/td><td>5.19s<\/td><td>10.48s<\/td><\/tr><tr><th>Desktop&nbsp;(Core&nbsp;2&nbsp;Quad@2.4GHz)<\/th><td>2.96s<\/td><td>4.32s<\/td><td>10.27s<\/td><\/tr><tr><th>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8211;&nbsp;with&nbsp;two&nbsp;threads<\/th><td>4.53s<\/td><td>4.38s<\/td><td>10.27s<\/td><\/tr><tr><th>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#8211;&nbsp;with&nbsp;four&nbsp;threads<\/th><td>9.63s<\/td><td>5.21s<\/td><td>10.36s<\/td><\/tr><\/tbody><\/table><figcaption>Time to process list (lower is better)<\/figcaption><\/figure>\n\n\n\n<p>The test is run on two machines, one a dual-core laptop and the other a quad-core desktop machine. As can be seen, the desktop has a somewhat better memory system, so the maximum throughput is higher. On the laptop, even with only one thread, the fixed-size bitfield is only 5% slower than aligned 64-bit words. And with two threads it is faster (on both machines), since main memory bandwidth becomes the bottleneck. On the quad-core with all 4 cores running, even the packed values are only 8% slower than aligned 64-bit access!<\/p>\n\n\n\n<p>I find the numbers very interesting. Even when reading from main memory, bit-packing data like this has very modest overhead (sometimes it is even faster), and very respectable compression ratios (down to 52% and 29% of the original size, respectively). For sending the data to disk or over the network, bit-aligned access can be a big win. And as CPU speeds continue to improve faster than memory and I\/O speeds, this will only get more interesting (6- and 8-core commodity CPUs are on their way).<\/p>\n\n\n\n<p>The code (still a somewhat rough prototype) is available for anyone interested on <a href=\"http:\/\/repo.or.cz\/w\/beedb.git\">repo.or.cz<\/a>. 
It is licensed under the GPL.<\/p>\n\n\n\n<p>The code generated by GCC 4.2.3 for the fixed-size bitfield loop is quite illuminating:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">.L142:\n        movq    %r12, %r8\n        subq    $33, %r10\n        shrq    $33, %r12\n        andq    %rdi, %r8\n        addq    %r8, %rdx\n        subl    $1, %r9d\n        je      .L133\n.L135:\n        cmpq    $32, %r10\n        ja      .L142\n\n        movq    (%rbx,%r11,8), %rax\n        movl    %r10d, %ecx\n        addq    $1, %r11\n        movq    %rax, %r8\n        salq    %cl, %r8\n        movl    $33, %ecx\n        orq     %r12, %r8\n        subl    %r10d, %ecx\n        movq    %rax, %r12\n        andq    %rdi, %r8\n        shrq    %cl, %r12\n        addq    $31, %r10\n        addq    %r8, %rdx\n        subl    $1, %r9d\n        jne     .L135\n<\/pre>\n\n\n\n<p>(Note that GCC puts the loop entry at <code>.L135<\/code>.) That is just 9 instructions for the common case where the value to be fetched fits in a single 64-bit word, and 17 instructions for the case where the next word needs to be fetched.<\/p>\n\n\n\n<p>The CPU (Intel Core 2 Duo) is able to extract and process one fixed-size bitfield in around 5 cycles, and one packed value in around 13 cycles. For a 2.4GHz CPU using 64-bit values, that is 3.84 and 1.48 GByte\/sec respectively <em>per CPU core<\/em>, or 15.36 \/ 5.91 GByte\/sec on a quad-core Q6600 CPU. That is approaching or exceeding the bandwidth of many memory subsystems even in modern computers. 
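Those bandwidth figures follow directly from the cycle counts. As a quick sanity check, the conversion can be sketched like this (a sketch under the stated model: one 8-byte value decoded every N cycles per core; the function name is mine, not from the prototype):

```cpp
#include <cassert>
#include <cmath>

// Convert "cycles per decoded 64-bit value" into GByte/sec of decoded
// output, assuming one 8-byte value every cycles_per_value cycles per core.
// clock_ghz * 1e9 cycles/sec / cycles_per_value values/sec * 8 bytes,
// divided back down to GByte/sec.
double decode_gbyte_per_sec(double clock_ghz, double cycles_per_value,
                            int cores) {
  return clock_ghz * 8.0 * cores / cycles_per_value;
}
```

With 2.4 GHz and 5 cycles per bitfield this gives 3.84 GByte/sec per core, and 13 cycles per packed value gives about 1.48 GByte/sec, matching the figures above.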
I am very impressed with the technology of both GCC and Intel Core 2 Duo\/Quad here, the former generating near-perfect code and the latter executing around 2.5 instructions per clock on average in my benchmarks (that is 10 billion instructions per second total on the quad-core costing &lt;200 euro!).<\/p>\n\n\n\n<p>In a later posting I plan to describe the <code>putbits()<\/code> and <code>pack1()<\/code> methods, and investigate the performance of several different alternative implementations (initial tests show that <code>putbits()<\/code> and <code>pack1()<\/code> are on the same level performance-wise as <code>getbits()<\/code> and <code>unpack1()<\/code>).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have been working on a set of efficient C++ classes for storing N-bit (N&lt;=64) values at arbitrary bit offsets into a buffer. Essentially a way to address memory bit-wise, rather than the usual byte-wise or word-wise. The classes support either fixed-size bitfield storage (e.g. 
say 27-bit values), or compressed values where small numbers are&hellip; <a class=\"more-link\" href=\"https:\/\/knielsen-hq.org\/w\/bit-aligned-storage\/\">Continue reading <span class=\"screen-reader-text\">BIT-aligned storage<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[6],"_links":{"self":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts\/58"}],"collection":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/comments?post=58"}],"version-history":[{"count":4,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts\/58\/revisions"}],"predecessor-version":[{"id":62,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/posts\/58\/revisions\/62"}],"wp:attachment":[{"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/media?parent=58"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/categories?post=58"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/knielsen-hq.org\/w\/wp-json\/wp\/v2\/tags?post=58"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}