BIT-aligned storage

I have been working on a set of efficient C++ classes for storing N-bit (N<=64) values at arbitrary bit offsets into a buffer. Essentially a way to address memory bit-wise, rather than the usual byte-wise or word-wise. The classes support either fixed-size bitfield storage (e.g. 27-bit values), or compressed values where small numbers are stored in fewer bits than large numbers (for example 0-15 are stored in 6 bits, 16-2**24 are stored in 26 bits, and so on).
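As an illustration of that example encoding, here is my own sketch of how the size of a value could be computed: a 2-bit length tag fits the 6- and 26-bit figures quoted above, but that tag width (and the tiers beyond the two stated ones) is my assumption; the format the classes actually use is described further down in the post.

    #include <stdint.h>

    /* Illustrative only: encoded size in bits for the example tiers quoted
       above, assuming a 2-bit length tag in front of each value.  The tiers
       beyond the two stated ones are left open in the text ("and so on");
       this sketch simply falls back to a full 64-bit payload for them. */
    static unsigned example_encoded_bits(uint64_t x)
    {
      if (x < 16)                       /* tag + 4-bit payload  =  6 bits */
        return 6;
      if (x < ((uint64_t)1 << 24))      /* tag + 24-bit payload = 26 bits */
        return 26;
      return 2 + 64;                    /* assumed fallback tier */
    }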

The basic idea is of course to save space when storing large amounts of data, e.g. for a relational database. Today most systems tend to pad most of their numbers to 32 or 64 bits (or whatever the native width is), wasting space. This is done to allow efficient access, as it is much faster to access data when it is sized and aligned to one of the word sizes supported by the machine.

Or is it really?

Today's CPUs are so fast compared to I/O (disk, network), and even to main memory, that this may be about to change. In fact, my experiments show that already today there are cases where working with large data sets in memory is as fast or faster when the data is stored at arbitrary bit offsets rather than padded to the machine word size. A very interesting, and surprising, result.

For the test, I created a list containing the sizes (in bytes) of each file on my hard disk, so I consider this “real world” data. The biggest file is >4GB, so this would normally be considered a 64-bit data set.

The reference code loads the data into an array of 2,000,000 64-bit words (the list is repeated as necessary to reach a total of 16MB, which exceeds the CPU L2 cache of 2MB). It measures how long it takes to execute this loop:

  uint64_t sum= 0;
  for (uint j= 0; j < 1000; j++)
  {
    for (uint i= 0; i < 2000000; i++)
      sum+= b[i];
  }

We just load and add up all the values, so the actual data processing here is minimal (just an add instruction), since we want to focus on the cost of loading the data in different ways.
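The post does not show how the array b is prepared. A minimal sketch, assuming the file-size list is simply cycled until all 2,000,000 entries are filled (the name build_reference is mine, not the author's):

    #include <stdint.h>
    #include <vector>

    /* Sketch only: build the 16MB reference array by cycling through the
       list of file sizes until all 2,000,000 entries are filled. */
    static std::vector<uint64_t> build_reference(const std::vector<uint64_t> &file_sizes)
    {
      std::vector<uint64_t> b(2000000);        /* 2,000,000 * 8 bytes = 16MB */
      for (uint64_t i= 0; i < b.size(); i++)
        b[i]= file_sizes[i % file_sizes.size()];
      return b;
    }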

I implemented, optimised, and tested a number of different ways of reading arbitrary-sized bitfields and compressed numbers. This one consistently performed the best:

    class pack_ptr {
      const uint64_t * const base;    // start of the bit-packed buffer
      uint64_t v;                     // remaining (already right-shifted) bits of the current word
      uint64_t i;                     // index of the next word to load from base
      uint64_t bitsleft;              // number of valid bits left in v
     public:
      pack_ptr(const uint64_t *_base) :
          base(_base), v(0), i(0), bitsleft(0) { }
      uint64_t getbits(uint64_t numbits)
      {
        if (bitsleft >= numbits)
        {
          /* The requested bits are all in the current word. */
          uint64_t r= v & (~(uint64_t)0 >> (64 - numbits));
          v>>= numbits;
          bitsleft-= numbits;
          return r;
        } else {
          /* The value straddles a word boundary; load the next word. */
          uint64_t r= v;
          v= base[i++];
          r= (r | (v << bitsleft)) & (~(uint64_t)0 >> (64 - numbits));
          if (likely(!(numbits == 64 && bitsleft == 0)))
            v>>= numbits - bitsleft;
          else
            v= 0;                     /* avoid the undefined shift by 64 bits */
          bitsleft= bitsleft + (64 - numbits);
          return r;
        }
      }
      uint64_t unpack1()
      {
        /* Compressed value: 3 bits of SIZE, then SIZE*9+1 bits of payload. */
        uint64_t n= getbits(3);
        return getbits(n*9+1);
      }
    };

The code is pretty straightforward. The main tricky thing is to ensure that we never execute a shift by 64 bits, as that is undefined in C/C++. The condition numbits == 64 is not strictly necessary, but it makes it possible for the compiler to eliminate the conditional completely when the value of numbits is fixed, and it reduces the risk of CPU branch misprediction when the conditional cannot be eliminated. The compressed values here are stored as an initial 3 bits of SIZE, followed by SIZE*9 + 1 bits of the real value. So 0/1 will be stored in 4 bits, 2-1023 will be stored in 13 bits, and so on. The test number list is compressed down to 29% of the original 16MB (but still does not fit in the CPU cache).
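The writing side is not shown here (putbits() and pack1() are the subject of a later posting, see below). For readers who want to feed the decoder something, here is a minimal sketch of a compatible writer that appends values LSB-first into consecutive 64-bit words, which is the layout getbits() above expects. It is my own sketch, not necessarily how the real putbits()/pack1() are implemented:

    #include <stdint.h>
    #include <vector>

    /* Minimal sketch of a writer compatible with pack_ptr above: values are
       appended LSB-first into consecutive 64-bit words.  Not necessarily the
       author's putbits()/pack1(); just one straightforward way to produce a
       buffer that getbits()/unpack1() can read back. */
    class pack_writer {
      std::vector<uint64_t> buf;
      uint64_t cur;       /* partially filled word, low bits first */
      uint64_t used;      /* number of bits already used in cur (0..63) */
     public:
      pack_writer() : cur(0), used(0) { }
      void putbits(uint64_t val, uint64_t numbits)     /* numbits in 1..64 */
      {
        if (numbits < 64)
          val&= ~(uint64_t)0 >> (64 - numbits);        /* keep only numbits bits */
        cur|= val << used;
        if (used + numbits >= 64)
        {
          buf.push_back(cur);
          cur= used ? val >> (64 - used) : 0;          /* carry the overflow bits */
          used= used + numbits - 64;
        }
        else
          used+= numbits;
      }
      void pack1(uint64_t x)   /* 3 bits of SIZE, then SIZE*9+1 bits of value */
      {
        uint64_t n= 0;
        while (n < 7 && (x >> (n*9 + 1)) != 0)
          n++;
        putbits(n, 3);
        putbits(x, n*9 + 1);
      }
      const uint64_t *finish() /* flush; pointer stays valid while the writer lives */
      {
        buf.push_back(cur);    /* also leaves a padding word for the reader */
        cur= 0;
        used= 0;
        return &buf[0];
      }
    };

With this, a series of w.pack1(x) calls followed by a final w.finish() produce a buffer that the unpack1() loop below can read back (0/1 end up in 4 bits, 2-1023 in 13 bits, matching the description above).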

Here then are the loops for fixed-width bitfields and compressed values:

  /* Fixed-size bitfields (33 bits per value): */
  uint64_t sum= 0;
  for (uint j= 1000; j > 0; j--)
  {
    pack_ptr p(base);
    for (uint i= 2000000; i > 0; i--)
      sum+= p.getbits(33);
  }

  /* Packed (compressed) values: */
  uint64_t sum= 0;
  for (uint j= 1000; j > 0; j--)
  {
    pack_ptr p(base);
    for (uint i= 2000000; i > 0; i--)
      sum+= p.unpack1();
  }

And here are the results, times in seconds for 1,000 loops of adding 2,000,000 numbers (a total of 2 billion values processed):

                                   Aligned 64-bit words   Fixed-size bitfield   Packed values
  Laptop (Core 2 Duo @ 2.4GHz)            4.32s                  4.51s              10.38s
       – with two threads                 7.93s                  5.19s              10.48s
  Desktop (Core 2 Quad @ 2.4GHz)          2.96s                  4.32s              10.27s
       – with two threads                 4.53s                  4.38s              10.27s
       – with four threads                9.63s                  5.21s              10.36s

  Time to process list (lower is better)

The test is run on two machines, one a dual-core laptop and the other a quad-core desktop machine. As can be seen, the desktop has a somewhat better memory system, so the maximum throughput is higher. On the laptop, even with only one thread the fixed-size bitfield is only 5% slower than aligned 64-bit words. And with two threads it is faster (on both machines) since main memory bandwidth becomes the bottleneck. On the quad-core with all 4 cores running, even the packed values are only 8% slower than aligned 64-bit access!

I find the numbers very interesting. Even when reading from main memory, bit-packing data like this has very modest overhead (sometimes it is even faster), and gives very respectable compression ratios (down to 52% and 29% of the original size, respectively). For sending the data to disk or over the network, bit-aligned access can be a big win. And as CPU speeds continue to improve faster than memory and I/O speeds, this will only get more interesting (6- and 8-core commodity CPUs are on their way).

The code (still a somewhat rough prototype) is available for anyone interested on repo.or.cz. It is licensed under the GPL.

The code generated by GCC 4.2.3 for the fixed-size bitfield loop is quite illuminating:

.L142:
        movq    %r12, %r8
        subq    $33, %r10
        shrq    $33, %r12
        andq    %rdi, %r8
        addq    %r8, %rdx
        subl    $1, %r9d
        je      .L133
.L135:
        cmpq    $32, %r10
        ja      .L142

        movq    (%rbx,%r11,8), %rax
        movl    %r10d, %ecx
        addq    $1, %r11
        movq    %rax, %r8
        salq    %cl, %r8
        movl    $33, %ecx
        orq     %r12, %r8
        subl    %r10d, %ecx
        movq    %rax, %r12
        andq    %rdi, %r8
        shrq    %cl, %r12
        addq    $31, %r10
        addq    %r8, %rdx
        subl    $1, %r9d
        jne     .L135

(Note that GCC puts the loop entry at .L135.) That is just 9 instructions for the common case where the value to be fetched fits in a single 64-bit word, and 17 instructions for the case where the next word needs to be fetched.

The CPU (Intel Core 2 Duo) is able to extract and process one fixed-size bitfield in around 5 cycles, and one packed value in around 13 cycles. For a 2.4GHz CPU using 64-bit values, that is 3.84 and 1.48 GByte/sec per CPU core respectively, or 15.36 / 5.91 GByte/sec on a quad-core Q6600 CPU. That approaches or exceeds the bandwidth of many memory subsystems even in modern computers. I am very impressed with the technology of both GCC and the Intel Core 2 Duo/Quad here, the former generating near-perfect code and the latter executing around 2.5 instructions per clock on average in my benchmarks (that is 10 billion instructions per second total on a quad-core costing <200 euro!).
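A quick back-of-the-envelope check of those per-core figures, using only the numbers quoted above:

    #include <stdio.h>

    /* Each decoded value stands for one 64-bit (8-byte) quantity of the
       original data, so throughput per core is clock / cycles-per-value * 8. */
    int main()
    {
      double clock_hz= 2.4e9;
      printf("fixed-size bitfield: %.2f GByte/sec per core\n",
             clock_hz / 5 * 8 / 1e9);     /* ~3.84 */
      printf("packed values:       %.2f GByte/sec per core\n",
             clock_hz / 13 * 8 / 1e9);    /* ~1.48 */
      return 0;
    }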

In a later posting I plan to describe the putbits() and pack1() methods, and to investigate the performance of several different alternative implementations (initial tests show that putbits() and pack1() perform at about the same level as getbits() and unpack1()).

3 comments

  1. Aligned data

    Thanks for publishing such an article.
    This is the first time I am seeing one like it.
    I have a question here.
    From your comparison table it looks like the aligned 64-bit words show almost linear scalability against the number of threads, while the packed data does not scale well with the number of threads.
    Is that so?

    1. Re: Aligned data

      Each number in the table is the wall-clock time for N threads running in parallel to process N*2e9 elements.

      So linear scalability is indicated by having the same number for N and 2*N threads. No scalability is indicated by having twice as long time for 2*N threads as for N threads.

      I realise now that this is highly confusing. Sorry!

      So in terms of scalability, actually the numbers show that processing unaligned data scales well, while aligned scales poorly.

      But in fact the numbers are about memory bandwidth bottleneck rather than scalability. The aligned data is bigger, so it hits the bottleneck at fewer elements/second. The unaligned data is smaller, so can go to higher elements/second before hitting the bottleneck.

      The interesting thing here is how many instructions we can spend per word of data (for packed values it’s like 30-40 instructions, including several conditional branches!), and still be within the range of memory bandwidth limits. Instruction execution has just become amazingly cheap compared to memory accesses on modern high-performance CPUs.

      1. Re: Aligned data

        Thank you very much!
        Very well explained. I am realizing the truth:
        “a small increase in code size is OK compared to memory fetches”.
        Probably we also have to test the case where the code itself will not fit into L2.

        It is beautiful the way you tested and even analyzed the code GCC generated.
        Thanks a lot for this article.
