(This is a long one, but it does have a point…)
This all starts with the way RubLog does searching (but it isn’t an article about searching, or about bit twiddling. Trust me). Because I eventually want to use cosine-based comparisons to find similar articles, I build vectors mapping word occurrences to each document in the blog. I basically end up with a set of 1,200-1,500 bits for each document. To implement a basic ranked search, I needed to be able to perform a bitwise AND between each of these vectors and a vector containing the search terms, and then count the number of one-bits in the result.
I had a spare 45 minutes last night (I took Zachary to his karate lesson, but there was no room in the parent’s viewing area), so I thought I’d play with this. First I tried storing the bit vectors different ways: it turns out (to my surprise) that storing them as an Array of Fixnums has almost exactly the same speed as storing them as the bits in a Bignum. (Also, rather pleasingly, changing between the two representations involves altering just two lines of code). Because it marshals far more compactly, I decided to stick with the Bignum.
Then I had fun playing with the bit counting itself. The existing code uses the obvious algorithm:
max_bit.times { |i| count += word[i] }
Just how slow was this? I wrote a simple testbed that generated one hundred 1,000 bit vectors, each with 20 random bits set, and timed the loop. For all 100, it took about .4s.
Then I tried using the ‘divide-and-conquer’ counting algorithm from chapter 5 of Hacker’s Delight (modified to deal with chunking the Bignum into 30-bit words).
((max_bit+29)/30).times do |offset|
x = (word >> (offset*30)) & 0x3fffffff
x = x - ((x >> 1) & 0x55555555)
x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
x = (x + (x >> 4)) & 0x0f0f0f0f;
x = x + (x >> 8)
x = x + (x >> 16)
count += x & 0x3f
end
This ran about 2.5 times faster that the naive algorithm.
Then I realized that when I was comparing a set of search times with just a few words in it, the bit vector would be very sparse. Each chunk in the loop above would be likely to be zero. So I added a test:
((max_bit+29)/30).times do |offset|
x = (word >> (offset*30)) & 0x3fffffff
next if x.zero?
x = x - ((x >> 1) & 0x55555555)
x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
x = (x + (x >> 4)) & 0x0f0f0f0f;
x = x + (x >> 8)
x = x + (x >> 16)
count += x & 0x3f
end
Now I was seven times faster than the original. But my testbed was using vectors containing 20 set bits (out of 1,000). I changed it to generate timings with vectors containing 1, 2, 5, 10, 20, 100, and 900 set bits. This got even better: I was up to a factor of 15 faster with 1 or 2 bits in the vector.
But if I could speed things up this much by eliminating zero chunks in the bit-twiddling algorithm could I do the same in the simple counting algorithm? I tried a third algorithm:
((max_bit+29)/30).times do |offset|
x = (word >> (offset*30)) # & 0x3fffffff
next if x.zero?
30.times {|i| count += x[i]}
end
The inner loop here is the same as for the original count, but I now count the bits in chunks of 30.
This code performs identically to the bit-twiddling code for 1 set bit, and only slightly worse for 2 set bits. However. once the number of bits starts to grow (past about 5 for my given chunk size), the performance starts to tail off. At 100 bits it’s slower than the original naive count.
So for my particular application, I could probably chose either of the chunked counting algorithms. Because the bit twiddling one scales up to larger numbers of bits, and because I’ll need that later on if I ever start doing the cosine-based matching of documents, I went with it.
So what’s the point of all this?
Yesterday I posted a blog entry about the importance of verbs. It said "Often the true value of a thing isn’t the thing itself, but instead is the activity that created it." Chad Fowler picked this up and wrote a wonderful piece showing how this was true for musicians. And Brian Marick picked up of Chad’s piece to emphasize the value of practice when learning a creative process.
At the same time, Andy and I had been discussing a set of music tapes he had. Originally designed to help musicians practice scales and arpeggios, these had been so popular that they now encompassed a whole spectrum of practice techniques. We were bemoaning the fact that it seemed unlikely that we’d be able to get developers to do the same: to buy some aid to help them practice programming. We just felt that practicing was not something that programmers did.
Skip forward to this morning. In the shower, I got to thinking about this, and realized that my little 45 minute exploration of bit counting was actually a practice session. I wasn’t really worried about the performance of bit counting in the blog’s search algorithm; in reality it comes back pretty much instantaneously. Instead, I just wanted to play with some code and experiment with a technique I hadn’t used before. I did it in a simple, controlled environment, and I tried many different variations (more than I’ve listed here). And I’ve still got some more playing to do: I want to mess around with the effect of varying the chunk size, and I want to see if any of the other bit counting algorithms perform faster.
What made this a practice session? Well, I had some time without interruptions. I had a simple thing I wanted to try, and I tried it many times. I looked for feedback each time so I could work to improve. There was no pressure: the code was effectively throwaway. It was fun: I kept making small steps forward, which motivated me to continue. Finally, I came out of it knowing more than when I went in.
Ultimately it was having the free time that allowed me to practice. If the pressure was on, if there was a deadline to delivery the blog search functionality, then the existing performance would have been acceptable, and the practice would never have taken place. But those 45 pressure-free minutes let me play.
So how can we do this in the real world? How can we help developers do the practicing that’s clearly necessary in any creative process? I don’t know, but my gut tells me we need to do two main things.
The first is to take the pressure off every now and then. Provide a temporal oasis where it’s OK not to worry about some approaching deadline. It has to be acceptable to relax, because if you aren’t relaxed you’re not going to learn from the practice.
The second is to help folks learn how to play with code: how to make mistakes, how to improvise, how to reflect, and how to measure. This is hard: we’re trained to try to do things right, to play to the score, rather than improvise. I suspect that success here comes only after doing.
So, my challenge for the day: see if you can carve out 45 to 60 minutes to play with a small piece of code. You don’t necessarily have to look at performance: perhaps you could play with the structure, or the memory use, or the interface. In the end it doesn’t matter. Experiment, measure, improve.
Practice.
I like your idea of code Kata. It is something I been doing a lot of lately.
Your bit counting code problem gets used a lot in interview questions. Also it got discussed in Jon Bentley's great book "Programing Perls".
Anyway there is way to get a 10x speed up from your suggested solution. I will put the answer up tomorrow. Hint: think about using pregenerated tables.
Posted by: karl | June 23, 2008 at 10:40 AM
So one way is treat your very large bitfield as an array of bytes. A btye only has 256 possible values. You could make a table called count with 256 values. Each count element is the number of high bits of its index.
count[0] = 0; // 0000
count[1] = 1; // 0001
count[2] = 1; // 0010
count[3] = 2; // 0011
...
count[255] = 8;
You pregenerate the count table.
Then your function can be
int CountHigh( byte[] bitset )
{
int sum = 0;
for( i=0; i sum += count[ bitset[i] ];
return sum;
}
There are few ways to make loop go faster but it mostly depends on the implementing language. One option is to treat is the bitset as an array of word. But then Count has to have 64K elements.
Karl
Posted by: karl | June 25, 2008 at 08:42 PM
Given that the bit vector are very sparse, and that the common "x=0" special case has already been taken care of, the gain from using a lookup table would probably not be worth the amount of code needed for specifying the lookup table and/or the code for generating it.
However, special casing the "1-bit-set" case would be simple:
...
next if x.zero?
{count+=1; next} if (x & (x-1)).zero?
....
(I hope the syntax is right).
This may yield another speed benefit; I haven't done any measurements (which is again the point of the article, I know...). Yet.
Posted by: Erik Søe Sørensen | September 02, 2009 at 08:51 AM
Performance numbers - for Perl:
If the baseline is Hacker's Delight counting method, with 0-bit check:
Adding the 1-bit special case gives 9% better performance for 10% bit saturation,
a 12% time decrease for 20% saturation,
and 4% for 30% bit saturation.
You can even use the "clear the lowest bit" trick exclusively, like this:
while x>0: {count+=1; x = x & (x-1);}
This is substantially faster (15%) than both the naïve and the HD counting method, even with zero-checking.
I'd be interested in hearing numbers for other languages, of course.
(My guess is that in C or similar, it might pay of to not branch too much - so HD with zero-checking would probably be the winner.)
Posted by: Erik Søe Sørensen | September 02, 2009 at 06:53 PM
Sorry, the test setup used above wasn't right... for one thing, the timings included a vector 'and'; for another, the bit densities were accordingly sparse.
New and better numbers:
Bit density | A->B | A->Z | H->J | H->Z | T->Z // Time reduction
5% | 38% | 87% | 23% | 32% | 28%
10% | 13% | 83% | 2% | 16% | -2%
25% | -2% | 72% | -13% | -48% | -94%
50% | -3% | 55% | -17% |-144% | -212%
where
A=naive counting w/ zero case
B=A + 1-bit special case
H=Hacker's Delight bit counting w/ zero case
J=H + 1-bit special case
T=Table version (256 entry table)
Z=counting by repeatedly clearing the lowest bit
(sorry about the formatting. Seems the blog system doesn't support anything as fancy as pre-formatted text.)
Conclusion: the code-wise simple (but somewhat sophisticated) 'Z' algorithm is consistently far better than the naïve 'A'. Perhaps more surprisingly, it even beats 'H' on low bit densities.
But to beat the table-driven algorithm, it needs bit densities below 10%.
Posted by: Erik Søe Sørensen | September 03, 2009 at 02:55 AM
There’re lots of dissertation service or thesis writing service but your great issue as this good post. Therefore, to receive more information you should buy the dissertation structure.
Posted by: KimberleyON23 | January 15, 2010 at 06:19 AM