Author: Matt Taylor
Date: 06:05:30 12/03/02
On December 03, 2002 at 06:40:36, Gian-Carlo Pascutto wrote:
>On December 02, 2002 at 19:43:15, Gerd Isenberg wrote:
>
>>Congrats, Walter!!
>>
>>10-bit pattern       bsf   PI2FD  btr   c     LSB_64
>>0x0000000011111133   15.3  18.0   19.1  22.8  17.8
>>0x1010111010101110   19.7  18.5   19.6  23.4  17.8
>>0x1111113300000000   20.6  18.0   19.1  22.8  17.8
>
>Does this mean that this is the fastest currently known
>FindFirstAndRemove *and* that it doesn't need assembly?
>
>Impressive for sure!
>
>Any similar tricks for lightning-speed popcounting?
>
>What's the fastest sequence you've got for popcounting so
>far? (preferably one that doesn't depend too much on new
>instructions)
>
>--
>GCP
Actually, part of his algorithm is itself a piece of one population count
algorithm. The best average-case approach is divide and conquer:
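typedef unsigned int u32; // assumes unsigned int is 32 bits wide

int pop_count(u32 x)
{
    // x becomes a vector of 2-bit sums (each 0, 1, or 2)
    x = ((x & 0xAAAAAAAA) >> 1) + (x & 0x55555555);
    // Now sum and reduce: add each high field to its low neighbor,
    // masking both sides so stale bits never leak into the next step
    x = ((x & 0xCCCCCCCC) >> 2) + (x & 0x33333333);
    x = ((x & 0xF0F0F0F0) >> 4) + (x & 0x0F0F0F0F);
    x = ((x & 0xFF00FF00) >> 8) + (x & 0x00FF00FF);
    x = ((x & 0xFFFF0000) >> 16) + (x & 0x0000FFFF);
    // Done.
    return (int)x;
}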
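A routine like this is easy to get subtly wrong (one bad mask and only some inputs break), so it is worth checking against a slow, obviously correct count. The sketch below is only an illustration: the names pop_count_naive, pop_count_dc and pop_counts_agree are made up here, and the divide-and-conquer routine is restated with fixed-width types so the block compiles on its own.

```c
#include <stdint.h>

/* Reference count: clear the lowest set bit until none remain. */
static int pop_count_naive(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        n++;
    }
    return n;
}

/* Divide-and-conquer count, restated so this block stands alone. */
static int pop_count_dc(uint32_t x)
{
    x = ((x & 0xAAAAAAAAu) >> 1)  + (x & 0x55555555u);  /* 2-bit sums  */
    x = ((x & 0xCCCCCCCCu) >> 2)  + (x & 0x33333333u);  /* 4-bit sums  */
    x = ((x & 0xF0F0F0F0u) >> 4)  + (x & 0x0F0F0F0Fu);  /* 8-bit sums  */
    x = ((x & 0xFF00FF00u) >> 8)  + (x & 0x00FF00FFu);  /* 16-bit sums */
    x = ((x & 0xFFFF0000u) >> 16) + (x & 0x0000FFFFu);  /* final sum   */
    return (int)x;
}

/* Returns 1 if both routines agree on every sample, else 0. */
static int pop_counts_agree(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0x11111133u, 0x10101110u, 0xFFFFFFFFu
    };
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (pop_count_naive(samples[i]) != pop_count_dc(samples[i]))
            return 0;
    return 1;
}
```

The sample list is arbitrary; a fuller check would walk many more values, for example every 16-bit pattern placed in each half of the word.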
The 64-bit variant is the same: use 64-bit integers, widen each mask to 64 bits
by repeating its pattern, and add 1 extra assignment statement at the end:
x = ((x & 0xFFFFFFFF00000000) >> 32) + (x & 0x00000000FFFFFFFF);
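Written out in full, the 64-bit version might look like the sketch below. It assumes a C99 compiler with uint64_t from <stdint.h>; the name pop_count64 is made up for illustration.

```c
#include <stdint.h>

/* 64-bit divide-and-conquer population count (illustrative sketch). */
int pop_count64(uint64_t x)
{
    /* Each step adds neighboring fields and doubles the field width. */
    x = ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1)  + (x & 0x5555555555555555ULL);
    x = ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2)  + (x & 0x3333333333333333ULL);
    x = ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4)  + (x & 0x0F0F0F0F0F0F0F0FULL);
    x = ((x & 0xFF00FF00FF00FF00ULL) >> 8)  + (x & 0x00FF00FF00FF00FFULL);
    x = ((x & 0xFFFF0000FFFF0000ULL) >> 16) + (x & 0x0000FFFF0000FFFFULL);
    /* The one extra step: fold the high 32 bits into the low 32 bits. */
    x = ((x & 0xFFFFFFFF00000000ULL) >> 32) + (x & 0x00000000FFFFFFFFULL);
    return (int)x;
}
```

Each of the three test patterns quoted above has 10 bits set, so pop_count64 should return 10 for all of them.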
I couldn't find the older AMD manuals, but a quick Google search turns up a page
that restates the algorithm and gives AMD's optimized assembly code for it.
(Their assembly uses only instructions available on the 386 and earlier
processors.) I know imul took 10 clocks on the Pentium. I believe I have seen it
stated in the literature that imul now costs only 2 or 3 cycles, much to my
disbelief.
http://www.df.lth.se/~john_e/gems/gem002d.html
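The cost of imul matters because a well-known variant of this routine replaces the last two shift-and-add steps with a single multiply. The sketch below shows that general technique in C; it is not a transcription of the assembly on the linked page, and the name pop_count_mul is made up for illustration.

```c
#include <stdint.h>

/* Population count that finishes with a multiply (illustrative sketch). */
int pop_count_mul(uint32_t x)
{
    /* 2-bit sums: subtracting the high bit of each pair gives 0, 1, or 2. */
    x = x - ((x >> 1) & 0x55555555u);
    /* 4-bit sums, each 0..4. */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    /* Byte sums, each 0..8; masking afterwards is safe because 8 fits in 4 bits. */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* Multiplying by 0x01010101 accumulates all four byte sums in the top byte. */
    return (int)((x * 0x01010101u) >> 24);
}
```

Whether this beats the pure shift-and-add form depends on how many cycles the multiply takes on the target processor, which is exactly the point in question.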