Author: Roberto Waldteufel
Date: 10:58:13 01/05/99
On January 03, 1999 at 20:47:04, Carsten Kossendey wrote:

>On January 03, 1999 at 14:27:16, Ernst A. Heinz wrote:
>
>>On January 03, 1999 at 10:19:17, Carsten Kossendey wrote:
>>
>>>Jumping onto the same train, I did similar tests on the PowerPC a good while
>>>ago, with around the same results (loops faster up to 6 or 7 bits), which is
>>>interesting considering that the PPC doesn't have BSF/BSR instructions;
>>>rather, you must use two instructions to do that:
>>>
>>>  cntlzw rTemp, rSource                   ; find first 1 bit
>>>  rlwnm. rSource, rSource, rTemp, 0, 30   ; rotate to LSB and mask out
>>>
>>>Carsten
>>
>>No need to use assembler here because the wonders of binary subtraction and the
>>two's complement allow for portable yet efficient ANSI-C formulations of both
>>iterative and non-iterative population-count functions/macros.
>
>[snip]
>
>>unsigned int iterative_popcount(unsigned_64 b) {
>>  unsigned int n;
>>  for (n = 0; b != 0; n++, b &= (b - 1));
>>  return n;
>>}
>
>While this approach generally works, it has two problems:
>
>Firstly, it uses 64-bit math, which is somewhat uncomfortable for those of us
>who have not received any 64-bit machines for free and cannot afford one. I
>doubt that any compiler will notice that the above code can be split into two
>halves, each processing 32 bits, with no change in the result, so it will
>perform sub-sub-sub-optimally because the (b - 1) part always needs a bunch of
>instructions where one would suffice.
>
>Secondly, on the PPC, the last time I benchmarked it, the cntlzw/rlwnm
>combination was one or two cycles faster than the addi/and combination. And
>when I write code that is used as often as this, you can bet I want every
>cycle I can get.
>
>[snip]
>
>>#define m1 ((unsigned_64) 0x5555555555555555)
>>#define m2 ((unsigned_64) 0x3333333333333333)
>>
>>unsigned int non_iterative_popcount(const unsigned_64 b) {
>>  unsigned_32 n;
>>  const unsigned_64 a = b - ((b >> 1) & m1);
>>  const unsigned_64 c = (a & m2) + ((a >> 2) & m2);
>>  n = ((unsigned_32) c) + ((unsigned_32) (c >> 32));
>>  n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F);
>>  n = (n & 0xFFFF) + (n >> 16);
>>  n = (n & 0xFF) + (n >> 8);
>>  return n;
>>}
>
>This is more or less the code that I use, except that I hand-coded it in
>assembly to work on two 32-bit subvectors in parallel, in order to avoid as
>many stalls as possible.
>
>Carsten

Hi Carsten,

Sorry for the delay - my phone has been out of order and I just got it fixed
today. I used two 32-bit chunks as well, but on my PII there are 8 juicy 64-bit
MMX registers in addition to the standard 32-bit registers, and I used these to
speed things up a bit in both routines. The trouble is that they are limited in
their uses: you can't bit-scan an MMX register, but you can move the lower 32
bits to/from memory or the 32-bit registers, and you have all the logical
instructions (shift, and, or, xor, andn, etc.) for the MMX registers. If
nothing else, they make a useful repository for temporarily storing up to 8
64-bit bitboards, with much faster store/retrieve times than RAM, even assuming
an L1 cache hit.

I like the logical-instruction routine from an aesthetic point of view, but
almost all my bitboards have too few bits to make it worthwhile.....:-(

Best wishes,
Roberto
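
For concreteness, the two-halves version of the iterative count that Carsten
alludes to (and that Roberto also says he used 32-bit chunks for) might look
roughly like the sketch below. This is not code from either poster; the
typedefs are assumptions standing in for Heinz's unsigned_32/unsigned_64 types,
and a 64-bit unsigned long long is only assumed to be available on the compiler
at hand.

  /* Assumed stand-ins for Heinz's unsigned_32/unsigned_64 types. */
  typedef unsigned int       unsigned_32;
  typedef unsigned long long unsigned_64;

  /* Iterative popcount split into two independent 32-bit halves, so the
     b &= (b - 1) step never needs multi-instruction 64-bit arithmetic on a
     32-bit machine.  The result is simply the sum of both partial counts. */
  unsigned int split_iterative_popcount(unsigned_64 b)
  {
      unsigned_32  lo = (unsigned_32) b;          /* lower 32 bits */
      unsigned_32  hi = (unsigned_32) (b >> 32);  /* upper 32 bits */
      unsigned int n  = 0;

      for (; lo != 0; n++)  lo &= (lo - 1);       /* clear lowest set bit */
      for (; hi != 0; n++)  hi &= (hi - 1);
      return n;
  }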
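
Likewise, the MMX usage Roberto describes can be sketched with the mmintrin.h
intrinsics rather than hand assembly. This is a speculative illustration, not
his actual code: the function name and the two bitboard arguments are invented
for the example, and it assumes a compiler that still provides the __m64 MMX
intrinsics.

  #include <mmintrin.h>

  /* AND two 64-bit bitboards entirely in MMX registers, then move only the
     lower 32 bits into an ordinary register, where the normal 32-bit bit-scan
     instructions can be used (MMX registers themselves cannot be bit-scanned,
     as noted above). */
  unsigned int and_low_half(__m64 pieces, __m64 mask)
  {
      __m64        hits = _mm_and_si64(pieces, mask);            /* 64-bit AND */
      unsigned int lo   = (unsigned int) _mm_cvtsi64_si32(hits); /* low 32 bits out */
      _mm_empty();  /* clear MMX state before any floating-point code */
      return lo;
  }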