Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Countbits() Function

Author: Carsten Kossendey

Date: 17:47:04 01/03/99

Go up one level in this thread


On January 03, 1999 at 14:27:16, Ernst A. Heinz wrote:

>On January 03, 1999 at 10:19:17, Carsten Kossendey wrote:
>
>>Jumping onto the same train, I did similar tests on the PowerPC a good while
>>ago, with around the same results (loops faster up to 6 or 7 bits), which is
>>interesting considering that the PPC doesn't have BSF/BSR instructions, but you
>>must rather use two instructions to do that:
>>
>>  cntlzw rTemp,   rSource               ; find first 1 bit
>>  rlwnm. rSource, rSource, rTemp, 0, 30 ; rotate to LSB and mask out
>>
>>Carsten
>
>No need to use assembler here because the wonders of binary subtraction and the
>two's complement allow for portable yet efficient ANSI-C formulations of both
>iterative and non-iterative population-count functions/macros.

[snip]

>unsigned int iterative_popcount(unsigned_64 b) {
>    unsigned int n;
>    for (n = 0; b != 0; n++, b &= (b - 1));
>    return n;
>}

While this approach generally works, it has two problems:

Firstly, it uses 64 bit math which is sorta uncomfortable for those of us not
having received any 64 bit machines for free, and not being able to afford such.
I doubt that any compiler will notice that the above code can be split into two
halves, each processing 32 bits, with no change in the result, hence it will
perform sub-sub-sub-optimal due to the (b-1) part always having to use a bunch
of instructions where one is sufficient.

Secondly, on the PPC, last time I benchmarked it, the cntlzw/rlwnm combo was one
or two cycles faster than the addi/and combo. And when I write some code which
is used as often as that, you'll bet your head I want to use every cycle I can
get.

[snip]

>#define m1 ((unsigned_64) 0x5555555555555555)
>#define m2 ((unsigned_64) 0x3333333333333333)
>
>unsigned int non_iterative_popcount(const unsigned_64 b) {
>    unsigned_32 n;
>    const unsigned_64 a = b - ((b >> 1) & m1);
>    const unsigned_64 c = (a & m2) + ((a >> 2) & m2);
>    n = ((unsigned_32) c) + ((unsigned_32) (c >> 32));
>    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F);
>    n = (n & 0xFFFF) + (n >> 16);
>    n = (n & 0xFF) + (n >> 8);
>    return n;
>}

This is more or less the code that I use, only that I hand-coded it in assembly
to work on two 32 bit subvectors in parallel, in order to avoid as many stalls
as possible.

Carsten



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.