Author: Carsten Kossendey
Date: 17:47:04 01/03/99
Go up one level in this thread
On January 03, 1999 at 14:27:16, Ernst A. Heinz wrote: >On January 03, 1999 at 10:19:17, Carsten Kossendey wrote: > >>Jumping onto the same train, I did similar tests on the PowerPC a good while >>ago, with around the same results (loops faster up to 6 or 7 bits), which is >>interesting considering that the PPC doesn't have BSF/BSR instructions, but you >>must rather use two instructions to do that: >> >> cntlzw rTemp, rSource ; find first 1 bit >> rlwnm. rSource, rSource, rTemp, 0, 30 ; rotate to LSB and mask out >> >>Carsten > >No need to use assembler here because the wonders of binary subtraction and the >two's complement allow for portable yet efficient ANSI-C formulations of both >iterative and non-iterative population-count functions/macros. [snip] >unsigned int iterative_popcount(unsigned_64 b) { > unsigned int n; > for (n = 0; b != 0; n++, b &= (b - 1)); > return n; >} While this approach generally works, it has two problems: Firstly, it uses 64 bit math which is sorta uncomfortable for those of us not having received any 64 bit machines for free, and not being able to afford such. I doubt that any compiler will notice that the above code can be split into two halves, each processing 32 bits, with no change in the result, hence it will perform sub-sub-sub-optimal due to the (b-1) part always having to use a bunch of instructions where one is sufficient. Secondly, on the PPC, last time I benchmarked it, the cntlzw/rlwnm combo was one or two cycles faster than the addi/and combo. And when I write some code which is used as often as that, you'll bet your head I want to use every cycle I can get. [snip] >#define m1 ((unsigned_64) 0x5555555555555555) >#define m2 ((unsigned_64) 0x3333333333333333) > >unsigned int non_iterative_popcount(const unsigned_64 b) { > unsigned_32 n; > const unsigned_64 a = b - ((b >> 1) & m1); > const unsigned_64 c = (a & m2) + ((a >> 2) & m2); > n = ((unsigned_32) c) + ((unsigned_32) (c >> 32)); > n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); > n = (n & 0xFFFF) + (n >> 16); > n = (n & 0xFF) + (n >> 8); > return n; >} This is more or less the code that I use, only that I hand-coded it in assembly to work on two 32 bit subvectors in parallel, in order to avoid as many stalls as possible. Carsten
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.