Author: Roberto Waldteufel
Date: 10:58:13 01/05/99
On January 03, 1999 at 20:47:04, Carsten Kossendey wrote:

>On January 03, 1999 at 14:27:16, Ernst A. Heinz wrote:
>
>>On January 03, 1999 at 10:19:17, Carsten Kossendey wrote:
>>
>>>Jumping onto the same train, I did similar tests on the PowerPC a good while
>>>ago, with around the same results (loops faster up to 6 or 7 bits), which is
>>>interesting considering that the PPC doesn't have BSF/BSR instructions;
>>>rather, you must use two instructions to do that:
>>>
>>>  cntlzw rTemp, rSource                   ; find first 1 bit
>>>  rlwnm. rSource, rSource, rTemp, 0, 30   ; rotate to LSB and mask out
>>>
>>>Carsten
>>
>>No need to use assembler here because the wonders of binary subtraction and the
>>two's complement allow for portable yet efficient ANSI-C formulations of both
>>iterative and non-iterative population-count functions/macros.
>
>[snip]
>
>>unsigned int iterative_popcount(unsigned_64 b) {
>>  unsigned int n;
>>  for (n = 0; b != 0; n++, b &= (b - 1));
>>  return n;
>>}
>
>While this approach generally works, it has two problems:
>
>Firstly, it uses 64-bit math, which is somewhat uncomfortable for those of us
>who have not received any 64-bit machines for free and cannot afford one. I
>doubt that any compiler will notice that the above code can be split into two
>halves, each processing 32 bits, with no change in the result, so it will
>perform sub-sub-sub-optimally because the (b - 1) part always needs a bunch of
>instructions where one would suffice.
>
>Secondly, on the PPC, the last time I benchmarked it, the cntlzw/rlwnm
>combination was one or two cycles faster than the addi/and combination. And
>when I write code that is used as often as this, you can bet I want every
>cycle I can get.
>
>[snip]
>
>>#define m1 ((unsigned_64) 0x5555555555555555)
>>#define m2 ((unsigned_64) 0x3333333333333333)
>>
>>unsigned int non_iterative_popcount(const unsigned_64 b) {
>>  unsigned_32 n;
>>  const unsigned_64 a = b - ((b >> 1) & m1);
>>  const unsigned_64 c = (a & m2) + ((a >> 2) & m2);
>>  n = ((unsigned_32) c) + ((unsigned_32) (c >> 32));
>>  n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F);
>>  n = (n & 0xFFFF) + (n >> 16);
>>  n = (n & 0xFF) + (n >> 8);
>>  return n;
>>}
>
>This is more or less the code that I use, except that I hand-coded it in
>assembly to work on two 32-bit subvectors in parallel, in order to avoid as
>many stalls as possible.
>
>Carsten

Hi Carsten,

Sorry for the delay - my phone has been out of order and I just got it fixed
today. I used two 32-bit chunks as well, but on my PII there are 8 juicy 64-bit
MMX registers in addition to the standard 32-bit registers, and I used these to
speed things up a bit in both routines. The trouble is that they are limited in
their uses: you can't bit-scan an MMX register, but you can move the lower 32
bits to/from memory or the 32-bit registers, and you have all the logical
instructions (shift, and, or, xor, andn, etc.) for the MMX registers. If
nothing else, they make a useful repository for temporarily storing up to 8
64-bit bitboards, with much faster store/retrieve times than RAM, even assuming
an L1 cache hit.

I like the logical-instruction routine from an aesthetic point of view, but
almost all my bitboards have too few bits to make it worthwhile.....:-(

Best wishes,
Roberto
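
For concreteness, the two-halves version of the iterative count that Carsten
alludes to (and that Roberto also says he used 32-bit chunks for) might look
roughly like the sketch below. This is not code from either poster; the
typedefs are assumptions standing in for Heinz's unsigned_32/unsigned_64 types,
and a 64-bit unsigned long long is only assumed to be available on the compiler
at hand.

  /* Assumed stand-ins for Heinz's unsigned_32/unsigned_64 types. */
  typedef unsigned int       unsigned_32;
  typedef unsigned long long unsigned_64;

  /* Iterative popcount split into two independent 32-bit halves, so the
     b &= (b - 1) step never needs multi-instruction 64-bit arithmetic on a
     32-bit machine.  The result is simply the sum of both partial counts. */
  unsigned int split_iterative_popcount(unsigned_64 b)
  {
      unsigned_32  lo = (unsigned_32) b;          /* lower 32 bits */
      unsigned_32  hi = (unsigned_32) (b >> 32);  /* upper 32 bits */
      unsigned int n  = 0;

      for (; lo != 0; n++)  lo &= (lo - 1);       /* clear lowest set bit */
      for (; hi != 0; n++)  hi &= (hi - 1);
      return n;
  }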
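
Likewise, the MMX usage Roberto describes can be sketched with the mmintrin.h
intrinsics rather than hand assembly. This is a speculative illustration, not
his actual code: the function name and the two bitboard arguments are invented
for the example, and it assumes a compiler that still provides the __m64 MMX
intrinsics.

  #include <mmintrin.h>

  /* AND two 64-bit bitboards entirely in MMX registers, then move only the
     lower 32 bits into an ordinary register, where the normal 32-bit bit-scan
     instructions can be used (MMX registers themselves cannot be bit-scanned,
     as noted above). */
  unsigned int and_low_half(__m64 pieces, __m64 mask)
  {
      __m64        hits = _mm_and_si64(pieces, mask);            /* 64-bit AND */
      unsigned int lo   = (unsigned int) _mm_cvtsi64_si32(hits); /* low 32 bits out */
      _mm_empty();  /* clear MMX state before any floating-point code */
      return lo;
  }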