Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Population of disjoint Attacksets

Author: Dieter Buerssner

Date: 13:42:53 05/30/04

Go up one level in this thread


Perhaps your results are not strange at all. Your function just seems to be fast
when everything is inlined :-) On my computer, my method is a bit faster.

I tried a few things. Some observations. With MSVC 6, changing your odd() and
maj() functions to macros gives reproducably faster times here. 4.24 vs 4.49 ns
per 32-bit popcount (This would translate to 84.8 vs 89.8 s in your 1e9 loop). I
also used clock() for timing, but did it slightly different than you with some
#define tricks, so that each function will be called inside exactly the same
environment (each test run for only one function).

With Gcc, using inline instead of the macro, was faster ...

My pc_10bb needs 3.94 ns. It was coded not very well. One can also code it
without the need of temporary arrays (code even gets smaller). Both Gcc and MSVC
can translate it, so that it will need 6 registers (so no temporary space on the
stack at all). An unrolled version will need 5 registers (about 280
instructions). I show the loop version at the end.

This somewhat more carfully loop version needs 3.35ns. Unrolled 3.08 ns. This is
only 7.8 cycles per 32-bit popcount. The fastest I got your function, was 4.24
ns/32-bit word (using MSVC, macro odd/maj and a version of popcount similar to
yours, but without multiplication, and adding the counters one stage earlier -
all called functions inlined).

I tested with an array of 100 or 1000 64-bit words, and looped through the array
with increment 10. In each word, from 0 to 128 bits were set randomly (one bit
can be set more than once). Typical output:

Testing pc_10bb3
  0:   0.450 s
  1:   0.441 s
  2:   0.440 s
  3:   0.451 s
  4:   0.441 s
  5:   0.450 s
  6:   0.451 s
  7:   0.451 s
  8:   0.450 s
 16:   0.451 s
 32:   0.441 s
 64:   0.460 s
128:   0.451 s
sum:   5.828 s for 87240400 calls, 66.804 ns/call, 3.340 ns/32-bit word

(The intention of the differently populated words was first, to see when the
typical x &= x-1 loop popcount starts to get worse, than other methods like
table lookup or the popcount at the start of this thread. Also it might help a
bit to catch errors.).

I checked, that all returns were correct. The number of calls is just so, that
the sum of found bits (added up in the testing loop) can just not overflow a 32
bit counter. The testing loop was actually 2 loops. One typically calling the
function 100 times, where the actual counting function can possibly be inlined
(it depends on #define, not on automatic inlining by the compiler), and on outer
testing loop (which calls the inner testing loop function non inlined).


Cheers,
Dieter

int pc_10bb3(BU x[])
{
  unsigned int a, b, c, i;
  unsigned int *xw = (unsigned int *)x; /* Of course not really portable */

  /* Start with the two "odd" words, then 6 triples */
  a = xw[0];
  a = a - ((a>>1) & 0x55555555);
  a = (a & 0x33333333) + ((a>>2) & 0x33333333);
  b = xw[1];
  xw += 2;
  b = b - ((b>>1) & 0x55555555);
  a += (b & 0x33333333) + ((b>>2) & 0x33333333);
  c = (a & 0x0f0f0f0f) + ((a>>4) & 0x0f0f0f0f);
  i = 6;
  do
  {
    a = xw[0];
    a = a - ((a>>1) & 0x55555555);
    a = (a & 0x33333333) + ((a>>2) & 0x33333333);
    b = xw[1];
    b = b - ((b>>1) & 0x55555555);
    a += (b & 0x33333333) + ((b>>2) & 0x33333333);
    b = xw[2];
    xw += 3;
    b = b - ((b>>1) & 0x55555555);
    a += (b & 0x33333333) + ((b>>2) & 0x33333333);
    c += (a & 0x0f0f0f0f) + ((a>>4) & 0x0f0f0f0f);
  }
  while (--i);
  /* Maybe, it is possible to save some masks in the last 2 steps
     but it won't really matter */
  c = (c & 0x00ff00ff) + ((c>>8)  & 0x00ff00ff);
  // return (c & 0x0000ffff) + ((c>>16) & 0x0000ffff);
  c += c >> 16;
  return c & 0xffff;
}



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.