Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: De Bruijn Sequence Generator

Author: Dieter Buerssner

Date: 10:16:18 12/30/03

Go up one level in this thread


On December 30, 2003 at 08:09:21, Gerd Isenberg wrote:

Hi Gerd,

I did not try to understand your code - seems clever :-). I also do not
remember, what your latest fastest LastOne() looked like - did it use an if and
only 32-bit muls or did it use a 64-bit mul? (I really hope, the search engine
will come back.)

>This De Bruijn Sequence generator is very helpfull in finding constants, to do a
>kind of double bitscan - finding move indices with from- and to-squares as
>single isolated bit bitboards. I found best results so far with the difference
>of the two bits, multiplying with some value based on the found DeBruijn
>sequence, and shifting the product 64-N bits right.
>
>     idx = ((tobitboard - frombitboard) * f(deBruijn)) >> (64-N);
>
>Even f(deBruijn) = deBruijn gives interesting result, but xor with some constant
>pattern like 0x101010 is more promising.

I don't understand the f(deBruijn) part together with the xor. How exactly would
f(deBruijn) look then? When I read you sentence, it seems to be the
DeBrujn-sequence xored with some constant - can this be right?

>Doing move-generation in Steffan Westcott's way, using direction wise sets for
>sliding pieces, and with all disjoint piece sets, it is possible to get an index
>range of 8K (8192 = 2*64*64). Each unique move index of this scheme does not
>only map unique move from-to-squares, but also the kind of piece.

This is too much magic for me. I am aware, that there are less than 64*64 legal
from-to combinations. But you fold in the piece type as well - how can this
work?

On my HD I found a file debruijn.ps.gz:

Using de Bruijn Sequences to Index a 1 in a Computer Word
by Charles E. Leiserson, Harald Prokop, and Keith H. Randall.

I guess you know it already, otherwise, it should be easy to find by google.
They also mention the indexing exactly "2 bits set problem" and came up with a
concrete formula.

Slightly related. I thought, the same readers interested in your post might also
be interested in the following. I reread parts of that paper, and got reminded
of one test I had done - using floating point to find the index of the bit. I
looked at some old code snippets of mine, and thought I try to measure the
performance with the help of the cycle count (RDTSC instruction) of modern x86
CPUs. I also reread some suggestions of Intel, how to use that instruction and
came up with the below source. I get partitially strange results. First the easy
part: If I interprete correctly and did not code some bugs, it will take 56
cycles on my P4 to get the bit index of 64-bit word with exactly one bit set
(using 80-bit floating point, one could get the index of the most significant
bit without isolating one bit). But other results are strange. I tried to
measure the RDTSC overhead exactly as suggested by Intel. And for numbers 1 and
2, the bit-index is found faster (incl. the CPUID/RDTSC overhead) than the
overhead alone!?

The code loops 3 times over all 64-bit numbers with one bit set (one for is
commented out, to do the loop in the other direction). The code uses one if, to
check if the highest bit is set. In 64 out of 64 cases that bit is not set - so
I guessed branch prediction would work nicely - and some overhead just once.
But, when I look at the output (the program does 3 loops) the times were the
same in any loop, when switching to set high bit and from seth high bit.

The code works only with gcc - but it should be very easy to make it work with
MSVC. Just change the #defines at the start and use MSVC asm syntax. Change long
long to __int64 and also the printf format specifier (or just don't print the
64-bit number).

I would be interested, what output others will get (please indicate the
hardware).

Also, a happy new year to you and everybody else here,
Dieter

The code:

#include <stdio.h>

/* Intel's suggestion: use always CPUID, before RDTSC to serialize things.
   Assuming, measured cycles are < 2**32 - so edx can be ignored */
#define READ_RDTSC(tstamp) do {                                     \
  unsigned int d1, d2, d3, d4;                                      \
  __asm__ __volatile__(                                             \
    "cpuid\n"                                                       \
    "rdtsc\n"                                                       \
    : "=a" (d1), "=b" (d2), "=c" (d3), "=d" (d4));                  \
    tstamp = d1;                                                    \
} while(0)

#define SUB_RDTSC(tstamp) do {                                      \
  unsigned int d1, d2, d3, d4;                                      \
  __asm__ __volatile__(                                             \
    "cpuid\n"                                                       \
    "rdtsc\n"                                                       \
    : "=a" (d1), "=b" (d2), "=c" (d3), "=d" (d4));                  \
    tstamp = d1-tstamp;                                             \
} while(0)


int main(void)
{
  float d=1.0;
  int i, x;
  unsigned long long ull;
  /* volatile, so that the time will not use a register,
     this should guarantee, that the overhead measurement
     an the real measurement are done in exactly the same
     way */
  volatile unsigned int ts;
  /* Intel's suggestion: warm up 3 times to measure overhead,
     because first "calls" to CPUID may need longer */
  READ_RDTSC(ts);
  SUB_RDTSC(ts);
  READ_RDTSC(ts);
  SUB_RDTSC(ts);
  READ_RDTSC(ts);
  SUB_RDTSC(ts);
  printf("overhead %u cycles\n", ts);
  for (i=0; i<3; i++)
    /* for (ull=1; ull != 0; ull*=2) */
    for (ull = 0x8000000000000000; ull != 0; ull/=2)
    {
      READ_RDTSC(ts);
      /* On x86, there is no fast method to load unsigned 64-bit integer
         into FPU. So take care of numbers with set "sign bit" seperately */
      if ((long long)ull < 0)
        x = 63;
      else
      {
        /* cast proably really needed. Otherwise the compiler will try to
           convert the unsigned number, which is inefficient on x86. Because
           of the if before, the compiler could know more ... */
        d = (long long)ull;
        /* assuming 32 bit unsigned int and IEEE 32-bit little endian float */
        /* when using 64-bit floating point d, or 80-bit, a similar method will
           work. Using 80-bit floats as supported by Intel x86 could save a
           typical preceding step to isolate the trailing bit. */
        x = (*(unsigned int *)&d >> 23) - 127;
      }
      SUB_RDTSC(ts);
      printf("%19Lu: %3d, %u cycles incl. overhead\n", ull, x, ts);
  }
  return 0;
}

And parts of the output on P4, 2.53 GHz:

overhead 548 cycles
[No floating point needed in first number]
9223372036854775808:  63, 560 cycles incl. overhead
[Now the typical case, using floating point]
4611686018427387904:  62, 600 cycles incl. overhead
2305843009213693952:  61, 604 cycles incl. overhead
1152921504606846976:  60, 604 cycles incl. overhead
 576460752303423488:  59, 604 cycles incl. overhead
 288230376151711744:  58, 604 cycles incl. overhead
 144115188075855872:  57, 604 cycles incl. overhead
[etc.]
                 16:   4, 604 cycles incl. overhead
                  8:   3, 604 cycles incl. overhead
                  4:   2, 604 cycles incl. overhead
                  2:   1, 440 cycles incl. overhead
                  1:   0, 444 cycles incl. overhead
[How can the last 2 numbers be explained?]
9223372036854775808:  63, 536 cycles incl. overhead
[Also faster than the overhead ...]
4611686018427387904:  62, 604 cycles incl. overhead
2305843009213693952:  61, 604 cycles incl. overhead
[etc. basically repeating the above]
                  4:   2, 604 cycles incl. overhead
                  2:   1, 440 cycles incl. overhead
                  1:   0, 444 cycles incl. overhead
9223372036854775808:  63, 528 cycles incl. overhead
4611686018427387904:  62, 604 cycles incl. overhead
2305843009213693952:  61, 604 cycles incl. overhead



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.