Author: Dieter Buerssner
Date: 10:16:18 12/30/03
On December 30, 2003 at 08:09:21, Gerd Isenberg wrote:

Hi Gerd,

I did not try to understand your code - seems clever :-). I also do not remember what your latest fastest LastOne() looked like - did it use an if and only 32-bit muls, or did it use a 64-bit mul? (I really hope the search engine will come back.)

>This De Bruijn Sequence generator is very helpful in finding constants, to do a
>kind of double bitscan - finding move indices with from- and to-squares as
>single isolated bit bitboards. I found the best results so far with the difference
>of the two bits, multiplying with some value based on the found De Bruijn
>sequence, and shifting the product 64-N bits right.
>
>    idx = ((tobitboard - frombitboard) * f(deBruijn)) >> (64-N);
>
>Even f(deBruijn) = deBruijn gives interesting results, but xor with some constant
>pattern like 0x101010 is more promising.

I don't understand the f(deBruijn) part together with the xor. How exactly would f(deBruijn) look then? When I read your sentence, it seems to be the De Bruijn sequence xored with some constant - can this be right?

>Doing move-generation in Steffan Westcott's way, using direction wise sets for
>sliding pieces, and with all disjoint piece sets, it is possible to get an index
>range of 8K (8192 = 2*64*64). Each unique move index of this scheme does not
>only map unique move from-to-squares, but also the kind of piece.

This is too much magic for me. I am aware that there are fewer than 64*64 legal from-to combinations. But you fold in the piece type as well - how can this work?

On my HD I found a file debruijn.ps.gz: "Using de Bruijn Sequences to Index a 1 in a Computer Word" by Charles E. Leiserson, Harald Prokop, and Keith H. Randall. I guess you know it already; otherwise, it should be easy to find by Google. They also mention indexing the exact "2 bits set" problem and came up with a concrete formula.

Slightly related: I thought the readers interested in your post might also be interested in the following.
I reread parts of that paper, and got reminded of one test I had done - using floating point to find the index of the bit. I looked at some old code snippets of mine, and thought I'd try to measure the performance with the help of the cycle counter (RDTSC instruction) of modern x86 CPUs. I also reread some suggestions from Intel on how to use that instruction, and came up with the source below.

I get partially strange results. First the easy part: if I interpret correctly and did not code some bugs, it takes 56 cycles on my P4 to get the bit index of a 64-bit word with exactly one bit set (using 80-bit floating point, one could get the index of the most significant bit without isolating one bit).

But other results are strange. I tried to measure the RDTSC overhead exactly as suggested by Intel. And for the numbers 1 and 2, the bit index is found faster (incl. the CPUID/RDTSC overhead) than the overhead alone!?

The code loops 3 times over all 64-bit numbers with one bit set (one for loop is commented out, to do the loop in the other direction). The code uses one if, to check whether the highest bit is set. In 63 out of 64 cases that bit is not set - so I guessed branch prediction would work nicely - and some overhead just once. But when I look at the output (the program does 3 loops), the times were the same in every loop, when switching to the set high bit and away from the set high bit.

The code works only with gcc - but it should be very easy to make it work with MSVC. Just change the #defines at the start to use MSVC asm syntax, change long long to __int64, and also the printf format specifier (or just don't print the 64-bit number). I would be interested in what output others get (please indicate the hardware).

Also, a happy new year to you and everybody else here,

Dieter

The code:

#include <stdio.h>

/* Intel's suggestion: always use CPUID before RDTSC to serialize things.
   Assuming measured cycles are < 2**32, so edx can be ignored */
#define READ_RDTSC(tstamp) do { \
    unsigned int d1, d2, d3, d4; \
    __asm__ __volatile__( \
        "cpuid\n" \
        "rdtsc\n" \
        : "=a" (d1), "=b" (d2), "=c" (d3), "=d" (d4)); \
    tstamp = d1; \
} while(0)

#define SUB_RDTSC(tstamp) do { \
    unsigned int d1, d2, d3, d4; \
    __asm__ __volatile__( \
        "cpuid\n" \
        "rdtsc\n" \
        : "=a" (d1), "=b" (d2), "=c" (d3), "=d" (d4)); \
    tstamp = d1-tstamp; \
} while(0)

int main(void)
{
    float d = 1.0;
    int i, x;
    unsigned long long ull;
    /* volatile, so that the time will not use a register; this should
       guarantee that the overhead measurement and the real measurement
       are done in exactly the same way */
    volatile unsigned int ts;

    /* Intel's suggestion: warm up 3 times to measure overhead, because
       the first "calls" to CPUID may take longer */
    READ_RDTSC(ts); SUB_RDTSC(ts);
    READ_RDTSC(ts); SUB_RDTSC(ts);
    READ_RDTSC(ts); SUB_RDTSC(ts);
    printf("overhead %u cycles\n", ts);

    for (i=0; i<3; i++)
        /* for (ull=1; ull != 0; ull*=2) */
        for (ull = 0x8000000000000000ULL; ull != 0; ull/=2)
        {
            READ_RDTSC(ts);
            /* On x86, there is no fast method to load an unsigned 64-bit
               integer into the FPU. So take care of numbers with the
               "sign bit" set separately */
            if ((long long)ull < 0)
                x = 63;
            else
            {
                /* cast probably really needed. Otherwise the compiler
                   will try to convert the unsigned number, which is
                   inefficient on x86. Because of the if before, the
                   compiler could know more ... */
                d = (long long)ull;
                /* assuming 32-bit unsigned int and IEEE 32-bit little
                   endian float */
                /* when using 64-bit floating point d, or 80-bit, a
                   similar method will work. Using 80-bit floats as
                   supported by Intel x86 could save a typical preceding
                   step to isolate the trailing bit. */
                x = (*(unsigned int *)&d >> 23) - 127;
            }
            SUB_RDTSC(ts);
            printf("%19Lu: %3d, %u cycles incl. overhead\n", ull, x, ts);
        }
    return 0;
}

And parts of the output on a P4, 2.53 GHz:

overhead 548 cycles
[No floating point needed for the first number]
9223372036854775808: 63, 560 cycles incl.
overhead
[Now the typical case, using floating point]
4611686018427387904: 62, 600 cycles incl. overhead
2305843009213693952: 61, 604 cycles incl. overhead
1152921504606846976: 60, 604 cycles incl. overhead
 576460752303423488: 59, 604 cycles incl. overhead
 288230376151711744: 58, 604 cycles incl. overhead
 144115188075855872: 57, 604 cycles incl. overhead
[etc.]
                 16:  4, 604 cycles incl. overhead
                  8:  3, 604 cycles incl. overhead
                  4:  2, 604 cycles incl. overhead
                  2:  1, 440 cycles incl. overhead
                  1:  0, 444 cycles incl. overhead
[How can the last 2 numbers be explained?]
9223372036854775808: 63, 536 cycles incl. overhead
[Also faster than the overhead ...]
4611686018427387904: 62, 604 cycles incl. overhead
2305843009213693952: 61, 604 cycles incl. overhead
[etc. basically repeating the above]
                  4:  2, 604 cycles incl. overhead
                  2:  1, 440 cycles incl. overhead
                  1:  0, 444 cycles incl. overhead
9223372036854775808: 63, 528 cycles incl. overhead
4611686018427387904: 62, 604 cycles incl. overhead
2305843009213693952: 61, 604 cycles incl. overhead