Author: Gerd Isenberg
Date: 16:19:02 11/29/02
Hi all Bitboarders, Recently there was an interesting posting from Joel Venessan. He introduced a bitscan idea via 64-bit mod 67 to get an unique value for each single bit exclusively set in a bitboard, most likely determined by a previous (b & -b) statement. With two small arrays one may map board- and/or 0x88-coordinates to mod67-coordinates and vice versa. But the div/mod instruction is too slow, so i was looking to Paul Hsieh's "Integer division and modulus by constants" site: http://www.azillionmonkeys.com/qed/adiv.html This formula for k*(2^t) with odd k (2^t is power of two here, not xor :-) a % (k*2^t) == a - (((((int)(((2^32)+k-1)/k)) * (a) / (2^32))) & (-2^t))*k is only valid for 32 bit integers. I played a bit around and this is the result for mod 67, don't know whether there is something smarter on 32 bit machines (k is odd, t is zero, & (-1) not necessary): ------------------------------------------------------------------------------- unsigned int mod67 (unsigned __int64 a) { unsigned __int64 h = 0x03D226357E16ECE6 * (a >> 32); // (2^64+67-1)/67 unsigned __int64 l = 0x03D22636 * (a & 0xffffffff); // (2^32+67-1)/67 int modulus = (int)(a - (__int64)((h+l)>>32) * 67); return (unsigned int) modulus + (modulus < 0 ) * 67; } ------------------------------------------------------------------------------- Really surprised by the msc compiler, rather compact code (the mul 67 are some lea-instructions). On my Athlon it's really fast and may be an alternative to the slow "bitscan" instruction pair which are vector path instructions and require inline assembly. The routine is about three times faster than using the pure 64-bit modulo operator %. Whether the function is really exact, due to some rounding errors of the constants, exceeds my math skills. I tried a lot of values in some loops, comparing with %-operator, and found no error so far. At least the function is "safe" enaugh for the desired purpose, doing a kind of bitscan. --- The mod5811 is nice for getting unique values of combined (ored) "from"- and "to"-bitboards, but the range is rather huge and requires arrays of 2*5800 size. It's shlightly faster due to smaller consts, i guess. ------------------------------------------------------------------------------- unsigned int mod5811 (unsigned __int64 a) { unsigned __int64 h = 0xB4725D7BBFAE4 * (a >> 32); // (2^64+5811-1)/5811 unsigned __int64 l = 0xB4726 * (a & 0xffffffff); // (2^32+5811-1)/5811 int modulus = (int)(a - (__int64)((h+l)>>32) * 5811); if ( modulus < 0 ) modulus += 5811; return (unsigned int) modulus; } ------------------------------------------------------------------------------- Another well known bitscan idea for singular bitboards is to use population count algorithms after subtracting one: bitidx 76543210 bit 5 set 00100000 -1 00011111 and count the "ones". Based on the mmx-popcount from AMD's Athlon Optimization Guide i tried a version to count the popularity of two bitboards simultaniously. On my current K7-2.1G i found that subtracting a one-quadword with mmx (which only supports dwords) is faster than in C (the substracted qwords have additional store/load). The routine is faster than the bsf-pair (50% high/low- word probability). Times in seconds for 10^9 runs in a loop K7-2.1G: mmx-parallel popcount 19.7 two asm bsf pairs v1 24.3 mod5811 26.4 two asm bsf pairs v2 26.6 64bit % 5811 72.5 Cheers, Gerd ------------------------------------------------------------------------------- // Athlon only due to pswapd and psadbw // precondition: BBs have exactly one bit set //=========================================== int dualBitSearchMMX(BitBoard bb1, BitBoard bb2) { static const struct { BitBoard C33; BitBoard highone; BitBoard C55; BitBoard C0F; } BitBoardConsts = { 0x3333333333333333, // C33 0x0000000100000000, // highone 0x5555555555555555, // C55 0x0f0f0f0f0f0f0f0f, // C0F }; __asm { pxor mm3, mm3 pxor mm4, mm4 movd mm0, [bb1] punpckldq mm0, [bb1+4] movd mm2, [bb2] punpckldq mm2, [bb2+4] lea eax, [BitBoardConsts] // qword sub one, hammer or p4 have psubq, // but psubd must consider overflow pcmpeqd mm3, mm0 ; sq1 < 32 ? 0xf..0.. : 0x0..f.. pcmpeqd mm4, mm2 ; sq2 < 32 ? 0xf..0.. : 0x0..f.. pswapd mm1, [eax].highone ; low 1 pandn mm3, [eax].highone ; as borrow if sq1 >= 32 pandn mm4, [eax].highone ; as borrow if sq2 >= 32 psubd mm0, mm1 ; bb1 - 1 psubd mm2, mm1 ; bb2 - 1 psubd mm0, mm3 ; bb1 - borrow*(1<<32) psubd mm2, mm4 ; bb2 - borrow*(1<<32) // two parallel popCounts with mm0, mm2 movq mm1, mm0 movq mm3, mm2 psrld mm0, 1 psrld mm2, 1 pand mm0, [eax].C55 pand mm2, [eax].C55 psubd mm1, mm0 psubd mm3, mm2 movq mm0, mm1 movq mm2, mm3 psrld mm1, 2 psrld mm3, 2 pand mm0, [eax].C33 pand mm2, [eax].C33 pand mm1, [eax].C33 pand mm3, [eax].C33 paddd mm0, mm1 paddd mm2, mm3 movq mm1, mm0 movq mm3, mm2 psrld mm0, 4 psrld mm2, 4 paddd mm0, mm1 paddd mm2, mm3 pxor mm1, mm1 pand mm0, [eax].C0F pand mm2, [eax].C0F psadbw mm0, mm1 ; add all bytes psadbw mm2, mm1 ; add all bytes psllq mm0, 6 ; * 64 por mm0, mm2 // 64*popCount(bb1-1) + popCount(bb2-1) movd eax, mm0 } } the rountines i compared with: int BitSearch_v1 (BitBoard bb) { __asm { bsf eax,[bb] jnz found bsf eax,[bb+4] xor eax, 32 found: } } int BitSearch_v2(BitBoard bb) { __asm { bsf eax,[bb+4] xor eax, 32 bsf eax,[bb] } }
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.