Author: Gerd Isenberg
Date: 16:19:02 11/29/02
Hi all Bitboarders,
Recently there was an interesting posting from Joel Venessan. He introduced a
bitscan idea via 64-bit mod 67 to get an unique value for each single bit
exclusively set in a bitboard, most likely determined by a previous (b & -b)
statement. With two small arrays one may map board- and/or 0x88-coordinates to
mod67-coordinates and vice versa. But the div/mod instruction is too slow, so i
was looking to Paul Hsieh's "Integer division and modulus by constants" site:
http://www.azillionmonkeys.com/qed/adiv.html
This formula for k*(2^t) with odd k (2^t is power of two here, not xor :-)
a % (k*2^t) == a - (((((int)(((2^32)+k-1)/k)) * (a) / (2^32))) & (-2^t))*k
is only valid for 32 bit integers. I played a bit around and this is the result
for mod 67, don't know whether there is something smarter on 32 bit machines
(k is odd, t is zero, & (-1) not necessary):
-------------------------------------------------------------------------------
unsigned int mod67 (unsigned __int64 a)
{
unsigned __int64 h = 0x03D226357E16ECE6 * (a >> 32); // (2^64+67-1)/67
unsigned __int64 l = 0x03D22636 * (a & 0xffffffff); // (2^32+67-1)/67
int modulus = (int)(a - (__int64)((h+l)>>32) * 67);
return (unsigned int) modulus + (modulus < 0 ) * 67;
}
-------------------------------------------------------------------------------
Really surprised by the msc compiler, rather compact code (the mul 67 are some
lea-instructions). On my Athlon it's really fast and may be an alternative to
the slow "bitscan" instruction pair which are vector path instructions and
require inline assembly. The routine is about three times faster than using the
pure 64-bit modulo operator %.
Whether the function is really exact, due to some rounding errors of the
constants, exceeds my math skills.
I tried a lot of values in some loops, comparing with %-operator, and found no
error so far. At least the function is "safe" enaugh for the desired purpose,
doing a kind of bitscan.
---
The mod5811 is nice for getting unique values of combined (ored) "from"- and
"to"-bitboards, but the range is rather huge and requires arrays of 2*5800 size.
It's shlightly faster due to smaller consts, i guess.
-------------------------------------------------------------------------------
unsigned int mod5811 (unsigned __int64 a)
{
unsigned __int64 h = 0xB4725D7BBFAE4 * (a >> 32); // (2^64+5811-1)/5811
unsigned __int64 l = 0xB4726 * (a & 0xffffffff); // (2^32+5811-1)/5811
int modulus = (int)(a - (__int64)((h+l)>>32) * 5811);
if ( modulus < 0 )
modulus += 5811;
return (unsigned int) modulus;
}
-------------------------------------------------------------------------------
Another well known bitscan idea for singular bitboards is to use population
count algorithms after subtracting one:
bitidx 76543210
bit 5 set 00100000
-1 00011111 and count the "ones".
Based on the mmx-popcount from AMD's Athlon Optimization Guide i tried a version
to count the popularity of two bitboards simultaniously. On my current K7-2.1G i
found that subtracting a one-quadword with mmx (which only supports dwords) is
faster than in C (the substracted qwords have additional store/load). The
routine is faster than the bsf-pair (50% high/low- word probability).
Times in seconds for 10^9 runs in a loop K7-2.1G:
mmx-parallel popcount 19.7
two asm bsf pairs v1 24.3
mod5811 26.4
two asm bsf pairs v2 26.6
64bit % 5811 72.5
Cheers,
Gerd
-------------------------------------------------------------------------------
// Athlon only due to pswapd and psadbw
// precondition: BBs have exactly one bit set
//===========================================
int dualBitSearchMMX(BitBoard bb1, BitBoard bb2)
{
static const struct
{
BitBoard C33;
BitBoard highone;
BitBoard C55;
BitBoard C0F;
} BitBoardConsts =
{
0x3333333333333333, // C33
0x0000000100000000, // highone
0x5555555555555555, // C55
0x0f0f0f0f0f0f0f0f, // C0F
};
__asm
{
pxor mm3, mm3
pxor mm4, mm4
movd mm0, [bb1]
punpckldq mm0, [bb1+4]
movd mm2, [bb2]
punpckldq mm2, [bb2+4]
lea eax, [BitBoardConsts]
// qword sub one, hammer or p4 have psubq,
// but psubd must consider overflow
pcmpeqd mm3, mm0 ; sq1 < 32 ? 0xf..0.. : 0x0..f..
pcmpeqd mm4, mm2 ; sq2 < 32 ? 0xf..0.. : 0x0..f..
pswapd mm1, [eax].highone ; low 1
pandn mm3, [eax].highone ; as borrow if sq1 >= 32
pandn mm4, [eax].highone ; as borrow if sq2 >= 32
psubd mm0, mm1 ; bb1 - 1
psubd mm2, mm1 ; bb2 - 1
psubd mm0, mm3 ; bb1 - borrow*(1<<32)
psubd mm2, mm4 ; bb2 - borrow*(1<<32)
// two parallel popCounts with mm0, mm2
movq mm1, mm0
movq mm3, mm2
psrld mm0, 1
psrld mm2, 1
pand mm0, [eax].C55
pand mm2, [eax].C55
psubd mm1, mm0
psubd mm3, mm2
movq mm0, mm1
movq mm2, mm3
psrld mm1, 2
psrld mm3, 2
pand mm0, [eax].C33
pand mm2, [eax].C33
pand mm1, [eax].C33
pand mm3, [eax].C33
paddd mm0, mm1
paddd mm2, mm3
movq mm1, mm0
movq mm3, mm2
psrld mm0, 4
psrld mm2, 4
paddd mm0, mm1
paddd mm2, mm3
pxor mm1, mm1
pand mm0, [eax].C0F
pand mm2, [eax].C0F
psadbw mm0, mm1 ; add all bytes
psadbw mm2, mm1 ; add all bytes
psllq mm0, 6 ; * 64
por mm0, mm2
// 64*popCount(bb1-1) + popCount(bb2-1)
movd eax, mm0
}
}
the rountines i compared with:
int BitSearch_v1 (BitBoard bb)
{
__asm
{
bsf eax,[bb]
jnz found
bsf eax,[bb+4]
xor eax, 32
found:
}
}
int BitSearch_v2(BitBoard bb)
{
__asm
{
bsf eax,[bb+4]
xor eax, 32
bsf eax,[bb]
}
}
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.