Author: Gerd Isenberg
Date: 12:25:03 06/12/04
Go up one level in this thread
On June 12, 2004 at 14:21:07, Russell Reagan wrote:
>On June 12, 2004 at 13:09:30, Gian-Carlo Pascutto wrote:
>
>>On June 12, 2004 at 13:05:14, Sergei S. Markoff wrote:
>>
>>>But the result isn't impressive. This works slower than previous version. I have
>>>Athlon-1700XP+ and Intel C/C++ v6 compiler (v7 produces slower code).
>>>
>>>What was wrong?
>>
>>bsf and consorts are slow on Athlons.
>>
>>The best thing seems to be one of the "Magic bitscans".
>>
>>const BITBOARD magic = 0x03f08c5392f756cdULL;
>>
>>const unsigned int magictable[64] = {
>>0, 1, 12, 2, 13, 22, 17, 3,
>>14, 33, 23, 36, 18, 58, 28, 4,
>>62, 15, 34, 26, 24, 48, 50, 37,
>>19, 55, 59, 52, 29, 44, 39, 5,
>>63, 11, 21, 16, 32, 35, 57, 27,
>>61, 25, 47, 49, 54, 51, 43, 38,
>>10, 20, 31, 56, 60, 46, 53, 42,
>>9, 30, 45, 41, 8, 40, 7, 6,
>>};
>>
>>unsigned int FindFirst (const BITBOARD b) {
>> const BITBOARD lsb = b & -b;
>> return magictable[(lsb * magic) >> 58];
>>}
>>
>>--
>>GCP
>
>I found this one, by Eugene Nalimov, to be the fastest on my 32-bit Athlon.
>
>int FirstOne (Bitboard arg) {
> int result = 0;
>
> if (arg > 0xFFFFFFFF) {
> arg >>= 32;
> result = 32;
> }
>
> if (arg > 0xFFFF) {
> arg >>= 16;
> result += 16;
> }
>
> if (arg > 0xFF) {
> arg >>= 8;
> result += 8;
> }
>
> return result + table8[arg];
>}
>
>table8[] is an 8-bit lookup table (256 elements) that returns the 'first one'
>for an 8-bit value. You can also try using a 16-bit lookup table and get rid of
>the last if-statement.
Eugene's firstOne is fine for Sergei's bsr-firstOne, while Gian-Carlo's
64-bit De Bruijn mul is ok for Sergei's bsf-lastOne.
In my current code it most often doesn't matter whether i traverse bitboards
firstwise or lastwise, best results with bsf (see below).
Im not sure for x86-32 whether the 64-bit mul is faster than folded 32-bit mul
by Matt Taylor or Walter Faxon's magic bitscan:
const UINT32 LSB_64_table[154] = // or BYTE
{
#define __ 0
23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11,
1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__,
__,__,__,13,__,__,__,__,__,__, 5,__,__,61,__,__,__,__,__,__,
__,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__,
__,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__,
29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36,
__,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15,
__,50,48,49,__,51, 7,__,__,63, __,__,__,55
#undef __
};
#define LSB_64_adj -51 // offset to table base
#define LSB_64_magic ( (UINT32)0x01C5FC81 ) // magic constant
UINT32 WalterFaxonsMagicBitscan(BitBoard &bb)
{
BitBoard t64;
UINT32 t32;
t64 = bb - 1;
bb &= t64; // omit this line to retain current LSB
t64 ^= bb;
t32 = (UINT32)t64 ^ (UINT32)(t64 >> 32);
t32 ^= LSB_64_magic;
t32 += t32 >> 16;
t32 -= t32 >> 8;
return LSB_64_table [LSB_64_adj + t32];
}
const unsigned int lsz64_tbl[154] =
{
63,30, 3,32,59,14,11,33,
60,24,50, 9,55,19,21,34,
61,29, 2,53,51,23,41,18,
56,28, 1,43,46,27, 0,35,
62,31,58, 4, 5,49,54, 6,
15,52,12,40, 7,42,45,16,
25,57,48,13,10,39, 8,44,
20,47,38,22,17,37,36,26,
};
unsigned int MattTaylorsLastOne(BitBoard &bb)
{
BitBoard lsb = bb ^ (bb - 1);
bb &= ~lsb; // omit this line to retain current LSB
unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
return lsz64_tbl[foldedLSB * 0x78291ACF >> 26];
}
What is finally the fastest may vary from program to program, whether how often
it is used and inlined. For me (Athlon XP2.8) bsf/bsr are still fastest in that
way:
unsigned int bitScanAndReset(BitBoard &bb)
{
__asm
{
mov ebx, [bb]
xor eax, eax
mov edx, 1
bsf ecx, [ebx]
jnz found
bsf ecx, [ebx + 4]
lea ebx, [ebx + 4]
xor eax, 32
found:
shl edx, cl
xor eax, ecx
xor [ebx], edx
}
}
or in this "dirty" ones without reset found bit:
unsigned int bitScanForeward(BitBoard bb) // last one
{
__asm
{
bsf eax, [bb+4]
xor eax, 32
bsf eax, [bb]
}
}
unsigned int bitScanBackward(BitBoard bb) // first one
{
__asm
{
xor edx, edx
bsr eax, [bb]
bsr eax, [bb+4]
setnz dl
shl edx, 5
add eax, edx
}
}
Similar may true for AMD64. Single 64-bit bsf instruction versus 64-bit de
Bruijn multiplication. Single 64-bit bsr instruction versus Eugene's routine.
64-bit bsf/bsr are vector path instructions, nine cycles each.
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.