Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: FirstOne/LastOne

Author: Gerd Isenberg

Date: 12:25:03 06/12/04

Go up one level in this thread


On June 12, 2004 at 14:21:07, Russell Reagan wrote:

>On June 12, 2004 at 13:09:30, Gian-Carlo Pascutto wrote:
>
>>On June 12, 2004 at 13:05:14, Sergei S. Markoff wrote:
>>
>>>But the result isn't impressive. This works slower than previous version. I have
>>>Athlon-1700XP+ and Intel C/C++ v6 compiler (v7 produces slower code).
>>>
>>>What was wrong?
>>
>>bsf and consorts are slow on Athlons.
>>
>>The best thing seems to be one of the "Magic bitscans".
>>
>>const BITBOARD magic = 0x03f08c5392f756cdULL;
>>
>>const unsigned int magictable[64] = {
>>0, 1, 12, 2, 13, 22, 17, 3,
>>14, 33, 23, 36, 18, 58, 28, 4,
>>62, 15, 34, 26, 24, 48, 50, 37,
>>19, 55, 59, 52, 29, 44, 39, 5,
>>63, 11, 21, 16, 32, 35, 57, 27,
>>61, 25, 47, 49, 54, 51, 43, 38,
>>10, 20, 31, 56, 60, 46, 53, 42,
>>9, 30, 45, 41, 8, 40, 7, 6,
>>};
>>
>>unsigned int FindFirst (const BITBOARD b) {
>> const BITBOARD lsb = b & -b;
>> return magictable[(lsb * magic) >> 58];
>>}
>>
>>--
>>GCP
>
>I found this one, by Eugene Nalimov, to be the fastest on my 32-bit Athlon.
>
>int FirstOne (Bitboard arg) {
>    int result = 0;
>
>    if (arg > 0xFFFFFFFF) {
>        arg >>= 32;
>        result = 32;
>    }
>
>    if (arg > 0xFFFF) {
>        arg >>= 16;
>        result += 16;
>    }
>
>    if (arg > 0xFF) {
>        arg >>= 8;
>        result += 8;
>    }
>
>    return result + table8[arg];
>}
>
>table8[] is an 8-bit lookup table (256 elements) that returns the 'first one'
>for an 8-bit value. You can also try using a 16-bit lookup table and get rid of
>the last if-statement.


Eugene's firstOne is fine for Sergei's bsr-firstOne, while Gian-Carlo's
64-bit De Bruijn mul is ok for Sergei's bsf-lastOne.

In my current code it most often doesn't matter whether i traverse bitboards
firstwise or lastwise, best results with bsf (see below).

Im not sure for x86-32 whether the 64-bit mul is faster than folded 32-bit mul
by Matt Taylor or Walter Faxon's magic bitscan:

const UINT32 LSB_64_table[154] = // or BYTE
    {
#define __  0
    23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11,
     1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__,
    __,__,__,13,__,__,__,__,__,__,  5,__,__,61,__,__,__,__,__,__,
    __,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__,
    __,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__,
    29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36,
    __,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15,
    __,50,48,49,__,51, 7,__,__,63, __,__,__,55
#undef __
    };

#define LSB_64_adj  -51                         // offset to table base
#define LSB_64_magic  ( (UINT32)0x01C5FC81 )       // magic constant

UINT32 WalterFaxonsMagicBitscan(BitBoard &bb)
{
    BitBoard t64;
    UINT32 t32;
    t64 = bb - 1;
    bb &= t64;  // omit this line to retain current LSB
    t64 ^= bb;
    t32 = (UINT32)t64 ^ (UINT32)(t64 >> 32);
    t32 ^= LSB_64_magic;
    t32 += t32 >> 16;
    t32 -= t32 >> 8;
    return LSB_64_table [LSB_64_adj + t32];
}


const unsigned int lsz64_tbl[154] =
{
 63,30, 3,32,59,14,11,33,
 60,24,50, 9,55,19,21,34,
 61,29, 2,53,51,23,41,18,
 56,28, 1,43,46,27, 0,35,
 62,31,58, 4, 5,49,54, 6,
 15,52,12,40, 7,42,45,16,
 25,57,48,13,10,39, 8,44,
 20,47,38,22,17,37,36,26,
};

unsigned int MattTaylorsLastOne(BitBoard &bb)
{
    BitBoard lsb = bb ^ (bb - 1);
    bb &= ~lsb;  // omit this line to retain current LSB
    unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
    return lsz64_tbl[foldedLSB * 0x78291ACF >> 26];
}

What is finally the fastest may vary from program to program, whether how often
it is used and inlined. For me (Athlon XP2.8) bsf/bsr are still fastest in that
way:

unsigned int bitScanAndReset(BitBoard &bb)
{
  __asm
  {
    mov    ebx, [bb]
    xor    eax, eax
    mov	   edx, 1
    bsf	   ecx, [ebx]
    jnz	   found
    bsf	   ecx, [ebx + 4]
    lea	   ebx, [ebx + 4]
    xor	   eax, 32
found:
    shl	   edx, cl
    xor    eax, ecx
    xor    [ebx], edx
  }
}

or in this "dirty" ones without reset found bit:

unsigned int bitScanForeward(BitBoard bb) // last one
{
  __asm
  {
    bsf    eax, [bb+4]
    xor    eax, 32
    bsf    eax, [bb]
  }
}

unsigned int bitScanBackward(BitBoard bb) // first one
{
  __asm
  {
    xor    edx, edx
    bsr    eax, [bb]
    bsr    eax, [bb+4]
    setnz  dl
    shl    edx, 5
    add	   eax, edx
  }
}

Similar may true for AMD64. Single 64-bit bsf instruction versus 64-bit de
Bruijn multiplication. Single 64-bit bsr instruction versus Eugene's routine.

64-bit bsf/bsr are vector path instructions, nine cycles each.

Gerd




This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.