Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: BitScan with reset - not so impressive with 3DNow!

Author: Matt Taylor

Date: 06:46:27 12/03/02

Go up one level in this thread


On December 02, 2002 at 19:43:15, Gerd Isenberg wrote:

>On December 02, 2002 at 18:49:10, Walter Faxon wrote:
>
>>On December 02, 2002 at 16:48:52, Gerd Isenberg wrote:
>>
>>>May be i found not the best way to to b&(-b) with mmx, but building the 64-bit
>>>two's complement with mmx-dword is not so nice. First you have to do the one's
>>>complement by pxor -1, then comparing low dword with -1 and building an
>>>conditional overflow, adding 00:01 or 01:01...
>>>So using 32-bit registers was the fastest so far, but that may require some
>>>additional push/pop.
>>>
>>>Some times mesured in seconds with this dumb loop (nothing inlined):
>>>K7XP2.1+ ~1.8GHz
>>>
>
>Congrats, Walter!!
>
>10-bit pattern		bsf	PI2FD	btr	c      LSB_64
>0x0000000011111133	15.3	18.0	19.1	22.8    17.8
>0x1010111010101110	19.7	18.5	19.6	23.4    17.8
>0x1111113300000000	20.6	18.0	19.1	22.8    17.8
>
>>>
>>>inlined are ~5 seconds faster
>>Hi, Gerd.
>>
>>As long as you're test-comparing bit-search-and-reset codes, I wonder if you
>>could please consider also comparing my C code version, posted on CCC with the
>>subject "Another hacky method for bitboard bit extraction" on November 17.  I
>>repeat it below.  You can of course make those changes required for proper
>>compilation and comparison in your setup.  Thanks!
>>
>>-- Walter
>>
>>Code follows:
>>// ---------------------------------------------------------------------------
>>
>>typedef unsigned long long  u64;    // nonstandard
>>typedef unsigned long       u32;
>>typedef unsigned char       u8;
>>
>>extern const u8 LSB_64_table[154];              // bit number table
>>#define LSB_64_adj  -51                         // offset to table base
>>#define LSB_64_magic  ( (u32)0x01C5FC81 )       // magic constant
>>
>>// ---------------------------------------------------------------------------
>>// LSB_64() -- find, remove, report least-significant bit of 64.
>>// Argument 'bb' must be non-null.  Method:  fold then table lookup.
>>// Written by Walter Faxon, June 2002.  No copyright.  No warranty.
>>//
>>inline                  // inline declaration may differ by compiler
>>u8 LSB_64( u64* bb )
>>    {
>>    u64 t64;
>>    u32 t32;
>>    t64 = *bb - 1;
>>    *bb &= t64;         // omit this line to retain current LSB
>>    t64 ^= *bb;
>>    t32 = (u32)t64 ^ (u32)(t64 >> 32);
>>    t32 ^= LSB_64_magic;
>>    t32 += t32 >> 16;
>>    t32 -= t32 >> 8;
>>    return LSB_64_table [LSB_64_adj + (u8)t32];
>>    }
>>
>>// ---------------------------------------------------------------------------
>>// Table reports number of low-order bit as 0, high-order as 63.
>>// (Numbering can be reversed by changing this table.)
>>// Important:  arrange storage so that this table is kept in the cache.
>>const u8 LSB_64_table[154] =
>>    {
>>#define __  0
>>    23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11,
>>     1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__,
>>    __,__,__,13,__,__,__,__,__,__,  5,__,__,61,__,__,__,__,__,__,
>>    __,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__,
>>    __,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__,
>>    29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36,
>>    __,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15,
>>    __,50,48,49,__,51, 7,__,__,63, __,__,__,55
>>#undef __
>>    };
>>
>>//eof
>>
>>P.S.  You can even avoid the table lookup if you're willing to deal with
>>scrambled square indecies in the range 0 to 153.
>>
>>P.P.S.  I feel a little like the unlucky scientist whose results were lost
>>because they were only published _four_ times... :)
>
>Geniously, sorry Walter, that i overlooked your approach. It is clearly the
>fastest of these routines. De Bruijn Sequence? May have a close look to the
>constant.
>
>I'm really impressed.
>
>Regards,
>Gerd

I measured the time (in clocks) to find all 10 bits on my AthlonMP 1600:

10-bit pattern		bsf	PI2FD	btr	LSB/VC7	LSB/Me
0x0000000011111133	659	919	779	699	699
0x1010111010101110	722-761	919	779	699	699
0x1111113300000000	729	919	779	699	699

The first three routines were taken instruction-for-instruction from this
thread. The second-to-last is the C version optimized by the VC 7 compiler. The
last is a hand-tweaked version I wrote, very basically optimized (lots of
register contention and partial stalls). I can post specific assembly listings
if someone is -that- interested. I didn't include the other C routine given by
Gerd.

The bsf instruction varied quite a bit in execution time. (The second pattern
varied significantly in execution time over multiple trials, strangely
enough...) All the other times are +/- 1-3 clocks, but I would only take the
numbers accurate to about 10 clocks.

I think it's fairly clear who the winner is, though.
Very slick, Walter.



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.