Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: extracting bit from bitboard slow on P4 in C

Author: Robert Hyatt

Date: 15:20:05 11/18/02

Go up one level in this thread


On November 18, 2002 at 14:49:40, Vincent Diepeveen wrote:

>On November 18, 2002 at 13:42:32, Robert Hyatt wrote:
>
>The P4 needs in theory 2 cycles each shift. If my math was just 0.001% off
>there, then Nalimov would already have posted it i'm sure of that :)

Your math is off.  The PIV reference manual _clearly_ states that latency for
the
shift instructions is 4, unless you use the CL register which runs it up a bit
longer.

But four is the number in the intel reference for the PIV that I have here...
and those
refer to clock cycles of the cpu core...  which it says "is a maximum of four"
for
immediate-8 values for the shift count..

This book can be downloaded from Intel's web site if you want to see it in black
and white...


>
>In fact he wants to do without shift instructions that's how matters are
>here. But it says something on how important compilers are for crappy
>processors!
>
>>On November 18, 2002 at 11:02:48, Vincent Diepeveen wrote:
>>
>>>On November 18, 2002 at 11:01:28, Robert Hyatt wrote:
>>>
>>>>On November 18, 2002 at 10:20:49, Vincent Diepeveen wrote:
>>>>
>>>>>On November 17, 2002 at 13:30:22, Eugene Nalimov wrote:
>>>>>
>>>>>6 compares   times 0.5 cycles ==>  3 cycles
>>>>>5 shiftlefts times 2   cycles ==> 10  cycles
>>>>>
>>>>>So it is like 13 cycles to get a single bit out of a bitboard in
>>>>>this C code is very slow on the P4. 13 cycles is hell of a lot,
>>>>>i can generate several moves in non-bitboards within that time!
>>>>
>>>>Unless something has changed, the P4 is clocked 2x internally.  A shift should
>>>>take 1/2 clock cycle if I am reading their manual correctly...
>>>
>>>No it takes 2 cycles. The shift is hell slower than other operations
>>>like ADD.
>>
>>My intel reference for the PIV says "up to 4 internal clock cycles (four
>>micro-ops)" of
>>latency.  A shift instruction can execute significantly faster than this
>>depending on what
>>else is mixed into the micro-op queue."
>>
>>But, knowing that, the above code can be handled without shifts anyway so it is
>>really
>>a moot point...
>>
>>
>>>
>>>>That turns this into 5.5 cycles which seems more reasonable.  No reason a shiftl
>>>>would be that much slower than a compare...  And your counting doesn't make any
>>>>sense to me anyway as 6 compares by themselves won't do a thing.  it would
>>>>require
>>>>jumps or cmov's or something...
>>>>
>>>>
>>>>>
>>>>>>On November 17, 2002 at 13:25:55, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On November 17, 2002 at 12:01:02, Gerd Isenberg wrote:
>>>>>>>
>>>>>>>>On November 17, 2002 at 10:53:05, Joel wrote:
>>>>>>>>
>>>>>>>>>Hey All,
>>>>>>>>>
>>>>>>>>>I am a 2nd year Uni student from Australia who has recently gotten into chess
>>>>>>>>>programming. My first attempt was a simple array-based alpha-beta variant which
>>>>>>>>>struggled to search more than 6 levels deep in most positions! I think that
>>>>>>>>>might have something to do with the fact that there was no move ordering,
>>>>>>>>>transposition table, an expensive evaluation function, no killer moves and weak
>>>>>>>>>coding :)
>>>>>>>>>
>>>>>>>>>I have been working on my second attempt for some time now. It uses Bitboards. I
>>>>>>>>>have a few questions regarding move generation.
>>>>>>>>>
>>>>>>>>>It seems to me that the performance of the Bitboard approach relies somewhat
>>>>>>>>>heavily on how fast you can retrieve the position of a 1 bit within a 64-bit
>>>>>>>>>unsigned integer. I looked for sometime on the Internet for some kind of
>>>>>>>>>magical, hacky solution to this dilemna, and the best I could find was this (b &
>>>>>>>>>-b) trick which I used in a debatedly clever way. I was just wondering if there
>>>>>>>>>is any approach significantly better than the one which I will outline below:
>>>>>>>>>
>>>>>>>>>1. (b & -b) to clear all 1 bit's except for one.
>>>>>>>>>2. get this value, mod it by 67 (which has the property that every possible
>>>>>>>>>   value returned is unique, thus i can hash to the position of the bit in the
>>>>>>>>>   64 bit integer.)
>>>>>>>>>
>>>>>>>>>I am no expert, but it doesn't seem too ineffecient to me. Any problems?
>>>>>>>>>
>>>>>>>>>Also, if there are any improvements, I would prefer to find out about the ones
>>>>>>>>>which do not involve assembly coding - I do not want to make my program too
>>>>>>>>>dependant on any one CPU architecture at this stage.
>>>>>>>>>
>>>>>>>>>Thanks for your time,
>>>>>>>>>Joel
>>>>>>>>
>>>>>>>>Hi Joel,
>>>>>>>>
>>>>>>>>nice idea with the mod 67 to get unique bitvalues. But i fear a 64 bit div/mod
>>>>>>>>operation is too slow, even with 64bit-processors. If you don't want to use
>>>>>>>>assembler eg. intels x86 bsf (bit scan foreward), i think using a lookup table
>>>>>>>>indexed by the byte or 16-bit word is most common.
>>>>>>>>
>>>>>>>>I use this approach:
>>>>>>>>
>>>>>>>>#define LOWBOARD(bb) (*((UINT32*)&(bb)))
>>>>>>>>#define HIGHBOARD(bb) (*(((UINT32*)&(bb))+1))
>>>>>>>>
>>>>>>>>// precondition: bb not null
>>>>>>>>__forceinline UINT BitSearch(BitBoard bb)
>>>>>>>>{
>>>>>>>>        ASSERT(bb != 0);
>>>>>>>>#ifdef	_M_IX86
>>>>>>>>	__asm
>>>>>>>>	{
>>>>>>>>		bsf	eax,[bb+4]
>>>>>>>>		xor	eax, 32
>>>>>>>>		bsf	eax,[bb]
>>>>>>>>	}
>>>>>>>>#else
>>>>>>>>	BitBoard lsbb = bb & (-(__int64)bb);
>>>>>>>>	UINT32 lsb = LOWBOARD(lsbb) | HIGHBOARD(lsbb);
>>>>>>>>	return ((((((((((HIGHBOARD(lsbb)!=0) <<1)
>>>>>>>>			+((lsb & 0xffff0000)!=0))<<1)
>>>>>>>>			+((lsb & 0xff00ff00)!=0))<<1)
>>>>>>>>			+((lsb & 0xf0f0f0f0)!=0))<<1)
>>>>>>>>			+((lsb & 0xcccccccc)!=0))<<1)
>>>>>>>>			+((lsb & 0xaaaaaaaa)!=0);
>>>>>>>>#endif
>>>>>>>>}
>>>>>>>>
>>>>>>>>Regards,
>>>>>>>>Gerd
>>>>>>>
>>>>>>>Isn't shifting on the P4 *WAY* slower than his mod 67 idea is?
>>>>>>
>>>>>>No.
>>>>>>
>>>>>>Thanks,
>>>>>>Eugene



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.