Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: 0x88 and move generator speed

Author: Vincent Diepeveen
Date: 19:53:39 02/01/01
On February 01, 2001 at 22:15:50, Larry Griffiths wrote:

>On February 01, 2001 at 21:24:13, Vincent Diepeveen wrote:
>
>>On February 01, 2001 at 07:12:07, Dan Newman wrote:
>>
>>>On February 01, 2001 at 03:36:08, David Rasmussen wrote:
>>>
>>>>On January 31, 2001 at 07:15:59, Dan Newman wrote:
>>>>
>>>>>On January 30, 2001 at 06:26:17, David Rasmussen wrote:
>>>>>
>>>>>It just means I don't use rotated bitboards.  I have a bitboard for each
>>>>>different piece type (12 bitboards) + occupied square bitboards for each
>>>>>color (2) + an occupied square bitboard (1).  This means make() and undo()
>>>>>are a bit cheaper since I don't have to update rotated bitboards.  Also,
>>>>>there are a few large data structures (arrays of bitboards) that aren't
>>>>>needed as well--so fewer memory hits.
>>>>>
>>>>>I treat the non-sliding pieces more or less like any other bitboard program,
>>>>>but for sliding pieces I occasionally resort to scanning the board.  I guess
>>>>>it might be thought of as a sort of hybrid of bitboard and mailbox.
>>>>>
>>>>>I suspect (but don't have any hard data) that this is cheaper than doing
>>>>>rotated bitboards, and it's also much simpler to implement.
>>>>>
>>>>>OTOH, I think Bas (Hamstra) may have switched to rotated bitboards and
>>>>>found them to be faster...
>>>>>
>>>>>-Dan.
>>>>
>>>>OK, so essentially you don't have any smart way of calculating the attackboard
>>>>of sliding pieces??
>>>>
>>>>I mean, the normal method of extracting the state of the file, rank or >diagonal,
>>>>and using this to quickly calculate an attackboard, cannot be used in a
>>>>non-rotated bitboard design.
>>>>
>>>>Isn't this just worse in all cases compared to rotated bitboards (except for >the
>>>>simpler design maybe)?
>>>
>>>I calculate a "pseudo-attack" bitboard for sliders.  (That's a bitboard
>>>that has a bit turned on for each enemy piece on which the slider is
>>>bearing, ignoring any blocking pieces.)  Then I start extracting the
>>>coordinates of the "victims".  For each such coordinate I test to see if
>>>any pieces are blocking the attack using a blocking-mask bitboard which
>>>has all the bits turned on for squares that lie between the attacker and
>>>victim.  If no blockers are there, we have a real victim.
>>>
>>>I don't think this differs much in cost from rotated bitboards.  In
>>>some cases it's probably faster (sparse board), in others it's probably
>>>slower than rotated bitboards.  Even if rotated bitboards happen to be a
>>>little faster generating captures, they will certainly slow down make()
>>>and undo().
>>>
>>>I don't think that it's really easy to figure out which is better, either by
>>>reason or by experiment.  You can't just count instructions executed--there
>>>are memory/cache effects to be considered.  Also, the quality of the
>>>implementations will vary, and that is likely to have an even bigger effect.
>>>So I might add rotated bitboards and find it slows me down--only because of
>>>poor design.
>>>
>>>-Dan.
>>
>>Could you post please your LastOne() function
>>in assembly or if it is in C and also write down what
>>processor you use?
>>
>>Of course only if it's not the same as crafty, if it is
>>the same i don't see how you can ignore the macro that
>>eats most system time of the entire generation!
>
>Have a look at mine, Vincent.
>label is just a parm so that I can use the code several times in my program.
>Count is where the last moves were added to the move list.
>bb1 is a bitboard of moves or captures.

is that a 32 bits result bitboard with captures for a certain
piece?
I see you load it in ESI (correct me if i'm wrong,
but that's a 32 bits register isn't it?), how to scan
the upper 32 bits of the bitboard?

>output is where a capture or non-capture movelist starts.

in the memory i guess?

>I also have a MoveModel that has FromPieceType, FromSquare, and Flags filled in
>before using this macro.
>tcmovesize is defined as 16 which is the size of each of my movelist entries.
>The code basically loads the movemodel into a MMX register, then uses the "bsf"
>instruction to get a square number from a capture or move bitboard.  This is
>or'ed with the MMX register containing the move model and then stored in the
>move list.  I have a seperate extract for Pawns that adds a value to the
>extracted square to get the "from" square.  The From and To squares are then
>or'ed with the MoveModel and stored into a capture/move list.
>Larry.
>
>
>
>#define defbbExtractAndOrMoveModel(label,Count,bb1,output)\
>	{\
>	_ESI = (unsigned long)bb1;\
>	_EDX = (unsigned long)output;\
>	asm	mov	ECX,[Count];\
>	asm	mov	EDI,dword ptr [ESI];\
>	asm	shl	ECX,4;\
>	asm	mov	ESI,dword ptr [ESI+4];\
>	asm	movq	mm0,qword ptr [bbTCMoveModel];\
>	asm	movq	mm2,qword ptr [bbTCMoveModel+8];\
>	asm	label##loop1:;\
>	asm	bsf	EAX,EDI;\
>	asm	jz		label##loop2;\
>	asm	btr	EDI,EAX;\
>	asm	movd	mm1,EAX;\
>	asm	psllq	mm1,32;\
>	asm	por	mm1,mm0;\
>	asm	movq	[EDX+ECX],mm1;\
>	asm	movq	[EDX+ECX+8],mm2;\
>	asm	add	ECX,tcmovesize;\
>	asm	jmp	label##loop1;\
>	asm	label##loop2:;\
>	asm	bsf	EAX,ESI;\
>	asm	jz		label##Done;\
>	asm	btr	ESI,EAX;\
>	asm	add	EAX,32;\
>	asm	movd	mm1,EAX;\
>	asm	psllq	mm1,32;\
>	asm	por	mm1,mm0;\
>	asm	movq	[EDX+ECX],mm1;\
>	asm	movq	[EDX+ECX+8],mm2;\
>	asm	add	ECX,tcmovesize;\
>	asm	jmp	label##loop2;\
>	asm	label##Done:;\
>	asm	shr	ECX,4;\
>	asm	mov	[Count],ECX;\
>	}

Hope you also detect whether you run on a P4 or not,
because on P4 the MMX instructions are that pathetic
slow executing each that it's not longer smart to use MMX.

But i don't see the speedup when compared to crafty when
talking about branches here. You use on average 2 JZ instructions
for each move. Let's do a very non-educated guess and estimate
that from each 2 JZ instructions at least 1 is mispredicted.

That means that you lose in number of clocks (at least) on the
  P3 : 10
  P4 : 20

So you waste 10 CLOCKS on a P3!!!!!!!
Just to get a single instruction out of a bitboard!
Added to that is a lot of extra overhead also, as you use
a lot of complex instructions. psllq i need to figure out what
it does but perhaps you can already post that!.
Re: 0x88 and move generator speed Severi Salminen 03:59:03 02/02/01
- Re: 0x88 and move generator speed Vincent Diepeveen 18:33:25 02/02/01
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.