Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: The need to unmake move

Author: Gerd Isenberg
Date: 12:15:22 08/20/03
On August 20, 2003 at 13:30:02, Eugene Nalimov wrote:

>On August 20, 2003 at 03:59:38, Johan de Koning wrote:
>
>>On August 19, 2003 at 22:11:14, Robert Hyatt wrote:
>>
>>>On August 19, 2003 at 20:06:58, Mathieu Pagé wrote:
>>>
>>>>Hi,
>>>>
>>>>The fact:
>>>>
>>>>I have this question i read at some place that it is faster to unmake a move
>>>>than to save the state of the game before moving then restoring it when we want
>>>>to unmake the move.
>>>>
>>>>For the moment my engines did not implement unmake() (it is still buggy).
>>>>
>>>>My thougth:
>>>>
>>>>Since bitboard computation are slow (on 32 hardware) i think that it can be
>>>>slower to unmake the move than to save the state. I friend of me that is lot
>>>>better than me at optimizing code also think that.
>>>>
>>>>My questions:
>>>>
>>>>Are you all using unmake() function or there is some of you that found that
>>>>saving the state is better ?
>>>
>>>
>>>
>>>read the comments from Crafty in main.c.  I started out using what is
>>>commonly called "copy/make" as that worked well in Cray Blitz.  But it
>>>didn't work well in the PC.  The PC has very limited memory bandwidth,
>>>when you compare the speed of memory to the speed/demands of current
>>>processors.  If you keep the board in cache, and update it there, it is
>>>more efficient than to copy it from real memory to real memory...
>>
>>I hate to play Vincent here, but real memory is not an issue.
>>
>>If you manage to keep the deepest few plies worth of position structs in L1
>>cache, then bandwith is pretty decent on the PC. And it has been ever since them
>>PCs were endowed with cache.
>>
>>Copying a struct does take time, and it can easily be pinpointed. Saving and
>>restoring and unupdating also takes time, but is harder to identify. Especially
>>since the stress on code cache and branch prediction don't show up in a run time
>>profile.
>>
>>... Johan
>
>Modern high-speed CPUs (PIII/P4/Athlon/AMD64/Itanium2, probably others, as I
>don't know their microarchitecture) are very memcpy-unfriendly. Search for
>"store forwarding" in your Pentium documentation, or use any search engine and
>search the Net.
>
>Very briefly: if you store some data (e.g. 32 bits), and then load only part of
>it (e.g. 8 bits), CPU will stall and wait till the data goes into cache, and
>then will reload it from there. As store buffers work in order that means that
>CPU will first flush earlier data into cache, and that can take quite some time
>because CPU can have up to 24 stores in flight...
>
>Thanks,
>Eugene

A typical rotated bitboard update is four rotated, one of two colors and one of
six pieces. So only 6/12 to write, not considering hashkey updates. But isn't
writing some consecutive cachelines rather independent from some write gaps?
May be the additional reads from node[ply-1] are more expensive in the
copy-domove approach?

Anyway, rather than copy with immediate update the previuosly written data
(reading and writing again), one may try to combine copying with update, e.g. by
reading into mmx-register some xors and then writing mmx to target.

If kind of move is coded in that way, that each move with piece(6+double pawn
push) and each capture with piece(6) takes other piece(5), promotions(4), each
combination of capture promotions (4*5), ep(1) and castles(2) are a consecutive
enumerated range of 7+30+4+20+3 = 64, one need only one inderect jump.

In any case one should avoid loading only a part of fresh written data. Take it
all and mask parts in a register. In this context following paragraphs may be
interesting:

from

Software Optimization
Guide for AMD Athlon™ 64 and
AMD Opteron™ Processors
------------------------------------------------------------------------------
Page 123
Chapter 5 Cache and Memory Optimizations

5.4 Store-to-Load Forwarding Restrictions

Store-to-load forwarding refers to the process of a load reading (forwarding)
data from the store buffer. When this can occur, it improves performance because
the load does not have to wait for the recently written (stored) data to be
written to cache and then read back out again. There are instances in the
load-store architecture of the AMD Athlon 64 and AMD Opteron processors when a
load operation is not allowed to read needed data from a store in the store
buffer.

In these cases, the load cannot complete (load the needed data into a register)
until the store has retired out of the store buffer and written to the data
cache. A store-buffer entry cannot retire and write to the data cache until
every instruction before the store has completed and retired from the reorder
buffer.

The implication of this restriction is that all instructions in the reorder
buffer, up to and including thestore, must complete and retire out of the
reorder buffer before the load can complete. Effectively, the load has a false
dependency on every instruction up to the store. Due to the significant depth of
the LS buffer of the AMD Athlon 64 and AMD Opteron processors, any load that is
dependent on a store that cannot bypass data through the LS buffer may
experience significant delays of up to tens of clock cycles, where the exact
delay is a function of pipeline conditions.

The following sections describe store-to-load forwarding examples.

Store-to-Load Forwarding Pitfalls—True Dependencies

A load is allowed to read data from the store-buffer entry only if all of the
following conditions are satisfied:
• The start address of the load matches the start address of the store.
• The load operand size is equal to or smaller than the store operand size.
• Neither the load nor the store is misaligned.
• The store data is not from a high-byte register (AH, BH, CH, or DH).


.....
------------------------------------------------------------------------------

Page 236

Chapter 9 Optimizing with SIMD Instructions

9.4 Avoid Moving Data Directly Between
General-Purpose and MMX™ Registers

Optimization

Avoid moving data directly between general-purpose registers and MMX™ registers;
this operation requires the use of the MOVD instruction. If it’s absolutely
necessary to move data between these two types of registers, use separate store
and load instructions to move the data from the source register to a temporary
location in memory and then from memory into the destination register,
separating the store and the load by at least 10 instructions.

------------------------------------------------------------------------------

amazing... at least 10 instructions!

Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.