Author: Gerd Isenberg
Date: 12:15:22 08/20/03
Go up one level in this thread
On August 20, 2003 at 13:30:02, Eugene Nalimov wrote: >On August 20, 2003 at 03:59:38, Johan de Koning wrote: > >>On August 19, 2003 at 22:11:14, Robert Hyatt wrote: >> >>>On August 19, 2003 at 20:06:58, Mathieu Pagé wrote: >>> >>>>Hi, >>>> >>>>The fact: >>>> >>>>I have this question i read at some place that it is faster to unmake a move >>>>than to save the state of the game before moving then restoring it when we want >>>>to unmake the move. >>>> >>>>For the moment my engines did not implement unmake() (it is still buggy). >>>> >>>>My thougth: >>>> >>>>Since bitboard computation are slow (on 32 hardware) i think that it can be >>>>slower to unmake the move than to save the state. I friend of me that is lot >>>>better than me at optimizing code also think that. >>>> >>>>My questions: >>>> >>>>Are you all using unmake() function or there is some of you that found that >>>>saving the state is better ? >>> >>> >>> >>>read the comments from Crafty in main.c. I started out using what is >>>commonly called "copy/make" as that worked well in Cray Blitz. But it >>>didn't work well in the PC. The PC has very limited memory bandwidth, >>>when you compare the speed of memory to the speed/demands of current >>>processors. If you keep the board in cache, and update it there, it is >>>more efficient than to copy it from real memory to real memory... >> >>I hate to play Vincent here, but real memory is not an issue. >> >>If you manage to keep the deepest few plies worth of position structs in L1 >>cache, then bandwith is pretty decent on the PC. And it has been ever since them >>PCs were endowed with cache. >> >>Copying a struct does take time, and it can easily be pinpointed. Saving and >>restoring and unupdating also takes time, but is harder to identify. Especially >>since the stress on code cache and branch prediction don't show up in a run time >>profile. >> >>... Johan > >Modern high-speed CPUs (PIII/P4/Athlon/AMD64/Itanium2, probably others, as I >don't know their microarchitecture) are very memcpy-unfriendly. Search for >"store forwarding" in your Pentium documentation, or use any search engine and >search the Net. > >Very briefly: if you store some data (e.g. 32 bits), and then load only part of >it (e.g. 8 bits), CPU will stall and wait till the data goes into cache, and >then will reload it from there. As store buffers work in order that means that >CPU will first flush earlier data into cache, and that can take quite some time >because CPU can have up to 24 stores in flight... > >Thanks, >Eugene A typical rotated bitboard update is four rotated, one of two colors and one of six pieces. So only 6/12 to write, not considering hashkey updates. But isn't writing some consecutive cachelines rather independent from some write gaps? May be the additional reads from node[ply-1] are more expensive in the copy-domove approach? Anyway, rather than copy with immediate update the previuosly written data (reading and writing again), one may try to combine copying with update, e.g. by reading into mmx-register some xors and then writing mmx to target. If kind of move is coded in that way, that each move with piece(6+double pawn push) and each capture with piece(6) takes other piece(5), promotions(4), each combination of capture promotions (4*5), ep(1) and castles(2) are a consecutive enumerated range of 7+30+4+20+3 = 64, one need only one inderect jump. In any case one should avoid loading only a part of fresh written data. Take it all and mask parts in a register. In this context following paragraphs may be interesting: from Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors ------------------------------------------------------------------------------ Page 123 Chapter 5 Cache and Memory Optimizations 5.4 Store-to-Load Forwarding Restrictions Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer. When this can occur, it improves performance because the load does not have to wait for the recently written (stored) data to be written to cache and then read back out again. There are instances in the load-store architecture of the AMD Athlon 64 and AMD Opteron processors when a load operation is not allowed to read needed data from a store in the store buffer. In these cases, the load cannot complete (load the needed data into a register) until the store has retired out of the store buffer and written to the data cache. A store-buffer entry cannot retire and write to the data cache until every instruction before the store has completed and retired from the reorder buffer. The implication of this restriction is that all instructions in the reorder buffer, up to and including thestore, must complete and retire out of the reorder buffer before the load can complete. Effectively, the load has a false dependency on every instruction up to the store. Due to the significant depth of the LS buffer of the AMD Athlon 64 and AMD Opteron processors, any load that is dependent on a store that cannot bypass data through the LS buffer may experience significant delays of up to tens of clock cycles, where the exact delay is a function of pipeline conditions. The following sections describe store-to-load forwarding examples. Store-to-Load Forwarding Pitfalls—True Dependencies A load is allowed to read data from the store-buffer entry only if all of the following conditions are satisfied: • The start address of the load matches the start address of the store. • The load operand size is equal to or smaller than the store operand size. • Neither the load nor the store is misaligned. • The store data is not from a high-byte register (AH, BH, CH, or DH). ..... ------------------------------------------------------------------------------ Page 236 Chapter 9 Optimizing with SIMD Instructions 9.4 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers Optimization Avoid moving data directly between general-purpose registers and MMX™ registers; this operation requires the use of the MOVD instruction. If it’s absolutely necessary to move data between these two types of registers, use separate store and load instructions to move the data from the source register to a temporary location in memory and then from memory into the destination register, separating the store and the load by at least 10 instructions. ------------------------------------------------------------------------------ amazing... at least 10 instructions! Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.