Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: MTD(f) (was Re: New(?) search idea.)

Author: Robert Hyatt

Date: 17:13:09 01/23/98

Go up one level in this thread


On January 23, 1998 at 19:46:40, Don Dailey wrote:

>>I would not say "trivial"... :)
>
>I wrote Cilkchess first as a serial program and got it debugged (ok, it
>turned out
>it was not so well debugged!)    It took me about an hour to port it to
>Cilk.
>However keep in mind I knew it would be a parallel program.
>
>
>
>
>>>Aske Plaat ported an old version of Crafty to Cilk.  Since Crafty
>>>was designed to be a serial program he had some problems with all
>>>the global state.  But even this did not prevent him from porting
>>>it in a few days (even though he was completely unfamiliar with the
>>>code.)   If anyone decides to write a parallel program I highly
>>>recommend encapsulating all your position state (the board and
>>>relevant state) into a structure.   If you are concerned that
>>>creating new states for each node will be too expensive, don't
>>>be.  Modern processors copy contiguous data extremely rapidly
>>>and you avoid some work with keeping unmake information (you never
>>>unmake a move.)   The same code (a big benefit of using cilk) will
>>>produce a serial version of the program that runs very fast and I
>>>promise it will be competitive with your current serial program.
>>>(I think I hear someone in the background screaming something about
>>>data cache but I can't quite make out what they're saying.)
>>
>>you need to run on a PC.  I once did this (copy state) and the best I
>>could do was 5K nodes per sec on a P5/133...  when I got rid of the
>>state
>>copy and started make/unmake, I eventually got past 30K.  The PC only
>>has a sustainable bandwidth of 60-70mb/second.  Compared to 32 gigabytes
>>per second on a Cray.  Or maybe a gig/sec on your CM-5.  You will get
>>killed copying large structures on a PC however...
>
>This sounds way off to me but I don't know you tried to store.   You are
>claiming
>more than 6 to 1 difference for copying state.   This directly implies I
>would need
>about 8 processors to equal the performance of one serial processor with
>make/unmake.
>This means I would be twice as fast if I had NOT run on a 4 SMP's and
>went back to
>make/unmake.
>

I had a roughly 200 byte "state" of all the bitmaps, hash signatures,
castling/ep/50-move status and so forth.  I was doing (in MakeMove) a
copy/update, so that all I had to do was back up a ply and I had no
unmake
to do.  Copying 200 bytes per node is a definite *no no* on a PC.  I
reduced
this over time, but the PC simply has no memory bandwidth to spare.  IE
to
do a read on a P6/200 takes 20+ clock cycles, not to mention that
copying
large blocks like that blows out cache as well.  And while you are
burning
those 20 cycles doing a memory read (only 8 bytes per 20 cycles
remember)
you can't fetch instructions or anything else that is not in cache.



>When I started using this technique (to facilitate parallel programming)
>I was extremely
>concerned about this issue.   I pretty much resigned myself to biting
>the bullet and
>accepting a big slowdown for the greater benefit.    But when I actually
>had a finished
>product and looked at the numbers (and compared to my previous serial
>program with
>make/unmake) I was flabbergasted.  I could not tell the difference.  And
>this was on
>the intel platform using linux.     We had all these hacks in to reduce
>the state size
>too, we were using character variables for the board and so on.  But we
>later figured
>out all of this crap was not neccessary,  a few dozen bytes of state
>copied per move
>is nothing.   We took out the hacks and it sped up slightly!
>
>I have no idea how to resolve this with your experience of 6 to 1 since
>making moves
>is less than 5% of our overhead and they include the state copy.


the 6:1 was not the result of *just* copying state.  There were other
changes to boot. I'd suspect that eliminating my make move approach and
doing copy/update rather than having to unmake later would cost at
*least*
30%.  My mistake was running on a non-PC to start with... like the Cray.
It can copy like nobody's business.  But in a PC, just converting your
chess board from chars to ints will make a very noticable slowdown,
because you start pumping 4x the data, using 4x the cache.  I found that
on the PC, everything possible should be char to fit in the small cache.
And that it is critical to arrange data so that data needed at about the
same time is adjacent in memory so that a cache line fill will fetch
stuff
needed, not stuff that will not be used...



>
>But you said something about copying large structures on a PC.   I am
>not copying
>large structures (there is no need.)   Just the board itself and a few
>variables.
>
>But even if someone thought this might be a big problem,  it's trivial
>to solve.  at any
>node near the end nodes drop the parallelism (our quies is not parallel)
>and update
>the state in place with make/unmake.   Even though our quies is not
>parallel I still
>do the state copies simply because it's no slower than make/unmake.
>
>One very nice side benefit of the position structure (other than never
>having to unmake
>a move which itself is a minor speedup)  is that your code becomes much
>cleaner.
>
>- Don
>
>


the lesson to be learned is that on modern cache-based processors, you
can
do 10X the number of instructions and still be faster if you only cut
the
number of memory loads by 1/2...  and that is a trade-off that is pretty
easy to achieve...




>
>hurt things much at all)  and update the state in



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.