Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Crafty and NUMA

Author: Robert Hyatt

Date: 08:48:04 09/03/03

Go up one level in this thread


On September 03, 2003 at 05:46:52, Mridul Muralidharan wrote:

>Hi,
>
>  Wanted to respond with something similar to what is already said down this
>thread :)
>
>Only a few comments.
>
>I read the paper describing the DTS implementation to work with recursive crafty
>- looks more like a workaround to me, albeit a clever one.
>I dont think you have truely appreciated the issues involved in getting crafty
>to work on a 128 or 256 or 512 proc ccNUMA box (like the one vincent is working
>on) - IF crafty does work with the minor changes/tweaks , then I will be
>extremely surprised ! but with all due respect , I dont think I will be :)
>
>Regards
>Mridul

The point was that this has already been done once.  About 2 years ago or
so.  Compaq wanted me to test their UPC compiler on a NUMA-based alpha machine
they produce.

I didn't say that getting something to work on 512 processors is easy.  What
I _did_ say was that getting crafty to work on a NUMA platform was not that
hard, because I had already done that.  Getting something to work on a 512
processor SMP box (if such a thing were practical to build) would _still_ be
a real challenge.

But NUMA is not _that_ hard to work around.  You simply need to understand the
weaknesses of the NUMA architecture, and the details of the current chess search
algorithm, and then make them work together.  This is mainly concerned with
two issues:

(1) local vs non-local memory.  Everything possible needs to be put into local
memory, even if it is replicated N times for N processors.  Things that are
shared need to be put in the local memory of the processor that will use them
most frequently.

(2) latency will affect how deeply in the tree splits can be done, without
allowing the latency to become the dominant term in the time-to-complete-the-
subtree-search accounting...

for (2) I already have limits in crafty that are adjustable, for that very
reason.  I can limit splits to exclude the last N plies of the normal search,
where N can be set to any value between 1 and infinity.  I also have the ability
to tie processors into "clusters" so that a cluster works together on a sub-tree
without having _all_ idle processors jump in at the same point.   Too many at
one split point introduces too much synchronization overhead...

(1) was the main issue I had to deal with in Crafty.  Since I ran N copies of
the program (not N threads) all the normal stuff was in local memory by
default.  But for split blocks, I had to change the way they were allocated
so that groups of N split blocks were allocated on each different processor
so that they would be in local memory.  Then the block allocator code was
changed to prefer split blocks in a processor's local memory.

It wasn't _that_ hard.

Using 512 processors is a different thing altogether, and with or without
NUMA, it is not easy.

>
>On September 02, 2003 at 11:00:13, Robert Hyatt wrote:
>
>>On September 01, 2003 at 06:09:48, Mridul Muralidharan wrote:
>>
>>>Hi Jeremiah,
>>>
>>>  If you want crafty to get to work with a decent speedup on a 16 or 32 CPU
>>>cc-numa where you have say 2 - 4 processors per node with a significant
>>>inter-node latency (like most higher cpu numa boxes ?!)  , you will have to
>>>ensure that the way you split , memory usage , etc is optimal - you dont want to
>>>access a hash entry in proc 0 from proc 32 when the latency wil be in
>>>milliseconds !!!
>>>
>>>Hope you are appreciating the real world problems - not theoretical issues.
>>>If you actually work on these boxes , you will appreciate the problems faced by
>>>these developers more - using threads on such a box for parallelism , urggh !!
>>>
>>>Regards
>>>Mridul
>>>
>>>PS : Forget getting crafty to work on the 500 CPU beast that Vincent is working
>>>on without a total crafty rewrite ! The horrors Vincent must be facing is
>>>unimaginable - the once he has already mentioned in this forum and I'm sure ,
>>>the more horrible ones he may not have !
>>>
>>
>>A total rewrite is _not_ needed.  The search is already designed to work on
>>_any_ type of parallel machine.  The issue is allocating data structures on
>>the right processor.  This will _not_ be a major change.  I currently use an
>>array of split blocks.  What I need is an array of pointers to split blocks,
>>so that each processor can allocate a few split blocks in its local memory.
>>Then the block allocator simply has to prefer split blocks in the processor
>>that will be using them, when trying to allocate a split block for a parallel
>>thread to use.
>>
>>It is something I have on my list of things to do, but the real issue is
>>"how to allocate _local_ memory" reliably, without wrecking things on non-
>>NUMA machines?
>>
>>That's why I haven't looked at this lately.  I looked at it a year ago on a
>>NUMA alpha box, but unfortunately the code was lost when the disk on that
>>machine crashed with no backups.  I got a new disk, but the source changes
>>were lost.  This was written around Compaq's UPC compiler..
>>
>>
>>>On August 30, 2003 at 10:40:03, Vincent Diepeveen wrote:
>>>
>>>>On August 29, 2003 at 23:41:32, Jeremiah Penery wrote:
>>>>
>>>>>On August 29, 2003 at 18:40:23, Mridul Muralidharan wrote:
>>>>>
>>>>>>I'm not sure of what/why Prof. Bob Hyatt may have made those comments. But to
>>>>>>get a program like crafty to work properly in a numa machine will not be trivial
>>>>>>- and it wont be tweaks , but something more.
>>>>>
>>>>>All multi-CPU Opteron machines are NUMA.  Crafty will work just fine in those.
>>>>>It will not be theoretically optimal, but that also depends on the OS to help
>>>>>with NUMA issues.
>>>>
>>>>The OS has to do very little for chessprograms. Just keep scheduling the same
>>>>process at the same cpu and physically allocating local memory at that cpu's
>>>>RAM.
>>>>
>>>>Of course for a lot of other services the OS has to do a lot different, yet in
>>>>chessprograms we do not need it as most of us, except for example cilkchess,
>>>>write their parallellism at a very low level.
>>>>
>>>>>>Duals , etc count as SMP machine not cc-numa which I was refering to.
>>>>>Dual Opterons are NUMA.
>>>>
>>>>And soon all duals that we can afford will be.
>>>>
>>>>Best regards,
>>>>Vincent



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.