Computer Chess Club Archives


Subject: Re: Crafty and NUMA

Author: Robert Hyatt

Date: 07:57:31 09/04/03


On September 04, 2003 at 05:55:08, Mridul Muralidharan wrote:

>Hi,
>
>  My comments inline.
>
>Regards
>Mridul
>
>On September 03, 2003 at 12:04:01, Robert Hyatt wrote:
>
>>On September 03, 2003 at 05:13:07, Mridul Muralidharan wrote:
>>
><snip>
>>>If it was just a bunch of tweaks that you mention here - I would love to see how
>>>much performance it will give on a 64/128 proc NUMA box :)
>>>I can make a guess - it will suck a**. (No offence to anyone here)
>>
>>You are mixing apples and oranges.  How will it do on a 128 node SMP
>>box?  How will it do on a 128 node NUMA box?  _both_ will not do very well
>>since things are not tuned for that many processors.  However, the original
>>NUMA port did pretty well on a 32 CPU box.  Not as well as it would have done
>>on a 32 CPU SMP box however.  But then NUMA won't _ever_ produce the same
>>level of performance as pure SMP boxes will.  They are just much more
>>affordable.
>>
>
>  If you properly design and implement a version of Crafty _for_ NUMA, versus just
>"tweaking" a current version for NUMA (one that was originally written for SMP ?!) -
>then do you expect both to give the same or even comparable performance ?!!

Yes, that's the way I write code.

Here's a list of things I did for the Compaq port.  They address all the NUMA
issues:

1.  No threads.  On NUMA you can't afford to share code or global data that
doesn't change (e.g. attack tables for bitmaps, etc.); each processor needs its
own local copy.  This could be done on the current version easily by changing to
fork() rather than clone() on Linux.  Or by simply starting separate processes
manually or via something other than fork() (rsh or whatever).

2.  local split blocks.  This is the critical problem.  A split block contains
all data needed for a search, including chess board, hash stuff, repetition
list, etc.  That _must_ be local to the processor that is using it to do the
search.  And this was the biggest change I had to make.  It wasn't a difficult
change, as I used a global array of pointers to split blocks, but each processor
was responsible for malloc()'ing its own split blocks so that they were local.
The global array of pointers let other processors copy data to a split block
that was local to the processor that was going to help.  This was not a major
project, but there were some "issues" from allocating a split block on the
right processor, to trying to keep "groups" of processors working together
if they were "close" in terms of NUMA access latency.

3.  Transposition table.  Locks are expensive.  Tim and I ended up doing what
Harry Nelson had done for Cray Blitz, the "lockless approach", but only after
the Compaq project was over.  This was an "issue" that needed attention, and
today it has been solved, whereas on the NUMA Compaq box it was not so clean.

Sharing the thing invites trouble.  Compaq had some library stuff that allows
you to pre-load things to avoid the access latency, and this helped some.  But
with a lock, there is nothing to do to avoid the problems there since the lock
is shared across a set of routers.

4.  Sharing important stuff.  This has to be handled differently if you
use separate processes, but it is mainly some messy programming rather than
innovative development.
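
The split-block layout in item 2 can be sketched roughly as follows.  This is
a minimal, single-threaded illustration and not Crafty's actual code: the
names split_block, split_blocks, alloc_local_blocks() and give_work(), the
field list, and the sizes are all hypothetical.  It assumes the operating
system places memory on the node of the processor that first touches it, and
it leaves out the synchronization (and, with separate processes, the shared
memory) that a real implementation would need.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 4        /* hypothetical processor count */
#define BLOCKS_PER_CPU 2  /* hypothetical split blocks per processor */

/* Hypothetical, much-reduced split block.  The real one holds everything
   a search needs (board, hash state, repetition list, and so on). */
typedef struct {
  int board[64];                   /* piece on each square */
  unsigned long long rep_list[32]; /* repetition keys */
  int rep_count;
  int in_use;
} split_block;

/* Global array of pointers.  Only the pointers are global; each block
   lives in memory allocated by the processor that will search with it. */
static split_block *split_blocks[MAX_CPUS][BLOCKS_PER_CPU];

/* Each processor calls this once for itself, so that malloc() and the
   first touch of the pages both happen on the owning processor. */
int alloc_local_blocks(int cpu) {
  assert(cpu >= 0 && cpu < MAX_CPUS);
  for (int i = 0; i < BLOCKS_PER_CPU; i++) {
    split_block *b = malloc(sizeof *b);
    if (b == NULL) return -1;
    memset(b, 0, sizeof *b);  /* touch the pages locally */
    split_blocks[cpu][i] = b;
  }
  return 0;
}

/* A processor that wants help copies its search state into a free block
   that is local to the helper.  Returns that block, or NULL if the
   helper has no free block. */
split_block *give_work(int helper, const split_block *src) {
  assert(helper >= 0 && helper < MAX_CPUS);
  for (int i = 0; i < BLOCKS_PER_CPU; i++) {
    split_block *dst = split_blocks[helper][i];
    if (dst != NULL && !dst->in_use) {
      *dst = *src;      /* one remote copy; the search itself stays local */
      dst->in_use = 1;
      return dst;
    }
  }
  return NULL;
}
```

The point of the design is that the expensive remote access happens once, at
the copy in give_work(), and every access made during the search after that
goes to the helper's own memory.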
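
The "lockless approach" in item 3 is not spelled out above.  A common way to
describe it is the XOR scheme sketched here: each entry stores the position
signature XORed with its data word, so an entry half-written by another
processor fails the check on probe and is treated as a miss, with no lock
taken.  This is a sketch under that assumption, not the Crafty source; the
names hash_entry, hash_store() and hash_probe(), the two-word entry, and
TABLE_SIZE are hypothetical, and the demonstration is single-threaded (real
parallel code would also need to consider how the compiler and hardware
treat unsynchronized accesses).

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 1024  /* hypothetical size; must be a power of two */

typedef struct {
  uint64_t key_xor_data;  /* position signature XOR data word */
  uint64_t data;          /* packed score, depth, best move, ... */
} hash_entry;

static hash_entry table[TABLE_SIZE];

/* Store without a lock.  The two words may be written at different
   times relative to another processor's store to the same slot. */
void hash_store(uint64_t key, uint64_t data) {
  assert((TABLE_SIZE & (TABLE_SIZE - 1)) == 0);
  hash_entry *e = &table[key & (TABLE_SIZE - 1)];
  e->key_xor_data = key ^ data;
  e->data = data;
}

/* Probe without a lock.  Returns 1 and fills *data_out on a hit.  If
   the two words do not belong together (a torn or foreign entry), the
   XOR check fails and the probe is a miss. */
int hash_probe(uint64_t key, uint64_t *data_out) {
  const hash_entry *e = &table[key & (TABLE_SIZE - 1)];
  uint64_t data = e->data;
  uint64_t check = e->key_xor_data;
  if ((check ^ data) != key) return 0;
  *data_out = data;
  return 1;
}
```

The cost of an interleaved write is one lost table entry rather than a
corrupted search, which is why no lock (and no lock traffic across the
routers) is needed.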

>Of course - I never said Crafty won't work - it will - but the performance will be
>pathetic compared to the NUMA version.

The modified version _would_ be the NUMA version in my case.



>Most likely the performance on a 64 proc NUMA box will turn out to be better
>than a 128 proc NUMA box !!

Depends.  This might be a non-NUMA issue, however.  I.e., using a large number
of processors has its own set of issues, whether the underlying platform is
NUMA or not.

But remember, I am _not_ new to NUMA.  I ran programs on the first Connection
Machine 20 years ago (not chess).


>Hence - to get this working properly on NUMA boxes (not itsy bitsy 8 or 16 proc
>machines - though a 16 proc box would be pretty cool to own ;) ) you _do_
>need a redesign and reimplementation - not a bunch of tweaks.
>When I say redesign/reimplementation - I'm only referring to search and memory
>management: that is what is usually required.

I agree.  I just disagree on the size of the change.  Crafty has some things
_already_ in it for such architectures in general.







Last modified: Thu, 15 Apr 21 08:11:13 -0700
