Computer Chess Club Archives

Subject: Re: MP engines

Author: Robert Hyatt
Date: 09:37:22 12/26/03
On December 26, 2003 at 10:58:18, Mridul Muralidharan wrote:

>
>Comments inline ....
>
>On December 25, 2003 at 10:12:11, Robert Hyatt wrote:
>
>>On December 25, 2003 at 04:55:44, Mridul Muralidharan wrote:
>>
>>>Hi all,
>>>
>>>  I would like to know the list of MP engines that are currently active -
>>>amateur , commercial , and private (not in-development efforts :) )
>>>
>>>The list that I could come up with was :
>>>
>>>Baron (up to 2 or even higher ? )
>>>Brutus
>>>Crafty
>>>Diep
>>>Fritz
>>>Junior
>>>Shredder
>>>Sjeng
>>>SOS (or is it ParSOS ?)
>>>
>>>Of these , which all can scale > 8 procs ?
>>>Diep I know for sure.
>>>SOS also I think.
>>>What about rest ?
>>
>>Crafty certainly does, given the right kind of architecture.  I've run it
>>on 32-way boxes for example.
>
>
>Which is this kind of hardware ?

32 way alpha-based machine...


>I have not seen any results of crafty on a 32 or higher proc box ....
>Could you give me some link, etc for this ?


I don't know that there is anything on the net relative to that.  It was
done a while back on ND so all I could mention was NPS and so forth, not
the actual hardware details.  I also ran it on a Cray T932 (with 32 processors)
very early in the parallel search development, when I had a chance to fiddle
with Cray Blitz on that machine as well.



>(And I don't have any data on cray blitz either ...)
>Thanks
>
>
>>
>>>
>>>
>>>Of these which all are NUMA compatible.
>>>Latest crafty , Diep.
>>>Anyone else ?
>>
>>Be careful.  There is NUMA and there is NUMA.
>
>Hmm - fundamentals of design - design for generic case - specialise for specific
>case :)
>So if design is to support numa - then max you need is tweaks for specific
>hardware - but not much else changes.

Correct.  But those tweaks are interlaced through the program:  allocating
memory, tying a thread to a specific processor (or to a node, on machines such
as SGI where a node has more than one CPU), etc...
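
Tying a thread to a processor is one of the OS-specific tweaks mentioned above. A minimal Linux-only sketch, assuming glibc's sched_setaffinity() interface; the helper names PinThreadToCpu and FirstAllowedCpu are hypothetical and are not taken from Crafty. Other systems need a different call (Windows, for instance, has SetThreadAffinityMask()).

```c
#define _GNU_SOURCE          /* needed for cpu_set_t and the CPU_* macros */
#include <sched.h>
#include <assert.h>

/* Hypothetical helper (not Crafty code): pin the calling thread to one CPU.
   Returns 0 on success, -1 on failure (errno is set by the kernel). */
int PinThreadToCpu(int cpu) {
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: lowest-numbered CPU this thread is currently
   allowed to run on, or -1 if the affinity mask cannot be read. */
int FirstAllowedCpu(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

A search thread would call PinThreadToCpu() once at startup, before it allocates or touches its own data, so that its memory and its CPU end up on the same node.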

>If you design for smp and try to extend for numa - you are in for shit !
>But then , it is pointless to get into this discussion again - we already had
>this before : You will not agree that crafty design is not good for NUMA and I
>will not agree to the contrary :)
>
>>Current Crafty supports NUMA
>>under windows, which pretty well limits it to Intel and AMD machines.
>
>I could not get crafty to work on aix or irix - otherwise could have given this
>new numa code in crafty a shot at a 16 or 32 box machine .... but from looks of
>it - and this is a very personal opinion backed with zero data - it won't scale -
>at least nowhere close to your formula given below :)

Did you notice the caveat above?  "windows only".  So of course it is not
going to scale well on AIX or IRIX.  That needs _specific_ programming for
each of those, and that has not been done.  All that has been done to
date is windows and Linux, and the linux stuff is far from complete because
linux NUMA support is far from "ready for prime time."




>
>
>> Diep
>>is running on SGI, which is not particularly compatible with anybody else.
>
>I am sure that diep scales on any other numa platform - remember - teras runs
>irix which is pretty similar to linux.

Not their NUMA API however... that is the issue.  And the tera (SGI) box
is far different from other NUMA machines.  IE the concept of a "node"
on an SGI box means multiple cpus with a common local memory.  On an AMD,
for example, "node" means "cpu".


>
>But this is for Vincent to comment on - not me ! :)
>
>
>>I have an experimental NUMA version of Crafty, but it is not well-checked out
>>yet (it is a linux version using libnuma for the NUMA stuff, but linux kernels
>>are very spotty in their NUMA support to date.)
>>
>
>Too true ! Linux is yet to get there ... hopefully next year mid it should be
>pretty good and stable.
>
>
>>Eventually Crafty will support NUMA on linux and windows.  Others may be added
>>if time permits and hardware becomes available for testing.
>>
>
>Great ! and I assume - with only "tweaks" to get it to work ?!! ;)

Yes.  There will be functions like MallocInterleaved() which I will define
myself, and then inside there, it will be necessary to do whatever the OS
demands so that the malloc() is spread across all nodes/processors.  There
will also be a MallocLocal() which allocates only on local memory.  Etc.  IE
I will define the Crafty NUMA API, and then have my own NUMA library that then
has to be modified to work with whatever OS we are interested in, without
having to get into the Crafty code itself...
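
The wrapper layer described above might look roughly like the sketch below. The names MallocInterleaved() and MallocLocal() come from the post; everything else is an assumption: the bodies, the NumaFree() helper, the USE_LIBNUMA build flag, and the choice of libnuma on Linux as the backend. It is not Crafty's actual code. Without the flag it falls back to plain malloc(), so the calling code never changes.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef USE_LIBNUMA           /* assumed build flag; link with -lnuma */
#include <numa.h>
#endif

/* Returns 1 if libnuma is compiled in and the kernel supports NUMA. */
static int NumaUsable(void) {
#ifdef USE_LIBNUMA
    static int cached = -1;  /* -1 = not checked yet */
    if (cached < 0)
        cached = (numa_available() != -1);
    return cached;
#else
    return 0;
#endif
}

/* Memory spread page by page across all nodes (e.g. a shared hash table). */
void *MallocInterleaved(size_t bytes) {
    assert(bytes > 0);
#ifdef USE_LIBNUMA
    if (NumaUsable())
        return numa_alloc_interleaved(bytes);
#endif
    return malloc(bytes);    /* fallback: placement is up to the OS */
}

/* Memory on the calling thread's own node (e.g. per-thread search data). */
void *MallocLocal(size_t bytes) {
    assert(bytes > 0);
#ifdef USE_LIBNUMA
    if (NumaUsable())
        return numa_alloc_local(bytes);
#endif
    return malloc(bytes);
}

/* Hypothetical companion: libnuma needs the size back when freeing. */
void NumaFree(void *p, size_t bytes) {
#ifdef USE_LIBNUMA
    if (NumaUsable()) {
        numa_free(p, bytes);
        return;
    }
#endif
    (void)bytes;
    free(p);
}
```

Porting to another OS would then mean rewriting only these function bodies, which is the point of keeping the NUMA calls out of the engine code.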



>
>>
>>
>>
>>>
>>>Can anyone provide with what is the usual speedup on a Quad for these engines -
>>>say a typical middle game position ?
>>
>>My formula for speedup is this:
>>
>>speedup = 1 + (ncpus - 1) * .7
>>
>
>
>Thanks for this - do you have an upper limit by which this formula stops being
>accurate due to diminishing returns ?

No.  I have verified that it works fine through 8-way.  The limited 16-way
time I have had on alphas in the past was spent on other issues.  IE to get
decent numbers on a bigger alpha, the hash locks had to go, which took time
to do.  However, based on analysis, I would speculate that the above will
hold true so long as the hardware works.  IE some 8-way Intel boxes will
_not_ scale very well, as they use the same memory architecture as they use
for 4-way, and it runs out of bandwidth with twice the processors.  So the
scaling on larger boxes depends on the box.  IE on the Cray T932, it will
scale to 32 easily at that same formula, but then the Cray is a "special
machine" from a performance perspective.  IE it doesn't run out of bandwidth.
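
The rule of thumb quoted above, speedup = 1 + (ncpus - 1) * .7, is easy to evaluate for any processor count. A small sketch follows; the function names are hypothetical. Per the post, only the values through 8-way were verified, and larger counts are extrapolation that depends on the hardware not running out of memory bandwidth.

```c
#include <assert.h>

/* Rule of thumb from the post: each CPU after the first adds 0.7x. */
double EstimatedSpeedup(int ncpus) {
    assert(ncpus >= 1);
    return 1.0 + (ncpus - 1) * 0.7;
}

/* Fraction of the ideal ncpus-fold speedup that the formula predicts. */
double EstimatedEfficiency(int ncpus) {
    return EstimatedSpeedup(ncpus) / ncpus;
}
```

EstimatedSpeedup(4) gives about 3.1, the quad figure mentioned in the thread, and EstimatedSpeedup(8) gives about 5.9. The predicted efficiency falls toward 0.7 as ncpus grows, but never below it.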





>Also , how much depth does crafty need to achieve before it starts giving this
>kind of speedup ?
>For mess , in the first few plies speedup sucks ....

Seems to work fine at 1 sec/move, for example, and I run it in bullet all
the time.  For searches of 4-5 plies, it isn't going to do well, but on
reasonable hardware, 1 second ought to get to well beyond 8 and it should
run just fine.

If you are interested, I could run some .1 second searches to see what
happens, but the problem there is that the timer is not that accurate and
quantization errors will overwhelm the real numbers.
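
To see why timer quantization matters at that scale, here is a sketch that bounds the speedup computable from two measured times. It assumes each measurement is within one timer tick of the true time; the post does not state the timer resolution, so the 10 ms tick used below is purely illustrative, and the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper. t1 and tn are measured search times (seconds) on
   1 and N CPUs; tick is the assumed timer resolution. Assuming each
   measurement is off by at most one tick, writes the smallest and largest
   speedup consistent with the measurements into *lo and *hi.
   Returns 0 on success, -1 if a measurement is not larger than one tick
   (in which case no meaningful bound exists). */
int SpeedupErrorBounds(double t1, double tn, double tick,
                       double *lo, double *hi) {
    assert(lo != 0 && hi != 0 && tick >= 0.0);
    if (t1 <= tick || tn <= tick)
        return -1;
    *lo = (t1 - tick) / (tn + tick);   /* t1 overstated, tn understated */
    *hi = (t1 + tick) / (tn - tick);   /* t1 understated, tn overstated */
    return 0;
}
```

With t1 = 0.100, tn = 0.032 and tick = 0.010, the measured ratio is about 3.1, but the bounds run from roughly 2.1 to 5.0, so the measurement says very little. At 1 second per move the same tick moves the ratio by only a few percent.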





>
>
>>IE for a quad, that gives 3.1x which is a pretty good approximation.
>>
>>
>>
>>
>>
>>>(In case you need a machine config to base numbers on - what about a Quad Xeon 3
>>>GHz 3 Gig RAM running OS of your choice).
>>
>>CPU speed really doesn't matter so long as there are no hardware bottlenecks
>>to deal with.  IE my formula above works just as well on my quad pentium-pro
>>200 as it does on my quad xeon 700 and my dual xeon (with hyperthreading on)
>>2.8ghz.  It also fit the quad opteron 2ghz machine just fine.
>
>This was given just so that I don't get flames like "What hardware" from people
>who know almost nothing about parallel programming ;)
>
>
>Thanks for your comments
>
>Mridul
>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>>Thanks in advance
>>>Mridul
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.