Computer Chess Club Archives

Subject: Re: Crafty and NUMA

Author: Robert Hyatt
Date: 10:40:06 09/03/03

On September 03, 2003 at 12:16:11, Vincent Diepeveen wrote:

>On September 03, 2003 at 12:00:58, Robert Hyatt wrote:
>
>>On September 03, 2003 at 09:26:31, Vincent Diepeveen wrote:
>>
>>>On September 02, 2003 at 18:37:05, Jeremiah Penery wrote:
>>>
>>>>On September 02, 2003 at 07:15:55, Mridul Muralidharan wrote:
>>>>
>>>>>
>>>>><snip>
>>>>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>>>>
>>>>>>
>>>>>>Any large (multi-node) SMP machine will have the same problem as NUMA with
>>>>>>respect to inter-node latency.  SMP doesn't magically make node-to-node
>>>>>>communication any faster.
>>>>>
>>>>>Pardon my saying so , but it looks like you have very little idea about SMP and
>>>>>NUMA.
>>>>
>>>>If I didn't have some idea what I was talking about, I wouldn't be talking,
>>>>unlike a lot of people in these discussions.
>>>>
>>>>> Refer to cray architecture , an opteron 8 way box architecture , and some
>>>>>IBM supercomp cc-NUMA based system architecture docs for more info. I'm not
>>>>
>>>>Those machines are designed and built for *completely* different purposes.  You
>>>>might as well compare the documentation for a P4 to that of an UltraSPARC, for
>>>>all the good it would do you.
>>>>
>>>>>referring to just theoretical differences , or _only_ architecture differences -
>>>>>but as a programmer - what details that need to be taken care of while writing
>>>>>apps for such a system.
>>>>
>>>>And those details would be what, other than the aforementioned theoretical or
>>>>architectural differences?
>>>>
>>>>>>But in reality, almost nobody uses a machine that big, especially for chess.
>>>>>
>>>>>The question was - can it be done , is it just a bunch of tweaks - not do you
>>>>>have a system.
>>>>>Answer : Yes it can be done , needs lots of rewrite - not just "tweaks".
>>>>
>>>>Not really.  Bob said he already completed the changes, and it didn't really
>>>>involve much.  Only instead of forking processes he had to manually start
>>>>processes on each processor.  That really doesn't take much work.
>>>
>>>But that's of course not true.
>>>
>>>Why are you believing this nonsense?
>>
>>Would you like a name at Compaq?  They sent me an alpha, and a NDA copy of
>>their UPC compiler to do this work.  I didn't publish anything due to the NDA
>>of course, but that has lapsed and the compiler is now commercially available.
>>
>>Why do _you_ write this nonsense???
>
>So
>  a) you either lose your source code when you achieve something important
>     (a numa version of crafty)

Yep.  Although I didn't consider it very "important".  I was interested in
the NUMA approach with several alpha processors (21264's).  It worked OK.  Not
great, but then the testing was interrupted when the disk went south.  Compaq
replaced the disk, but the changes were lost.  I have the notes, and I'll fix
this again one day so that you can then claim those results are faked as well.

Of course you don't produce _any_ results, which says a lot about _your_
research...


>  b) you can't find even a cray executable of cray blitz when someone
>     offered to rerun your tests to verify your speedup numbers

False.  I have an executable.  I don't have any source code.  Cray keeps _all_
executables on backup tape.  Of course, don't let real facts get in the way of
your imagination.

>  c) you signed NDA that proves that crafty runs well at cc-NUMA machines
>     with microseconds latencies if i understand well here


I didn't say it "ran well".  I said "it ran ok".  I was getting a speedup of
about 12 on 32 processors.  I don't consider that "well" at all.  Some test
cases were much worse, only a few were better.  In fact, that 12x might be
closer to 10x on average.  But the code was _not_ optimized for the NUMA
world except for the memory allocation for split blocks and replicated local
data I mentioned.  I didn't (at the time) have code to limit the number of
processors working together at a single point, although that is now in the
current version of Crafty.  I didn't use the lockless hashing algorithm, which
certainly hurt performance on the NUMA box as locks were horrible.
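
For anyone who hasn't seen the lockless hashing idea, here is a minimal sketch, assuming it means the XOR scheme from the Hyatt and Mann lock-less transposition table paper. It is Python for brevity, not Crafty's C code, and the names (LocklessTable, store, probe) are made up for illustration. Each slot holds two 64-bit words, key XOR data and data. If two processors interleave their writes to the same slot, the words no longer match and the probe simply rejects the entry, so no lock is needed.

```python
MASK64 = (1 << 64) - 1  # assumption: keys and data are 64-bit words


class LocklessTable:
    """Toy transposition table using XOR validation instead of locks."""

    def __init__(self, size):
        self.size = size
        # Each slot is two words: [key XOR data, data].
        self.slots = [[0, 0] for _ in range(size)]

    def store(self, key, data):
        slot = self.slots[key % self.size]
        # On real hardware these two writes are not atomic as a pair,
        # which is exactly the case the XOR check guards against.
        slot[0] = (key ^ data) & MASK64
        slot[1] = data & MASK64

    def probe(self, key):
        slot = self.slots[key % self.size]
        check, data = slot[0], slot[1]
        # A consistent entry satisfies check XOR data == key.
        # (A never-written slot would match key 0; a real table
        # would have to handle that case separately.)
        if (check ^ data) == key:
            return data
        return None


table = LocklessTable(1024)
table.store(0x1234, 0xABCD)
hit = table.probe(0x1234)            # consistent entry is returned
miss = table.probe(0x1234 + 1024)    # same slot, different key

# Simulate a torn write: the second word comes from another store.
table.slots[0x1234 % 1024][1] = 0x9999
torn = table.probe(0x1234)           # rejected, not returned as garbage
```

The cost is that a torn entry is lost rather than repaired, which is harmless for a hash table that is only a cache of earlier search results.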





>
>So stop the nonsense here Bob.
>
>Show outputs of crafty at cc-NUMA machines with random latencies > 1 microsecond
>or a worst case one way pingpong time than 500 ns.
>
>It never worked at them of course and never will.

It did, and it will.  Of course, anything you didn't do is "impossible"
as always.  Thankfully I don't live in "that world".




>
>There is very little concrete outputs of crafty > 4 cpu's. All i remember is
>some very expensive 16 processor alpha which ran crafty shortly.

That was a non-NUMA box.  And it ran well, on ICC even.  NUMA is definitely
different, but it definitely worked, and it will work again...




>
>But that machine is not even remote to the latencies that real cc-NUMA machines
>have. If you mail me a version of crafty that compiles at a R14000 i can do a
>few runs for you at 16 cpu's or whatever number you want up to 130 without
>problems. 500 even.
>
>But nalimov's stuff doesn't compile at the mipspro compiler. Dunno why.
>
>I have run other software than DIEP too at up to 16 cpu's and they all do better
>than Crafty.


So?  I have told you repeatedly that Crafty won't work well on NUMA now.
That is apparently your biggest problem, you can't understand simple
statements and explanations.

You have also said nobody can get a decent speedup using a cluster with
100 Mbit Ethernet.  Yet others have done just that, including Schaeffer.
Again "impossible" == "Vincent can't do it."






>
>>
>>>
>>>>>>For any but the most extremely scalable architectures, there is significant
>>>>>>diminishing returns when adding processors for chess playing.  I'd say that a
>>>>>>very scalable 8-way SMP or NUMA (Opteron) machine will not be very much slower
>>>>>>than even a 64-way Alpha/Itanium/xxx machine for chess.
>>>>>
>>>>>If badly programmed , then yes not much difference between a 8 proc box and a 64
>>>>>proc box (actually it can be lower performing!).
>>>>>Which is exactly my point , you need to design a program specifically to run on
>>>>>such a system - not expect something that works on a 2 or 4 proc system and
>>>>>expect it to work for a 64 proc system !
>>>>
>>>>The Alpha-Beta algorithm used for chess is a serial algorithm.  There's no
>>>>getting around that.  The more processors you use, the less efficiency you will
>>>>get, unless you use something else than Alpha-Beta.
>>>>
>>>>No matter how much you want to rewrite and "tweak" for a NUMA machine (or any
>>>>kind of machine, for that matter), adding more and more processors is simply
>>>>going to stop being beneficial at some point.
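
The "serial algorithm" point above can be seen on a toy tree. The sketch below is only an illustration, not code from Crafty or any other engine, and the tree and function names are invented. It counts nodes visited by plain minimax and by alpha-beta. The cutoffs in the second and third branches exist only because the first branch has already been searched and raised alpha. Processors that start on those branches at the same time, without that bound, do the full minimax work there, which is where the lost efficiency comes from.

```python
def minimax(node, maximizing, counter):
    """Plain minimax; counter[0] accumulates visited nodes."""
    counter[0] += 1
    if isinstance(node, int):          # leaf: static score
        return node
    values = [minimax(child, not maximizing, counter) for child in node]
    return max(values) if maximizing else min(values)


def alphabeta(node, alpha, beta, maximizing, counter):
    """Alpha-beta; later siblings are pruned using earlier results."""
    counter[0] += 1
    if isinstance(node, int):
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False, counter))
            alpha = max(alpha, best)   # bound learned from earlier siblings
            if alpha >= beta:
                break                  # cutoff: remaining siblings skipped
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True, counter))
        beta = min(beta, best)
        if alpha >= beta:
            break
    return best


# Hypothetical two-ply tree: root is a max node, its children are min nodes.
TREE = [[3, 5, 6], [2, 9, 8], [1, 7, 4]]

mm_nodes = [0]
ab_nodes = [0]
mm_value = minimax(TREE, True, mm_nodes)
ab_value = alphabeta(TREE, float("-inf"), float("inf"), True, ab_nodes)
print(mm_value, mm_nodes[0], ab_value, ab_nodes[0])
```

Both searches return the same value, but alpha-beta visits fewer nodes because each of the last two branches is abandoned after its first leaf.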
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.