Computer Chess Club Archives

Subject: Re: Conclusion

Author: Vincent Diepeveen
Date: 12:51:20 12/28/03
On December 27, 2003 at 13:59:12, Robert Hyatt wrote:

Cut the nonsense, Bob.
The Opteron has about 3.5x lower latency to local memory than your quad Xeon,
and main memory access latency is one of Crafty's bigger problems,
so of course any SMP code works well on that hardware.

Your thing is SMP, not NUMA. Don't confuse the two. Your thing won't run on a
512-processor SGI Origin 3800, of course, with its 5.8 us latency.

Note this is a very good latency.

Clusters in the top 5 of top500.org have around 10-20 us latency.

Only the T3E is a few microseconds below this, and it of course costs $$$$$$$$$.
Fill in digits 0..9 but don't start with a 0 :)

Your dual Xeon has about 400 ns latency, and so does my dual K7 (both use 133 MHz
RAM with a chipset in between, so no big surprise that it is the same).

However, the Opteron has around 120 ns latency to local memory, and only very
slightly more in dual and quad configurations.

So your old 'smp' code, once the cache line length gets taken into account, will
also perform a lot better than your current code. Of course some code that binds
threads to a certain CPU is nice to have. That's as far as I know all Nalimov
did for you; perhaps he wants to comment on that here...
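
For anyone who has not seen it, binding a thread to a CPU can be sketched roughly as below. This is only an illustration in Python, not Crafty's code and not whatever Nalimov actually wrote; `pin_to_cpu` is a made-up helper name, and it relies on `os.sched_setaffinity`, which Python only offers on platforms such as Linux.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the caller to a single CPU and return the resulting CPU set.

    Hypothetical helper for illustration. Returns None on platforms where
    Python exposes no affinity call.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # pid 0 means "the caller"; on Linux this applies to the calling thread.
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)      # CPUs we may run on right now
        print("before:", sorted(allowed))
        print("after: ", sorted(pin_to_cpu(min(allowed))))
        os.sched_setaffinity(0, allowed)       # undo the pin again
    else:
        print("no affinity API on this platform")
```

On a NUMA box the usual reason to pin is that memory the thread touches afterwards tends to be placed on its own node, so it pays the local latency rather than the remote one. Pinning alone does nothing about data that every thread shares.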

Calling Crafty NUMA is the biggest nonsense I have ever heard.

Here is what happens if you run on a big machine with 16+ CPUs and no
shared memory bus.

Note that a 2080-processor IBM Opteron 2.0 GHz machine is $10 million or
thereabouts, a 1000-processor Itanium 2 is around $10 million, and a 16-processor
shared-memory-bus Alpha is also $10 million.

So taking shared memory bus machines here as example is not a good idea.

If your thing were NUMA it would easily run on 100+ processors.

Crafty will hammer with all CPUs on the same data structure, of course
allocated at cpu0, globally locking everything.

So the more processors you add the more it will die.

On the Origin 3800 or Altix 3000 you can easily see this with Crafty when running
interactively, because when too much gets locked, the total system time eaten
isn't close to X when running on X processors.

It simply dies :)
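
A rough way to put numbers on the locking argument is Amdahl's law. This is a generic textbook model swapped in for illustration, not a measurement of Crafty; the 5% figure for time spent serialized under a global lock is purely an assumption, and `amdahl_speedup` is a made-up helper name.

```python
def amdahl_speedup(processors, serial_fraction):
    """Upper bound on speedup when serial_fraction of the work is serialized."""
    if processors < 1:
        raise ValueError("processors must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


if __name__ == "__main__":
    # 5% of the run under a global lock is an assumed figure, not a measurement.
    for n in (1, 4, 16, 64, 512):
        print(f"{n:4d} cpus -> at most {amdahl_speedup(n, 0.05):5.2f}x")
```

With any fixed serialized share s the bound can never exceed 1/s, however many processors get added. The model is optimistic in that it only flattens out: it does not include lock traffic that gets more expensive as processors are added, which is what would make a program actually slow down.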

>On December 27, 2003 at 04:58:51, Mridul Muralidharan wrote:
>
>>On December 26, 2003 at 19:29:57, Robert Hyatt wrote:
>>
>>>On December 26, 2003 at 18:40:57, Mridul Muralidharan wrote:
>>>
>>>>On December 26, 2003 at 16:13:26, Luis Smith wrote:
>>>>
>>>>>On December 26, 2003 at 15:34:43, Darren Rushton wrote:
>>>>>
>>>>>>>Actually what happens, is the 366 is SLOW.  And I mean SLOW.
>>>>>>
>>>>>>I don't intend to be controversial here, but the conclusion I draw from your
>>>>>>results is that Shredder 7 is such a brilliant program it is almost a match for
>>>>>>one of the better amateur programs on hardware that's almost 10 times
>>>>>>slower.
>>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>Darren
>>>>>
>>>>>I think you're missing the point of these experiments.  Some people here were
>>>>>saying that Crafty isn't a world contender.  Bob could get much better hardware
>>>>>than most of the commercials.  He mentioned something about a 32 way box.  Can
>>>>>you imagine the speed of Crafty on something like that?
>>>>>
>>>>>I don't think anyone can count Crafty out after this experiment.
>>>>
>>>>You also need to get decent speedup on those boxes :)
>>>>And I hope it is not a shared bus 32 proc box ;)
>>>>
>>>>A 4/8 cpu opteron against a 32 proc alpha is not fair - for crafty - it would
>>>>lose again against say shredder or fritz.
>>>>
>>>>Mridul
>>>
>>>
>>>I'm not sure what you are saying in the above?  The 4-8cpu opteron will not
>>>run programs well without some work.  I've already done it.  Just dropping
>>>in deep fritz or something similar will not produce great results, from past
>>>experience.  As far as 32 proc alpha, it depends on the box.  I got reasonable
>>>scaling on the 32 cpu version I used last year at Compaq.
>>
>>4 things are important here.
>>
>>1) If deep fritz/shredder/etc gets released which supports quad/8 way opteron -
>>then it will be ported and tested. And the authors will ensure that there is a
>>decent speedup.
>>Don't tell me that they are never going to figure out how to get their program
>>working on a numa opteron box :) - Nalimov could have done a good job on crafty - but
>>even other people would figure out what to do from their specs and docs.
>>And I have a suspicion that some already have ;)
>
>I didn't suggest that at all.  But the question seemed to be based on _today_
>not _next year_.  Today's SMP programs need some changes for the Opteron or
>they will run into some interesting cache and memory reference problems.  None
>are hard to fix.  But they _do_ have to be fixed for reasonable results...
>
>
>
>
>>
>>2) The alpha proc has a disadvantage in latency and processing power w.r.t the
>>opteron - so it is never a 1:4 or 1:8 h/w advantage between the two machines -
>>much lower.
>
>
>It depends on the program and the programmer most likely.  I have not run
>on recent alphas, but I ran on a 21264 at 666mhz a while back and was getting
>around 1M nps.  The last time I ran on a 16-way alpha, the NPS scaled at
>something around 14X+, I will have to see if I can dig up the old logs.  A
>group of doctors bought such a machine here about 2-3 years ago and I ran
>on it on ICC for a dozen games or so one afternoon.  This was a 21164
>machine with 16 cpus, and a single CPU was doing about 500K, the 16-way
>box was doing about 7M (SJLIM watched a few games so he might remember
>the actual numbers, but I don't have any logs myself).  That was OK scaling.
>
>
>
>>Also - what is crafty speedup here ? Any numbers ? What kind of machine is it ?
>>Shared bus ? - in which case you are dead due to bus contention.
>>numa ? - I thought you said crafty works only on windows and intel/amd. You have
>>crafty working for alpha also ?
>
>
>If you have been following the discussions here, you might recall that I
>was working with Compaq last year on an alpha-based NUMA version of Crafty.
>That is why it was so quick to get a NUMA version of Crafty ready again this
>year when the Opteron idea came up.  I had already done it once, although
>the alpha I had here lost the disk drive and all the source changes.
>Fortunately the changes were not that drastic, although I never completed
>_all_ the things that needed doing.  On the 32-way box I was seeing about
>13X faster _searches_. (I am not talking about NPS here but pure time to
>solution).  It was lower than the number I wanted to see, but it needed some
>program changes to further improve it.  IE on a Cray T932, the NPS scales
>by almost exactly 32X.  The speedup was closer to 18x than the predicted 22x
>using my given formula.  However, Crafty doesn't do vectors like Cray Blitz
>did so there was more to be had from that machine that I didn't try to get...
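
The figures quoted above mix two different measures: NPS scaling (how many more nodes per second the box searches) and search speedup (how much sooner the same answer arrives). A small sketch of the arithmetic, using only numbers given in the quoted text; the function names are made up for illustration.

```python
def scaling(single_cpu_rate, multi_cpu_rate):
    """Ratio of multi-CPU throughput to single-CPU throughput."""
    return multi_cpu_rate / single_cpu_rate


def efficiency(factor, processors):
    """Fraction of ideal linear scaling achieved by a given factor."""
    return factor / processors


if __name__ == "__main__":
    # 16-way 21164 box: about 500K nps on one CPU, about 7M nps on sixteen.
    nps_factor = scaling(500_000, 7_000_000)
    print(f"16-way NPS scaling: {nps_factor:.1f}x, "
          f"efficiency {efficiency(nps_factor, 16):.1%}")

    # Time-to-solution speedups as quoted, both on 32 processors.
    for label, factor in (("32-way Alpha", 13.0), ("Cray T932", 18.0)):
        print(f"{label} search speedup: {factor:.0f}x, "
              f"efficiency {efficiency(factor, 32):.1%}")
```

NPS can scale almost linearly while the search speedup lags well behind, because part of the extra nodes searched in parallel would never have been searched by a single processor.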
>
>
>
>
>
>>
>>3) A program's scaling at 4 or 8 procs is going to be much better than at 16 or 32.
>
>
>Again, that is not a statement you can make without a qualifier.  IE CB
>scaled perfectly at 32.  Whatever it got at 16, it got 2x that at 32.  My
>point is that your statement may or may not be true.  It depends on the
>architecture, and what memory looks like.  IE clearly for NUMA boxes, the
>scaling is going to be worse than for a machine based on a pure cross-bar
>like the Crays.  How much worse is a subject for great debate.  IE I have
>some code in Crafty that was designed for a machine that had multiple
>processors per "node".  Hence my idea about "processor groups" in Crafty.
>It has not been fully tweaked and tuned, but the code has been there for
>several years to better fit machines where some processors work together
>"better" than others because of the "node" concept.
>
>So a lot depends on the architecture.  A lot more depends on the program
>and the programmer's understanding of the architecture and what makes it look
>good or bad.
>
>Theoretically, _nothing_ prevents a program from scaling to 1024 processors.
>Practically, it is a real challenge.  But, unlike our resident NUMA expert,
>I'm not about to write NUMA and clusters and NUMA clusters off.  I think the
>problems are significant, but hardly unsolvable...
>
>
>
>
>>
>>4) Even with a 1:10 or 1:8 advantage crafty only barely manages to catch up or
>>beat these top order programs - so not much of a chance if they show up with the
>>above mentioned machines.
>
>That was what was not so clear in your post.  If you assume opteron-optimized
>fritz, vs Crafty on a significantly bigger box, maybe.  But the concept we are
>talking about was crafty at the 2003 WCCC event, not crafty in 2 years.  That
>means a big opteron, itanium or alpha machine vs a 4-way or 8-way xeon.  And
>from experience, the 8-way xeons are _not_ very good.  They use the same
>memory system as the 4-way boxes, which means 2x the cpus, 1x the memory
>bandwidth.  Not a good mix for programs that really have a high memory bandwidth
>requirement like chess engines with their big hash tables, which run afoul of
>PIV's with their long L2 cache lines, and the corresponding cache conflicts
>that arise.
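
The cache line conflict mentioned above can be illustrated with plain address arithmetic. The 128-byte line size and the 8-byte per-thread counters are assumptions chosen for the example, the helper names are made up, and the array is assumed to start on a line boundary.

```python
def cache_line_index(offset, line_size):
    """Index of the cache line that holds the byte at this offset."""
    return offset // line_size


def share_a_line(offset_a, offset_b, line_size):
    """True if two byte offsets fall into the same cache line."""
    return cache_line_index(offset_a, line_size) == cache_line_index(offset_b, line_size)


def padded_size(size, line_size):
    """Round an object size up to a whole number of cache lines."""
    return -(-size // line_size) * line_size


if __name__ == "__main__":
    LINE = 128      # assumed cache line size in bytes
    COUNTER = 8     # assumed size of one per-thread counter in bytes

    packed = [i * COUNTER for i in range(4)]
    padded = [i * padded_size(COUNTER, LINE) for i in range(4)]

    print("packed, threads 0 and 1 share a line:",
          share_a_line(packed[0], packed[1], LINE))
    print("padded, threads 0 and 1 share a line:",
          share_a_line(padded[0], padded[1], LINE))
```

When counters written by different processors sit in the same line, every write by one processor invalidates the line in the others' caches even though the data is logically private. Padding each item to a full line trades memory for fewer such conflicts.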
>
>So, again, the question was "did I not go because Crafty had no chance?"  The
>answer is clearly "no".  Crafty beat Rebel pretty handily at 8:1.  It scraped
>by Junior.  It may or may not beat Shredder.  But, it _is_ competitive.  It
>had chances.  That's the point here.
>
>Given another year of NUMA activity, it will be _more_ competitive.
>
>>
>>Mridul
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.