Computer Chess Club Archives



Subject: Re: Looks like a Crafty problem to me...

Author: Robert Hyatt

Date: 11:41:56 11/13/03



On November 13, 2003 at 12:50:09, Aaron Gordon wrote:

>On November 13, 2003 at 10:13:14, Robert Hyatt wrote:
>
>>On November 12, 2003 at 23:22:53, Aaron Gordon wrote:
>>
>>>On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:
>>>
>>>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>>>
>>>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>>>
>>>>>>My point is: it's possible that because the quad Opteron is a NUMA --
>>>>>>not SMP -- system, an SMP-only program can perform worse on a quad
>>>>>>Opteron than on a *real* quad SMP system, even when single-CPU Opteron
>>>>>>performance is much better. Itanium was used only as an example of such
>>>>>>a system; I never recommended rewriting any program for it.
>>>>>
>>>>>I don't understand how. The NUMA part is the RAM, and even in the worst
>>>>>case Opteron RAM is faster than Xeon SMP. So how could it ever be worse?
>>>>>
>>>>>--
>>>>>GCP
>>>>
>>>>I can think of several reasons why scaling is very bad if all the memory is
>>>>allocated at one CPU:
>>>>
>>>>(1) Memory *bandwidth*. All the memory requests go to that one CPU, so all
>>>>CPUs have to share its one (or two) channels to memory. On Xeons the *worst
>>>>case* memory bandwidth is higher.
>>>>
>>>>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact that
>>>>*one* CPU has to process memory requests for *all* CPUs. Also notice that
>>>>with the "normal" topology
>>>>
>>>>  0----1
>>>>  |    |
>>>>  |    |
>>>>  2----3
>>>>
>>>>CPU#3 has to go through either CPU#1 or CPU#2 to reach the memory of CPU#0.
>>>>
>>>>(3) MOESI vs. MESI synchronisation protocols -- I was told that on MOESI (used
>>>>by AMD) traffic due to shared *modified* cache lines is much higher than on MESI
>>>>(used by Intel). If that is really so (I haven't investigated it myself), it
>>>>probably explains why on 32-bit Athlons Crafty prior to 19.5 scaled worse than
>>>>on a Pentium 4.
>>>>
>>>>In any case, here are the results of Crafty 19.4 scaling on 2 different Opteron
>>>>systems and on an Itanium2 system (measured before Crafty became NUMA-aware and
>>>>we decreased the amount of shared modifiable data):
>>>>
>>>>Opteron system I:
>>>>2 CPUs:    1.57x
>>>>3 CPUs:    1.99x
>>>>4 CPUs:    1.98x
>>>>
>>>>Opteron system II:
>>>>2 CPUs:    1.61x
>>>>3 CPUs:    2.13x
>>>>4 CPUs:    2.35x
>>>>
>>>>Itanium2 system:
>>>>2 CPUs:    1.84x
>>>>3 CPUs:    2.63x
>>>>4 CPUs:    3.22x
>>>>
>>>>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>>>>
>>>>Thanks,
>>>>Eugene
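
Eugene's topology point is easy to see on a live Linux box: libnuma can
print the node distance matrix, where 10 means local and larger values
mean more hops. A minimal sketch, unrelated to Crafty's code (build with
gcc dist.c -lnuma):

  #include <stdio.h>
  #include <numa.h>

  /* Print the NUMA node distance matrix.  On the square topology drawn
     above, node 0 -> node 3 should report a larger distance than the
     directly connected node 0 -> node 1. */
  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support on this system\n");
          return 1;
      }
      int nodes = numa_max_node() + 1;
      for (int from = 0; from < nodes; from++) {
          for (int to = 0; to < nodes; to++)
              printf("%4d", numa_distance(from, to));
          printf("\n");
      }
      return 0;
  }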
>>>
>>>So, are you saying it needs special NUMA code to get 'full' bandwidth and that
>>>it defaults to a single memory channel? Running Windows 2k, XP, and other SMP
>>>operating systems, the Opteron *always* gets the full memory bandwidth across
>>>all of its CPUs.
>>
>>No, you are greatly missing the point.  On a 4-way NUMA box, it is possible
>>that all critical data is in the memory of one CPU.  There is no way you can
>>get "full memory bandwidth" when all four cpus are beating on that one
>>router to get access to that local memory.  It has become a "hot spot".  This
>>is a common NUMA problem that is well-known.  It isn't just an Opteron
>>issue.  It is an IA64 issue, or a CM-x issue, or whatever.  Any machine with
>>NUMA has this potential issue.
>>
>>If you think that four CPUs can suck data out of one local memory at 4x the
>>speed one CPU can, you are wrong.
>
>Are you talking speedups (like 3.5x != true 4x), or even close to 4x? There have
>been a few benchmarks, I believe with lmbench, SiSoft Sandra, and ScienceMark,
>showing a near-linear improvement in memory bandwidth from adding multiple
>processors. It may be using all the processors to get that speedup, but it was
>definitely pulling some major bandwidth.

Yes, but again, not the way I was describing.  Put the data on one block of
local memory.  Now you won't see that major bandwidth.  One router cannot
possibly keep up with three other routers asking it for data, plus its
local CPU also asking for memory.

That's the NUMA weakness...
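
In libnuma terms, the two situations look roughly like this; the table
size and names are made up for illustration, and this is not Crafty's
actual allocation code:

  #include <stdlib.h>
  #include <numa.h>

  #define TABLE_BYTES (64UL * 1024 * 1024)   /* made-up table size */

  void *alloc_table(int spread) {
      if (numa_available() < 0)
          return malloc(TABLE_BYTES);        /* plain SMP fallback */
      if (spread)
          /* Pages go round-robin across all nodes, so every memory
             controller serves a share of the requests -- no hot spot. */
          return numa_alloc_interleaved(TABLE_BYTES);
      /* Everything on node 0: all four CPUs funnel their requests
         through node 0's controller -- the hot spot described above. */
      return numa_alloc_onnode(TABLE_BYTES, 0);
  }

Interleaving trades a little average latency for even loading; the bigger
win, as described below, is keeping each thread's data on its own node.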


>
>The original argument, however, was that the Quad Opteron would be better on Deep
>Fritz 8 than the Xeons would be. Since a Dual Opteron and Dual Athlon exhibit none
>of the abnormal slowdowns that Crafty did, I assumed it wouldn't suffer from the
>slowdowns you and Nalimov have been speaking of. It may happen in other programs,
>yes, but for Deep Fritz I don't see this happening.

It depends on how Deep Fritz does the parallel stuff.  It might be lucky.
It might not be...  But there are _definitely_ issues that have to be
explicitly programmed for to take full advantage of the Opteron...
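
One explicit step of that kind, sketched with libnuma (search_thread and
split_block_t are placeholder names here, not Deep Fritz or Crafty
internals): pin each search thread to a node, then allocate its working
set locally so the memory it beats on is attached to that same node.

  #include <numa.h>

  typedef struct {
      long nodes_searched;   /* placeholder per-thread search state */
  } split_block_t;

  void *search_thread(void *arg) {
      int node = (int)(long)arg;
      numa_run_on_node(node);            /* keep this thread on one node */
      split_block_t *sb =                /* ...so a local allocation puts */
          numa_alloc_local(sizeof *sb);  /* it in that node's own memory  */
      /* ... run the search out of sb ... */
      numa_free(sb, sizeof *sb);
      return 0;
  }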


>
>
>>AMD doesn't make that clear, but
>>benchmark tests show it plainly.  The only way to get 4x the total bandwidth
>>is for each CPU to access data from one specific local memory so that every
>>local memory is busy.  Best performance comes when each CPU mainly accesses
>>its own local memory.  That produces minimum latency and maximum bandwidth.
>
>
>
>>> Hardware test sites ran all kinds of reviews/tests, and every single
>>>one showed that a dual/quad pulls a ridiculous amount of bandwidth. Remove the
>>>chips and it goes back down accordingly.
>>>
>>>This looks like a Crafty problem, NOT an Opteron problem. Why test software that
>>>already has problems? Those speedup numbers for 2 CPUs are identical to the dual
>>>Athlon numbers, which Crafty also had problems with. Try Deep Sjeng, for
>>>example, and see how that turns out. I'm sure Gian would send it to you for
>>>speed-testing if you don't have it already. I'd be willing to bet
>>>the Opteron mops up the Xeons, as I mentioned before. Want to give it a shot, at
>>>least until Crafty is working properly?
>>
>>Crafty is/was working just fine.  AMD's cache coherency is based on MOESI
>>rather than MESI.  This is a trade-off.  Sun was the first I saw that used
>>MOESI (although someone else may well have done it earlier for all I know).
>>
>>This coherency protocol is a trade-off: it lets well-behaved programs run a
>>bit faster while poorly behaved programs run slower.  Well-behaved here means
>>that a processor mainly beats on its own local memory.  MESI does slightly
>>worse for the well-behaved case, but significantly better for the poorly
>>behaved one.
>>
>>Once we found out what was causing the cache problem, it was not hard to fix.
>>I'd bet that almost every chess program will have this same problem, because
>>the thing I was doing is _very_ common.  Of course it only affects parallel
>>programs to start with.
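
The usual shape of that very common thing -- a generic sketch, not the
actual Crafty 19.5 change -- is per-thread counters packed into one cache
line, so every write keeps migrating the modified line between CPUs:

  #define CACHE_LINE 64
  #define NCPUS 4

  /* Bad: four 8-byte counters can all land in a single 64-byte cache
     line, and every increment bounces that line from CPU to CPU. */
  long counters_bad[NCPUS];

  /* Usual fix: pad each counter to its own aligned line so writes stay
     in the owning CPU's cache. */
  struct padded_counter {
      long count;
      char pad[CACHE_LINE - sizeof(long)];
  } __attribute__((aligned(CACHE_LINE)));

  struct padded_counter counters_good[NCPUS];

Under MOESI the dirty line can be handed around without a memory
writeback, but it still has to move on every remote write, which is the
shared-modified traffic Eugene mentioned above.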


