Computer Chess Club Archives


Subject: Looks like a Crafty problem to me...

Author: Aaron Gordon

Date: 20:22:53 11/12/03


On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:

>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>
>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>
>>>My point is: it's possible that due to the fact that quad Opteron is NUMA --
>>>not SMP -- system, for SMP-only program performance on quad Opteron can be
>>>worse than on *real* quad SMP system, even when for one CPU Opteron
>>>performance is much better. Itanium was used only as an example of such
>>>system, I never recommended rewriting any program for it.
>>
>>I don't understand how. The NUMA part is RAM. Even worst case on the Opteron
>>RAM is faster than Xeon SMP. So how could it ever be worse?
>>
>>--
>>GCP
>
>I can think of several reasons why scaling is very bad if all the memory was
>allocated at one CPU:
>
>(1) Memory *bandwidth*. All the memory requests go to exactly that CPU, so all
>CPUs have to use exactly one (or two) channels to memory. On Xeons *worst case*
>memory bandwidth is higher.
>
>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact that
>*one* CPU has to process memory requests for *all* CPUs. Also notice that
>for "normal" topology
>
>  0----1
>  |    |
>  |    |
>  2----3
>
>CPU#3 has to go through either CPU#1 or CPU#2 to reach memory of CPU#0.
>
>(3) MOESI vs. MESI synchronisation protocols -- I was told that on MOESI (used
>by AMD) traffic due to shared *modified* cache lines is much higher than on MESI
>(used by Intel). If it is really so (I didn't investigate it myself) it probably
>can explain why on 32-bit Athlons Crafty prior to 19.5 scaled worse than on
>Pentium 4.
>
>In any case here are results of Crafty 19.4 scaling on 2 different Opteron
>systems, and on Itanium2 system (measured before Crafty became NUMA-aware, and
>we decreased amount of shared modifiable data):
>
>Opteron system I:
>2 CPUs:    1.57x
>3 CPUs:    1.99x
>4 CPUs:    1.98x
>
>Opteron system II:
>2 CPUs:    1.61x
>3 CPUs:    2.13x
>4 CPUs:    2.35x
>
>Itanium2 system:
>2 CPUs:    1.84x
>3 CPUs:    2.63x
>4 CPUs:    3.22x
>
>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>
>Thanks,
>Eugene
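
A small sketch of the topology point in (2), assuming only the square layout drawn in the quoted post: counting links with a breadth-first search suggests that CPU#3 is two hops from CPU#0's memory, while CPU#1 and CPU#2 are one hop away. The `LINKS` table and the `hops` helper are illustrative names, not anything taken from Crafty or from Opteron documentation.

```python
from collections import deque

# Square topology from the quoted diagram: each CPU links to two neighbours.
# (Assumed layout; real boards may wire the links differently.)
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}


def hops(src, dst, links=LINKS):
    """Breadth-first search: number of CPU-to-CPU links between src and dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("destination is not reachable")


# Links each CPU must cross to reach memory attached to CPU#0.
hops_to_cpu0 = {cpu: hops(cpu, 0) for cpu in LINKS}
print(hops_to_cpu0)
```

If all memory is allocated at CPU#0, every access from the other three CPUs crosses at least one link, and CPU#3's accesses also pass through CPU#1 or CPU#2.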

So, are you saying it needs special NUMA code to get 'full' bandwidth, and that
it otherwise defaults to a single memory channel? Under Windows 2000, XP, and
other SMP operating systems, the Opteron *always* gets the full memory bandwidth
across all of its CPUs. Hardware review sites have run all kinds of tests, and
every single one showed that a dual or quad system pulls a ridiculous amount of
bandwidth. Remove chips and it goes back down accordingly.

This looks like a Crafty problem, NOT an Opteron problem. Why test with software
that already has problems? Those speedup numbers for 2 CPUs are identical to the
dual Athlon numbers, which Crafty also had problems with. Try Deep Sjeng, for
example, and see how that turns out. I'm sure Gian would send it to you for
speed testing if you don't have it already. I'd be willing to bet the Opteron
mops up the Xeons, as I mentioned before. Want to give it a shot, at least until
Crafty is working properly?
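
For reference, the speedups quoted above can be turned into parallel efficiency (speedup divided by CPU count). This is only arithmetic on the figures Eugene posted for Crafty 19.4; the `SPEEDUPS` table and the `efficiency` helper are made-up names for illustration.

```python
# Crafty 19.4 speedups as quoted in the thread, keyed by CPU count.
SPEEDUPS = {
    "Opteron system I": {2: 1.57, 3: 1.99, 4: 1.98},
    "Opteron system II": {2: 1.61, 3: 2.13, 4: 2.35},
    "Itanium2 system": {2: 1.84, 3: 2.63, 4: 3.22},
}


def efficiency(speedup, cpus):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / cpus


for system, results in SPEEDUPS.items():
    for cpus, speedup in results.items():
        eff = efficiency(speedup, cpus)
        print(f"{system}: {cpus} CPUs  {speedup:.2f}x  efficiency {eff:.3f}")
```

By this measure, 1.98x on 4 CPUs is just under 50% efficiency, while the 3.8x reported for Crafty 19.5 on Opteron system II would be 95%.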




