Computer Chess Club Archives

Subject: Re: Looks like a Crafty problem to me...

Author: Eugene Nalimov

Date: 09:10:24 11/13/03

On November 12, 2003 at 23:22:53, Aaron Gordon wrote:

>On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:
>
>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>
>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>
>>>>My point is: it's possible that due to the fact that quad Opteron is NUMA --
>>>>not SMP -- system, for SMP-only program performance on quad Opteron can be
>>>>worse than on *real* quad SMP system, even when for one CPU Opteron
>>>>performance is much better. Itanium was used only as an example of such
>>>>system, I never recommended rewriting any program for it.
>>>
>>>I don't understand how. The NUMA part is RAM. Even worst case on the Opteron
>>>RAM is faster than Xeon SMP. So how could it ever be worse?
>>>
>>>--
>>>GCP
>>
>>I can think of several reasons why scaling is very bad if all the memory was
>>allocated at one CPU:
>>
>>(1) Memory *bandwidth*. All the memory requests go to exactly that CPU, so all
>>CPUs have to use exactly one (or two) channels to memory. On Xeons *worst case*
>>memory bandwidth is higher.
>>
>>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact that
>>*one* CPU has to process memory requests for *all* CPUs. Also notice that
>>for "normal" topology
>>
>>  0----1
>>  |    |
>>  |    |
>>  2----3
>>
>>CPU#3 has to go through either CPU#1 or CPU#2 to reach memory of CPU#0.
>>
>>(3) MOESI vs. MESI synchronisation protocols -- I was told that on MOESI (used
>>by AMD) traffic due to shared *modified* cache lines is much higher than on MESI
>>(used by Intel). If it is really so (I didn't investigate it myself) it probably
>>can explain why on 32-bit Athlons Crafty prior to 19.5 scaled worse than on
>>Pentium 4.
>>
>>In any case here are results of Crafty 19.4 scaling on 2 different Opteron
>>systems, and on Itanium2 system (measured before Crafty became NUMA-aware, and
>>before we decreased the amount of shared modifiable data):
>>
>>Opteron system I:
>>2 CPUs:    1.57x
>>3 CPUs:    1.99x
>>4 CPUs:    1.98x
>>
>>Opteron system II:
>>2 CPUs:    1.61x
>>3 CPUs:    2.13x
>>4 CPUs:    2.35x
>>
>>Itanium2 system:
>>2 CPUs:    1.84x
>>3 CPUs:    2.63x
>>4 CPUs:    3.22x
>>
>>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>>
>>Thanks,
>>Eugene
>
>So, are you saying it needs special NUMA code to get 'full' bandwidth and that
>it defaults to a single memory channel? Running Windows 2k, XP, and other SMP
>operating systems the Opteron *always* gets the full memory bandwidth across all
>of its cpus. Hardware test pages ran all kinds of reviews/tests and every single
>one showed a dual/quad pulls a ridiculous amount of bandwidth. Remove the chips
>and it goes back down accordingly.
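
One way to read the Crafty 19.4 figures quoted above is as parallel efficiency, i.e. speedup divided by CPU count. The sketch below only re-arranges the numbers from the post; the names SPEEDUPS, efficiency and efficiency_table are made up for illustration and do not come from Crafty or from the thread.

```python
# Speedups quoted in the thread for Crafty 19.4, relative to one CPU.
# Keys are CPU counts, values are the reported speedups.
SPEEDUPS = {
    "Opteron system I": {2: 1.57, 3: 1.99, 4: 1.98},
    "Opteron system II": {2: 1.61, 3: 2.13, 4: 2.35},
    "Itanium2 system": {2: 1.84, 3: 2.63, 4: 3.22},
}


def efficiency(speedup, cpus):
    """Parallel efficiency: 1.0 would mean perfect linear scaling."""
    return speedup / cpus


def efficiency_table(speedups=SPEEDUPS):
    """Return {system: {cpus: efficiency}} rounded to three decimals."""
    return {
        system: {n: round(efficiency(s, n), 3) for n, s in per_cpu.items()}
        for system, per_cpu in speedups.items()
    }


if __name__ == "__main__":
    for system, per_cpu in efficiency_table().items():
        row = ", ".join(f"{n}P: {e:.3f}" for n, e in sorted(per_cpu.items()))
        print(f"{system}: {row}")
```

Read this way, Opteron system I at 4 CPUs sits at roughly 0.50, against roughly 0.80 for the Itanium2 system, while the 3.8x quoted for Crafty 19.5 on Opteron system II corresponds to 0.95.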

Can you please point me to the test that allocates all the memory on one CPU,
and then *all the CPUs* read and write to *that* memory? I am not talking about
SPECrate type of test where you just run several independent (or almost
independent) processes simultaneously.

Thanks,
Eugene
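
For the scenario asked about here (all memory allocated on one CPU, every CPU accessing it), the square topology drawn in the quoted post can be walked with a small breadth-first search to count link hops to the node that owns the memory. This is only a sketch of the topology argument: LINKS and hops_from are hypothetical names, and hop counts say nothing by themselves about real latency or bandwidth.

```python
from collections import deque

# Square topology from the quoted post: links 0-1, 0-2, 1-3, 2-3.
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}


def hops_from(home, links=LINKS):
    """Breadth-first search: number of link hops from each CPU to `home`."""
    dist = {home: 0}
    queue = deque([home])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


if __name__ == "__main__":
    # All memory assumed to live on CPU#0, as in the case under discussion.
    for cpu, hops in sorted(hops_from(0).items()):
        print(f"CPU#{cpu} -> memory of CPU#0: {hops} hop(s)")
```

With the memory on CPU#0, CPU#3 is the only one that needs two hops, which matches the remark that it has to go through either CPU#1 or CPU#2.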

Re: Looks like a Crafty problem to me... Aaron Gordon 09:43:38 11/13/03
- Re: Looks like a Crafty problem to me... Eugene Nalimov 11:47:25 11/13/03
  - Re: Looks like a Crafty problem to me... Aaron Gordon 13:12:27 11/13/03
    - Re: Looks like a Crafty problem to me... Eugene Nalimov 14:26:53 11/13/03
    - Re: Looks like a Crafty problem to me... Robert Hyatt 13:54:33 11/13/03

Last modified: Thu, 15 Apr 21 08:11:13 -0700