Computer Chess Club Archives



Subject: Re: Looks like a Crafty problem to me...

Author: Aaron Gordon

Date: 13:12:27 11/13/03



On November 13, 2003 at 14:47:25, Eugene Nalimov wrote:

>On November 13, 2003 at 12:43:38, Aaron Gordon wrote:
>
>>On November 13, 2003 at 12:10:24, Eugene Nalimov wrote:
>>
>>>On November 12, 2003 at 23:22:53, Aaron Gordon wrote:
>>>
>>>>On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:
>>>>
>>>>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>>>>
>>>>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>>>>
>>>>>>>My point is: because a quad Opteron is a NUMA system -- not an SMP
>>>>>>>system -- it's possible that an SMP-only program performs worse on a quad
>>>>>>>Opteron than on a *real* quad SMP system, even though single-CPU Opteron
>>>>>>>performance is much better. Itanium was used only as an example of such a
>>>>>>>system; I never recommended rewriting any program for it.
>>>>>>
>>>>>>I don't understand how. The NUMA part is RAM. Even in the worst case,
>>>>>>Opteron RAM is faster than Xeon SMP. So how could it ever be worse?
>>>>>>
>>>>>>--
>>>>>>GCP
>>>>>
>>>>>I can think of several reasons why scaling is very bad if all the memory is
>>>>>allocated at one CPU:
>>>>>
>>>>>(1) Memory *bandwidth*. All the memory requests go to exactly that CPU, so all
>>>>>CPUs have to use exactly one (or two) channels to memory. On Xeons *worst case*
>>>>>memory bandwidth is higher.
>>>>>
>>>>>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact
>>>>>that *one* CPU has to process memory requests for *all* CPUs. Also notice
>>>>>that for the "normal" topology
>>>>>
>>>>>  0----1
>>>>>  |    |
>>>>>  |    |
>>>>>  2----3
>>>>>
>>>>>CPU#3 has to go through either CPU#1 or CPU#2 to reach memory of CPU#0.
>>>>>
>>>>>(3) MOESI vs. MESI synchronisation protocols -- I was told that on MOESI
>>>>>(used by AMD) traffic due to shared *modified* cache lines is much higher
>>>>>than on MESI (used by Intel). If that is really so (I haven't investigated
>>>>>it myself), it probably explains why, on 32-bit Athlons, Crafty prior to
>>>>>19.5 scaled worse than on Pentium 4.
>>>>>
>>>>>In any case, here are the results of Crafty 19.4 scaling on 2 different
>>>>>Opteron systems and on an Itanium2 system (measured before Crafty became
>>>>>NUMA-aware and before we decreased the amount of shared modifiable data):
>>>>>
>>>>>Opteron system I:
>>>>>2 CPUs:    1.57x
>>>>>3 CPUs:    1.99x
>>>>>4 CPUs:    1.98x
>>>>>
>>>>>Opteron system II:
>>>>>2 CPUs:    1.61x
>>>>>3 CPUs:    2.13x
>>>>>4 CPUs:    2.35x
>>>>>
>>>>>Itanium2 system:
>>>>>2 CPUs:    1.84x
>>>>>3 CPUs:    2.63x
>>>>>4 CPUs:    3.22x
>>>>>
>>>>>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>>>>>
>>>>>Thanks,
>>>>>Eugene
>>>>
>>>>So, are you saying it needs special NUMA code to get 'full' bandwidth, and
>>>>that it defaults to a single memory channel? Running Windows 2000, XP, and
>>>>other SMP operating systems, the Opteron *always* gets the full memory
>>>>bandwidth across all of its CPUs. Hardware review sites ran all kinds of
>>>>tests, and every single one showed that a dual/quad pulls an enormous
>>>>amount of bandwidth. Remove chips and the bandwidth drops accordingly.
>>>
>>>Can you please point me to the test that allocates all the memory on one
>>>CPU, and then has *all the CPUs* read and write to *that* memory? I am not
>>>talking about a SPECrate type of test where you just run several independent
>>>(or almost independent) processes simultaneously.
>>>
>>>Thanks,
>>>Eugene
>>
>>Fire up Sciencemark 2.0 (www.sciencemark.org) and/or Sisoft's Sandra
>>(http://www.sisoftware.co.uk/index.html?dir=dload&location=sware_dl_x86&langx=en&a=).
>
>(1) Sandra is NUMA-aware, so it's pointless to run it. From FAQ on
>www.sisoftware.co.uk:
>
>Q: Does Sandra detect NUMA systems?
>A: Yes, Sandra 2003/SP2 (9.55) or later does support NUMA systems; you also need
>Windows XP/2003 or later for proper NUMA support.
>
>(2) Sciencemark: until recently, "Memory benchmark is run on processor 0 only;
>this has the side effect of not being able to measure on NUMA and non-NUMA
>systems the effect of accessing processor 1's memory latency."
>
>Thanks,
>Eugene

Then why does Sciencemark show a marked increase in memory bandwidth with
multiple CPUs, as do previous versions of SiSoft Sandra? Any ideas as to why
that is?




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.