Computer Chess Club Archives


Subject: Re: Looks like a Crafty problem to me...

Author: Robert Hyatt

Date: 13:54:33 11/13/03


On November 13, 2003 at 16:12:27, Aaron Gordon wrote:

>On November 13, 2003 at 14:47:25, Eugene Nalimov wrote:
>
>>On November 13, 2003 at 12:43:38, Aaron Gordon wrote:
>>
>>>On November 13, 2003 at 12:10:24, Eugene Nalimov wrote:
>>>
>>>>On November 12, 2003 at 23:22:53, Aaron Gordon wrote:
>>>>
>>>>>On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:
>>>>>
>>>>>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>>>>>
>>>>>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>>>>>
>>>>>>>>My point is: because a quad Opteron is a NUMA -- not SMP -- system,
>>>>>>>>it's possible that an SMP-only program performs worse on a quad Opteron
>>>>>>>>than on a *real* quad SMP system, even when single-CPU Opteron
>>>>>>>>performance is much better. Itanium was used only as an example of such
>>>>>>>>a system; I never recommended rewriting any program for it.
>>>>>>>
>>>>>>>I don't understand how. The NUMA part is the RAM, and even in the worst
>>>>>>>case Opteron RAM access is faster than on a Xeon SMP system. So how could
>>>>>>>it ever be worse?
>>>>>>>
>>>>>>>--
>>>>>>>GCP
>>>>>>
>>>>>>I can think of several reasons why scaling is very bad if all the memory
>>>>>>is allocated at one CPU:
>>>>>>
>>>>>>(1) Memory *bandwidth*. All the memory requests go to exactly that CPU, so
>>>>>>all CPUs have to share that one CPU's one (or two) channels to memory. On
>>>>>>Xeons the *worst case* memory bandwidth is higher.
>>>>>>
>>>>>>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact
>>>>>>that *one* CPU has to process memory requests for *all* CPUs. Also notice
>>>>>>that for the "normal" topology
>>>>>>
>>>>>>  0----1
>>>>>>  |    |
>>>>>>  |    |
>>>>>>  2----3
>>>>>>
>>>>>>CPU#3 has to go through either CPU#1 or CPU#2 to reach CPU#0's memory.
>>>>>>
>>>>>>(3) MOESI vs. MESI cache-coherence protocols -- I was told that on MOESI
>>>>>>(used by AMD) the traffic due to shared *modified* cache lines is much
>>>>>>higher than on MESI (used by Intel). If that is really so (I didn't
>>>>>>investigate it myself), it probably explains why on 32-bit Athlons Crafty
>>>>>>prior to 19.5 scaled worse than on Pentium 4.
>>>>>>
>>>>>>In any case, here are the results of Crafty 19.4 scaling on 2 different
>>>>>>Opteron systems and on an Itanium2 system (measured before Crafty became
>>>>>>NUMA-aware and we decreased the amount of shared modifiable data):
>>>>>>
>>>>>>Opteron system I:
>>>>>>2 CPUs:    1.57x
>>>>>>3 CPUs:    1.99x
>>>>>>4 CPUs:    1.98x
>>>>>>
>>>>>>Opteron system II:
>>>>>>2 CPUs:    1.61x
>>>>>>3 CPUs:    2.13x
>>>>>>4 CPUs:    2.35x
>>>>>>
>>>>>>Itanium2 system:
>>>>>>2 CPUs:    1.84x
>>>>>>3 CPUs:    2.63x
>>>>>>4 CPUs:    3.22x
>>>>>>
>>>>>>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>>>>>>
>>>>>>Thanks,
>>>>>>Eugene
>>>>>
>>>>>So, are you saying it needs special NUMA code to get 'full' bandwidth and
>>>>>that it defaults to a single memory channel? Running Windows 2k, XP, and
>>>>>other SMP operating systems, the Opteron *always* gets the full memory
>>>>>bandwidth across all of its CPUs. Hardware test sites have run all kinds of
>>>>>reviews/tests, and every single one showed a dual/quad pulling a ridiculous
>>>>>amount of bandwidth. Remove chips and it goes back down accordingly.
>>>>
>>>>Can you please point me to a test that allocates all the memory on one CPU
>>>>and then has *all the CPUs* read and write *that* memory? I am not talking
>>>>about a SPECrate type of test where you just run several independent (or
>>>>almost independent) processes simultaneously.
>>>>
>>>>Thanks,
>>>>Eugene
>>>
>>>Fire up Sciencemark 2.0 (www.sciencemark.org) and/or Sisoft's Sandra
>>>(http://www.sisoftware.co.uk/index.html?dir=dload&location=sware_dl_x86&langx=en&a=).
>>
>>(1) Sandra is NUMA-aware, so it's pointless to run it. From the FAQ on
>>www.sisoftware.co.uk:
>>
>>Q: Does Sandra detect NUMA systems?
>>A: Yes, Sandra 2003/SP2 (9.55) or later does support NUMA systems; you also need
>>Windows XP/2003 or later for proper NUMA support.
>>
>>(2) Sciencemark: until recently, "Memory benchmark is run on processor 0
>>only; this has the side effect of not being able to measure on NUMA and
>>non-NUMA systems the effect of accessing processor 1's memory latency."
>>
>>Thanks,
>>Eugene
>
>Then why does Sciencemark show a marked increase in memory bandwidth with
>multiple CPUs, as do previous versions of Sisoft's Sandra? Any ideas as to why
>that is?


Because they are using local memory addressing.  That's the point.  If the
benchmark authors understand what is going on, they will do local memory
accesses on each CPU so that the router-to-router delay is hidden.
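
To make the distinction concrete, here is a minimal sketch of the kind of
test Eugene describes, written against Linux's libnuma and pthreads (the
thread above is about Windows, but the idea is identical; the buffer size
and pass count are arbitrary choices of mine, not from any real benchmark).
Run once with every thread streaming through its *own* node's buffer --
what a NUMA-aware benchmark effectively measures -- and once with every
thread streaming through a buffer placed on node 0:

  /* Assumed environment: Linux, libnuma, pthreads.
     Compile with:  cc -O2 numa_stream.c -lnuma -lpthread       */
  #include <numa.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>

  #define BUF_BYTES (64UL * 1024 * 1024)   /* 64 MB per thread  */
  #define PASSES    16
  #define MAX_NODES 64

  struct arg { int node; char *buf; };

  static void *stream(void *p)
  {
      struct arg *a = p;
      numa_run_on_node(a->node);           /* pin thread to its node */
      volatile long sum = 0;
      for (int pass = 0; pass < PASSES; pass++)
          for (size_t i = 0; i < BUF_BYTES; i += 64)  /* 1 read/line */
              sum += a->buf[i];
      return NULL;
  }

  /* alloc_node == -1: each thread's memory on its own node (local).
     alloc_node ==  0: all buffers on node 0 (remote for most CPUs). */
  static double run(int nthreads, int alloc_node)
  {
      pthread_t tid[MAX_NODES];
      struct arg a[MAX_NODES];
      struct timespec t0, t1;

      for (int t = 0; t < nthreads; t++) {
          a[t].node = t;
          a[t].buf  = numa_alloc_onnode(BUF_BYTES,
                                        alloc_node < 0 ? t : alloc_node);
          memset(a[t].buf, 1, BUF_BYTES);  /* touch so pages are placed */
      }
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int t = 0; t < nthreads; t++)
          pthread_create(&tid[t], NULL, stream, &a[t]);
      for (int t = 0; t < nthreads; t++)
          pthread_join(tid[t], NULL);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      for (int t = 0; t < nthreads; t++)
          numa_free(a[t].buf, BUF_BYTES);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void)
  {
      if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }
      int n = numa_max_node() + 1;
      if (n > MAX_NODES) n = MAX_NODES;
      printf("%d threads, local buffers:  %.2f s\n", n, run(n, -1));
      printf("%d threads, node-0 buffers: %.2f s\n", n, run(n, 0));
      return 0;
  }

On a quad Opteron the node-0 run funnels every request through one CPU's
memory controller and the inter-CPU links, which is exactly reasons (1)
and (2) in Eugene's list; a benchmark that only ever does the local run
never sees that penalty.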


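Eugene's point (3) is about something different: shared *modified* cache
lines.  Here is a hypothetical illustration (not actual Crafty code, and
the counter name is made up) of the kind of change behind "decreased the
amount of shared modifiable data" in 19.5:

  /* When per-thread counters share one cache line, every increment
     bounces the Modified line between CPUs under MESI or MOESI
     ("false sharing").  Padding each counter out to its own 64-byte
     line keeps the writes local to one CPU's cache.                 */
  #define CACHE_LINE 64
  #define MAX_CPUS   4

  /* Bad: all four counters live in one cache line. */
  struct counters_shared {
      unsigned long nodes_searched[MAX_CPUS];
  };

  /* Better: one full cache line per CPU. */
  struct counters_padded {
      struct {
          unsigned long nodes_searched;
          char pad[CACHE_LINE - sizeof(unsigned long)];
      } per_cpu[MAX_CPUS];
  };

The padding trick itself is standard and works under either coherence
protocol; it simply matters more when the protocol makes shared-modified
traffic expensive.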
