Author: Eugene Nalimov
Date: 11:47:25 11/13/03
On November 13, 2003 at 12:43:38, Aaron Gordon wrote:

>On November 13, 2003 at 12:10:24, Eugene Nalimov wrote:
>
>>On November 12, 2003 at 23:22:53, Aaron Gordon wrote:
>>
>>>On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:
>>>
>>>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>>>
>>>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>>>
>>>>>>My point is: it's possible that due to the fact that quad Opteron is NUMA --
>>>>>>not SMP -- system, for SMP-only program performance on quad Opteron can be
>>>>>>worse than on *real* quad SMP system, even when for one CPU Opteron
>>>>>>performance is much better. Itanium was used only as an example of such
>>>>>>system, I never recommended rewriting any program for it.
>>>>>
>>>>>I don't understand how. The NUMA part is RAM. Even worst case on the Opteron
>>>>>RAM is faster than Xeon SMP. So how could it ever be worse?
>>>>>
>>>>>--
>>>>>GCP
>>>>
>>>>I can think of several reasons why scaling is very bad if all the memory was
>>>>allocated at one CPU:
>>>>
>>>>(1) Memory *bandwidth*. All the memory requests go to exactly that CPU, so all
>>>>CPUs have to use exactly one (or two) channels to memory. On Xeons *worst case*
>>>>memory bandwidth is higher.
>>>>
>>>>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact that
>>>>*one* CPU has to process memory requests for *all* CPUs. Also notice that
>>>>for "normal" topology
>>>>
>>>>    0----1
>>>>    |    |
>>>>    |    |
>>>>    2----3
>>>>
>>>>CPU#3 has to go through either CPU#1 or CPU#2 to reach memory of CPU#0.
>>>>
>>>>(3) MOESI vs. MESI synchronisation protocols -- I was told that on MOESI (used
>>>>by AMD) traffic due to shared *modified* cache lines is much higher than on MESI
>>>>(used by Intel). If it is really so (I didn't investigate myself) it probably
>>>>can explain why on 32-bit Athlons Crafty prior to 19.5 scaled worse than on
>>>>Pentium 4.
>>>>
>>>>In any case here are results of Crafty 19.4 scaling on 2 different Opteron
>>>>systems, and on an Itanium2 system (measured before Crafty became NUMA-aware,
>>>>and before we decreased the amount of shared modifiable data):
>>>>
>>>>Opteron system I:
>>>>2 CPUs: 1.57x
>>>>3 CPUs: 1.99x
>>>>4 CPUs: 1.98x
>>>>
>>>>Opteron system II:
>>>>2 CPUs: 1.61x
>>>>3 CPUs: 2.13x
>>>>4 CPUs: 2.35x
>>>>
>>>>Itanium2 system:
>>>>2 CPUs: 1.84x
>>>>3 CPUs: 2.63x
>>>>4 CPUs: 3.22x
>>>>
>>>>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>>>>
>>>>Thanks,
>>>>Eugene
>>>
>>>So, are you saying it needs special NUMA code to get 'full' bandwidth and that
>>>it defaults to a single memory channel? Running Windows 2k, XP, and other SMP
>>>operating systems the Opteron *always* gets the full memory bandwidth across all
>>>of its cpus. Hardware test pages ran all kinds of reviews/tests and every single
>>>one showed a dual/quad pulls a ridiculous amount of bandwidth. Remove the chips
>>>and it goes back down accordingly.
>>
>>Can you please point me to the test that allocates all the memory on one CPU,
>>and then *all the CPUs* read and write to *that* memory? I am not talking about
>>SPECrate type of test where you just run several independent (or almost
>>independent) processes simultaneously.
>>
>>Thanks,
>>Eugene
>
>Fire up Sciencemark 2.0 (www.sciencemark.org) and/or Sisoft's Sandra
>(http://www.sisoftware.co.uk/index.html?dir=dload&location=sware_dl_x86&langx=en&a=).

(1) Sandra is NUMA-aware, so it's pointless to run it for the case I asked
about. From the FAQ on www.sisoftware.co.uk:

Q: Does Sandra detect NUMA systems?
A: Yes, Sandra 2003/SP2 (9.55) or later does support NUMA systems; you also need
Windows XP/2003 or later for proper NUMA support.

(2) Sciencemark: until recently, "Memory benchmark is run on processor 0 only;
this has the side effect of not being able to measure on NUMA and non-NUMA
systems the effect of accessing processor 1's memory latency."

Thanks,
Eugene
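A small sketch of point (2) and of the Crafty 19.4 scaling figures quoted in this thread, assuming the square topology in the diagram (links 0-1, 0-2, 1-3, 2-3) is the whole interconnect; real boards may be wired differently. `LINKS`, `SPEEDUPS`, `hops_to` and `efficiency` are hypothetical names for illustration, not part of Crafty or any benchmark. It counts how many links each CPU crosses to reach memory that was all allocated on CPU#0, and divides each quoted speedup by the CPU count to get parallel efficiency.

```python
from collections import deque

# Hypothetical model of the "normal" 4-CPU Opteron topology from the diagram:
# links 0-1, 0-2, 1-3, 2-3 only (no direct 0-3 or 1-2 link).
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

# Crafty 19.4 speedups quoted in the thread (CPU count -> speedup over 1 CPU).
SPEEDUPS = {
    "Opteron system I": {2: 1.57, 3: 1.99, 4: 1.98},
    "Opteron system II": {2: 1.61, 3: 2.13, 4: 2.35},
    "Itanium2 system": {2: 1.84, 3: 2.63, 4: 3.22},
}


def hops_to(home, links=LINKS):
    """Breadth-first search: links each CPU must cross to reach `home`'s memory."""
    dist = {home: 0}
    queue = deque([home])
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def efficiency(speedup, cpus):
    """Parallel efficiency: fraction of ideal linear speedup actually achieved."""
    return speedup / cpus


if __name__ == "__main__":
    # All memory allocated on CPU#0: how far away is each CPU?
    print("hops to CPU#0 memory:", hops_to(0))
    for system, table in SPEEDUPS.items():
        row = ", ".join(f"{n}P {efficiency(s, n):.2f}" for n, s in table.items())
        print(f"{system}: {row}")
```

With all the memory on CPU#0, CPUs #1 and #2 are one hop away and CPU#3 is two, and every remote request also funnels through CPU#0's memory controller. On Opteron system I the 4-CPU efficiency is just under 0.5 (1.98/4), against about 0.8 (3.22/4) on the Itanium2 system.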
Last modified: Thu, 15 Apr 21 08:11:13 -0700