Author: Aaron Gordon
Date: 20:22:53 11/12/03
On November 12, 2003 at 13:54:07, Eugene Nalimov wrote:

>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>
>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>
>>>My point is: it's possible that due to the fact that quad Opteron is NUMA --
>>>not SMP -- system, for SMP-only program performance on quad Opteron can be
>>>worse than on *real* quad SMP system, even when for one CPU Opteron
>>>performance is much better. Itanium was used only as an example of such
>>>system, I never recommended rewriting any program for it.
>>
>>I don't understand how. The NUMA part is RAM. Even worst case on the Opteron
>>RAM is faster than Xeon SMP. So how could it ever be worse?
>>
>>--
>>GCP
>
>I can think of several reasons why scaling is very bad if all the memory was
>allocated at one CPU:
>
>(1) Memory *bandwidth*. All the memory requests go to exactly that CPU, so all
>CPUs have to use exactly one (or two) channels to memory. On Xeons *worst case*
>memory bandwidth is higher.
>
>(2) CPU-to-CPU *bandwidth* -- memory transfer speed is limited by the fact that
>*one* CPU has to process memory requests for *all* CPUs. Also notice that
>for "normal" topology
>
>    0----1
>    |    |
>    |    |
>    2----3
>
>CPU#3 has to go through either CPU#1 or CPU#2 to reach memory of CPU#0.
>
>(3) MOESI vs. MESI synchronisation protocols -- I was told that on MOESI (used
>by AMD) traffic due to shared *modified* cache lines is much higher than on MESI
>(used by Intel). If it is really so (I didn't investigate it myself) it probably
>can explain why on 32-bit Athlons Crafty prior to 19.5 scaled worse than on
>Pentium 4.
>
>In any case here are results of Crafty 19.4 scaling on 2 different Opteron
>systems, and on Itanium2 system (measured before Crafty became NUMA-aware, and
>we decreased amount of shared modifiable data):
>
>Opteron system I:
>2 CPUs: 1.57x
>3 CPUs: 1.99x
>4 CPUs: 1.98x
>
>Opteron system II:
>2 CPUs: 1.61x
>3 CPUs: 2.13x
>4 CPUs: 2.35x
>
>Itanium2 system:
>2 CPUs: 1.84x
>3 CPUs: 2.63x
>4 CPUs: 3.22x
>
>Crafty 19.5 scales much better. On Opteron system II it reaches 3.8x on 4P.
>
>Thanks,
>Eugene

So, are you saying it needs special NUMA code to get 'full' bandwidth, and that
it defaults to a single memory channel? Running Windows 2k, XP, and other SMP
operating systems, the Opteron *always* gets the full memory bandwidth across
all of its CPUs. Hardware review sites ran all kinds of tests, and every single
one showed that a dual/quad pulls a ridiculous amount of bandwidth. Remove the
chips and it goes back down accordingly.

This looks like a Crafty problem, NOT an Opteron problem. Why test with software
that already has problems? Those speedup numbers for 2 CPUs are identical to the
dual Athlon numbers, which Crafty also had problems with.

Try Deep Sjeng, for example, and see how that turns out. I'm sure Gian would
send it to you for speed-testing if you don't have it already. I'd be willing
to bet the Opteron mops up the Xeons, as I mentioned before. Want to give it a
shot, at least until Crafty is working properly?
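
For anyone who wants to compare the quoted scaling figures directly, one way is to turn each speedup into a per-CPU efficiency (speedup divided by number of CPUs, so 1.0 would be perfect linear scaling). A minimal Python sketch follows; the dictionary layout and the function names are made up for illustration, and the only data in it are the Crafty 19.4 numbers from Eugene's post:

```python
# Crafty 19.4 speedups relative to 1 CPU, as quoted in Eugene's post.
# The dictionary layout and names below are illustrative assumptions.
SPEEDUPS = {
    "Opteron system I": {2: 1.57, 3: 1.99, 4: 1.98},
    "Opteron system II": {2: 1.61, 3: 2.13, 4: 2.35},
    "Itanium2 system": {2: 1.84, 3: 2.63, 4: 3.22},
}


def efficiency(speedup, cpus):
    """Parallel efficiency: fraction of ideal linear scaling achieved."""
    return speedup / cpus


def efficiency_table(speedups):
    """Map each system to {cpu_count: efficiency rounded to 3 places}."""
    return {
        system: {n: round(efficiency(s, n), 3) for n, s in runs.items()}
        for system, runs in speedups.items()
    }


if __name__ == "__main__":
    for system, row in efficiency_table(SPEEDUPS).items():
        cells = ", ".join(f"{n} CPUs: {e:.0%}" for n, e in sorted(row.items()))
        print(f"{system}: {cells}")
```

This only restates the posted numbers in a different form. It says nothing about whether the cause is memory placement, the cache protocol, or Crafty itself.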
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.