Author: Robert Hyatt
Date: 09:08:03 09/03/03
On September 03, 2003 at 10:38:05, Vincent Diepeveen wrote:

>On September 01, 2003 at 23:53:20, Robert Hyatt wrote:
>
>>On August 29, 2003 at 19:56:34, Jeremiah Penery wrote:
>>
>>>On August 29, 2003 at 18:40:51, Eugene Nalimov wrote:
>>>
>>>>On August 29, 2003 at 18:32:46, Jeremiah Penery wrote:
>>>>
>>>>>Of course I know that. My point is that with Opteron, even if you are accessing
>>>>>non-local memory *always*, you are not accessing it slower than you would with,
>>>>>say, a traditional SMP machine (2x Xeon, for instance).
>>>>>Of course you can do a lot better - all I'm saying is that there's no way you're
>>>>>going to be doing worse.
>>>>>
>>>>>Either way you win, even with a crappy NUMA algorithm.
>>>>
>>>>I am not so sure. With some NUMA implementations each memory bank has limited
>>>>bandwidth, so if you happened to allocate all the critical data in one node's
>>>>memory you'll overload its memory controller.
>>>>
>>>>I had seen a case where an SMP application was blindly ported to a 32-CPU NUMA
>>>>system (8 nodes, 4 64-bit CPUs per node, 256Gb RAM total). The application ran
>>>>much slower on 32 CPUs than on a single CPU.
>>>
>>>I'm not talking about "some NUMA implementations". I'm talking about 2-4
>>>processor Opteron implementation. It should never have any of the problems you
>>>describe. Indeed, you can see from SPECRate that it scales very nearly as well
>>>as Itanium, and that still with compilers/OS still not very NUMA aware or very
>>>good for AMD64.
>>
>>Look at the SPEC programs. Then look at _the_ problem I mentioned for Crafty.
>>It is almost guaranteed that _all_ critical search data for _all_ threads will
>>be allocated in a single processor's local memory. That is going to be a
>>hot-spot and the fancy redundant memory controllers will _not_ be able to hide that.
>>
>>You can't do 4x memory reads to a single bank. Yet Crafty is going to demand
>>just that. And performance is going to suffer. _significantly_.
>
>Oh I'm not sure about Opteron, but at the quads of the Origin 3800 (each node is
>a quad) you can do READS in parallel.

Yes, but not to the _SAME_ bank of memory. Same problem on the Cray. But the
Cray has _many_ banks of memory, which is why it costs way more than the
Origin boxes.

>
>In Diep I profit from this.
>
>But about local memory you are correct of course when talking about internode
>traffic.
>
>>It is fixable. But it isn't fixed in the current implementation.
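
To make the "it is fixable" point concrete, here is a minimal sketch of one common approach: let each search thread allocate and first write its own working data, rather than having one thread set up everything. It is not Crafty's actual code. The names `worker_t`, `worker_main`, `run_search_threads` and the 1 MB block size are made up for illustration. It also assumes the OS places a page on the node of the thread that first writes it (a first-touch policy) and that each thread stays on one node, neither of which this code enforces.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64
#define BLOCK_BYTES (1u << 20)  /* hypothetical size of one thread's search state */

/* Hypothetical per-thread record; not Crafty's actual data layout. */
typedef struct {
    int id;
    unsigned char *state;  /* allocated and first written by the owning thread */
    int ok;                /* set to 1 once the block has been initialized */
} worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = (worker_t *)arg;

    /* Allocate inside the thread instead of in the parent. */
    w->state = malloc(BLOCK_BYTES);
    if (w->state != NULL) {
        /* This write is the "first touch". Under a first-touch policy
           (an assumption about the OS) the pages end up in memory local
           to the node this thread is running on. */
        memset(w->state, 0, BLOCK_BYTES);
        w->ok = 1;
    }
    /* ... the search would run here, using only w->state ... */
    return NULL;
}

/* Starts nthreads workers and returns how many initialized their own
   block, or -1 if nthreads is out of range. */
int run_search_threads(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    worker_t w[MAX_THREADS];
    int started[MAX_THREADS];
    int i, good = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    for (i = 0; i < nthreads; i++) {
        w[i].id = i;
        w[i].state = NULL;
        w[i].ok = 0;
        started[i] = (pthread_create(&tid[i], NULL, worker_main, &w[i]) == 0);
    }
    for (i = 0; i < nthreads; i++) {
        if (started[i])
            pthread_join(tid[i], NULL);
        good += w[i].ok;
        free(w[i].state);  /* free(NULL) is harmless if the thread never ran */
    }
    return good;
}
```

Calling `run_search_threads(4)` on a four-processor box would give each thread its own block. Whether the blocks really land in four different memories depends on the OS placement policy and on the threads not migrating, so treat this as the shape of the fix rather than a guarantee.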
Last modified: Thu, 15 Apr 21 08:11:13 -0700