Author: Robert Hyatt
Date: 08:39:51 08/28/03
On August 28, 2003 at 01:00:22, Jeremiah Penery wrote:

>On August 27, 2003 at 19:45:43, Robert Hyatt wrote:
>
>>On August 27, 2003 at 18:31:30, Jeremiah Penery wrote:
>>
>>>I don't see why NUMA should pose any problems. It's all handled by the
>>>hardware and the operating system anyway.
>>
>>Not really. If I put something in my local memory, I can get to it much
>>faster than if it is in the local memory of _another_ processor. Let's
>>take a "split block" in Crafty, which contains _all_ the search-critical
>>data. If it is not in the local memory of the processor using that split
>>block to do the search, performance dies miserably. Since my split blocks
>>are (at present) just an array of big structures, they are in contiguous
>>memory and will exist in the local memory of only one processor. All the
>>others will run dog-slow, except for the one with the quick access.
>
>On a 2-way Opteron, accessing non-local memory should be at least as fast as
>accessing memory on a single-cpu P4 or Athlon system. For a 4-way Opteron, it
>still should not be worse, even if it requires 2 hops.

Perhaps. But don't forget that with two CPUs, half the memory _is_ slower than
the other half, by some fixed latency. A poor algorithm will definitely perform
worse than a good one, because the good one avoids that extra latency while
the poor one hits it all the time.

>>So yes, the hardware and O/S make it work, but it is up to the programmer to
>>make it work _efficiently_. In my case, I need to distribute "split blocks"
>>across processors, so that each processor has a few in its local memory. Then
>>when I need to give a processor something to do, I take the performance hit
>>(a short one) to copy from my local split block to his remote split block,
>>but then he runs like blazes with his local copy. Right now I don't have
>>the first hit, but the second is a killer, since only one processor has any
>>local split blocks.
>>
>>That takes a design change to correct.
>>One that is not needed on a non-NUMA-type architecture. There are other
>>issues that also cause problems, such as sharing data that causes lots of
>>cache transactions to keep things coherent.
>
>Cache coherency is just as much a problem on SMP machines as on NUMA ones.

No, it isn't, for the same reason NUMA memory access is more problematic than
pure SMP access: the cache controllers have the _same_ latency issue. A cache
controller "way over there" takes much longer to snoop/invalidate than one
"right next door." So you run into the _same_ issue again. The "farther apart"
two processors are, the less data you want to share in memory, because the
cache-coherency traffic is slower to handle...

>>>And all of my friends' and my single-cpu boxes have 4-way interleaving also.
>>
>>It seems relatively pointless on a single-cpu machine, since cache is already
>>loaded in "burst mode". And that's the point of SDRAM/DDR SDRAM/etc.: to
>>provide the next N blocks much more quickly than the first block.
>>
>>On a dual or quad, it makes a great deal more sense...
>
>I don't claim to know why they do it, but only that it exists.