Author: Jeremiah Penery
Date: 22:00:22 08/27/03
On August 27, 2003 at 19:45:43, Robert Hyatt wrote:

>On August 27, 2003 at 18:31:30, Jeremiah Penery wrote:
>
>>I don't see why NUMA should pose any problems. It's all handled by the hardware
>>and the operating system anyway.
>
>Not really. If I put something in my local memory I can get to it much
>faster than if it is in the local memory of _another_ processor. Lets
>take a "split block" in crafty, which contains _all_ the search-critical
>data. If it is not in my local memory for the processor using that split
>block to do the search, performance dies miserably. Since my split blocks
>are (at present) just an array of big structures, they are in contiguous
>memory and they will exist in the local memory of only one processor. All
>others will run dog slow, except for the one with the quick access.

On a 2-way Opteron, accessing non-local memory should be at least as fast as
accessing memory on a single-cpu P4 or Athlon system. For a 4-way Opteron, it
still should not be worse, even if it requires 2 hops.

>So yes, the hardware and O/S make it work, but it is up to the programmer to
>make it work _efficiently_. In my case, I need to distribute "split blocks"
>across processors, so that each processor has a few in its local memory. Then
>when I need to give a processor something to do, I take the performance hit
>(a short one) to copy from my local split block to his remote split block,
>but then he runs like blazes with his local copy. Right now I don't have
>the first hit, but the second is a killer since only one processor has any
>local split blocks.
>
>That takes a design change to correct. One that is not needed on a non-NUMA
>type architecture. There are other issues that also cause problems, such
>as sharing data that causes lots of cache transactions to keep things coherent.

Cache coherency is just as much a problem on SMP machines as on NUMA ones.

>>And all of my friends' and my single-cpu boxes have 4-way interleaving also.
>
>It seems relatively pointless on a single-cpu machine, since cache is already
>loaded in "burst mode". And that's the point of SDRAM/DDRAM/etc, to provide
>the next N blocks much more quickly than the first block.
>
>On a dual or quad, it makes a great deal more sense...

I don't claim to know why they do it, but only that it exists.
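
For reference, the redesign described in the quoted text (a few split blocks per processor, with one copy at hand-off) might look roughly like the C below. This is only a sketch under stated assumptions: the struct fields, NUM_CPUS, BLOCKS_PER_CPU, init_pool, hand_off and release_block are all made-up names, not Crafty's actual code, and plain calloc does not by itself guarantee that a pool ends up in a given processor's local memory.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_CPUS        4   /* hypothetical processor count */
#define BLOCKS_PER_CPU  8   /* hypothetical number of split blocks per processor */

/* Stand-in for the search-critical state; every field here is invented. */
typedef struct {
    int in_use;
    int owner_cpu;
    int ply;
    int alpha, beta;
    unsigned char board[64];
} split_block;

/* One separately allocated pool per processor, instead of a single
   contiguous array that lives in one processor's memory. */
static split_block *pool[NUM_CPUS];

/* Allocate and zero one processor's pool. On a real NUMA box this would
   need to run on that processor (or use a node-aware allocator) for the
   pages to be local; that part is assumed, not shown. */
int init_pool(int cpu) {
    assert(cpu >= 0 && cpu < NUM_CPUS);
    pool[cpu] = calloc(BLOCKS_PER_CPU, sizeof(split_block));
    return pool[cpu] != NULL;
}

/* Copy the parent's state into a free block in the target processor's
   pool. This is the single short remote copy; afterwards the helper
   works only on its own block. Returns NULL if the pool is full. */
split_block *hand_off(const split_block *parent, int target_cpu) {
    assert(target_cpu >= 0 && target_cpu < NUM_CPUS);
    assert(pool[target_cpu] != NULL);
    for (int i = 0; i < BLOCKS_PER_CPU; i++) {
        split_block *b = &pool[target_cpu][i];
        if (!b->in_use) {
            *b = *parent;              /* one-time copy of the search state */
            b->in_use = 1;
            b->owner_cpu = target_cpu;
            return b;
        }
    }
    return NULL;
}

/* Mark a block as free again once the helper has finished with it. */
void release_block(split_block *b) {
    b->in_use = 0;
}
```

Whether the one-time copy actually pays off depends on how much slower remote access is on a given machine, which is exactly the point in dispute above.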
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.