Computer Chess Club Archives


Subject: Re: The need to unmake move

Author: Robert Hyatt

Date: 08:39:51 08/28/03


On August 28, 2003 at 01:00:22, Jeremiah Penery wrote:

>On August 27, 2003 at 19:45:43, Robert Hyatt wrote:
>
>>On August 27, 2003 at 18:31:30, Jeremiah Penery wrote:
>>
>>>I don't see why NUMA should pose any problems.  It's all handled by the hardware
>>>and the operating system anyway.
>>
>>Not really.  If I put something in my local memory I can get to it much
>>faster than if it is in the local memory of _another_ processor.  Let's
>>take a "split block" in Crafty, which contains _all_ the search-critical
>>data.  If it is not in my local memory for the processor using that split
>>block to do the search, performance dies miserably.  Since my split blocks
>>are (at present) just an array of big structures, they are in contiguous
>>memory and they will exist in the local memory of only one processor.  All
>>others will run dog slow, except for the one with the quick access.
>
>On a 2-way Opteron, accessing non-local memory should be at least as fast as
>accessing memory on a single-cpu P4 or Athlon system.  For a 4-way Opteron, it
>still should not be worse, even if it requires 2 hops.

Perhaps.  But don't forget, when you have two CPUs, half the memory _is_ slower
than the other half, by some fixed latency.  A poor algorithm will definitely
perform slower than a good one, because the good one avoids that extra
latency while the poor one hits it all the time.
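
The effect of that fixed extra latency can be put into rough numbers.  The
sketch below is only an illustration: the function name and the latencies in
the usage note are hypothetical, not measurements from any Opteron system.  It
computes the expected cost per memory access when some fraction of accesses
goes to the other processor's memory.

```c
/* Expected latency per memory access, in nanoseconds, when a given fraction
   of accesses goes to remote (non-local) memory.
   All inputs are illustrative assumptions, not measured values. */
double avg_latency_ns(double local_ns, double remote_ns, double remote_fraction)
{
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns;
}
```

With assumed latencies of 100 ns local and 150 ns remote, a program that goes
remote half the time averages 125 ns per access, while one that keeps nearly
everything local stays close to 100 ns.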


>
>>So yes, the hardware and O/S make it work, but it is up to the programmer to
>>make it work _efficiently_.  In my case, I need to distribute "split blocks"
>>across processors, so that each processor has a few in its local memory.  Then
>>when I need to give a processor something to do, I take the performance hit
>>(a short one) to copy from my local split block to his remote split block,
>>but then he runs like blazes with his local copy.  Right now I don't have
>>the first hit, but the second is a killer since only one processor has any
>>local split blocks.
>>
>>That takes a design change to correct.  One that is not needed on a non-NUMA
>>type architecture.  There are other issues that also cause problems, such
>>as sharing data that causes lots of cache transactions to keep things coherent.
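
The design change described above (a few split blocks per processor, with one
copy into the helper's block when work is handed out) might look roughly like
the sketch below.  This is not Crafty's code: the structure, sizes, and
function names are all hypothetical.  Plain calloc() is used only to keep the
sketch portable; it does not guarantee NUMA placement, which a real program
would need an operating-system facility for.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NODES       4   /* hypothetical number of processors/nodes */
#define BLOCKS_PER_NODE 8   /* hypothetical "a few" blocks per node     */

/* Stand-in for a split block; the real structure is far larger. */
typedef struct {
    int in_use;
    int owner_node;
    unsigned char search_state[1024];   /* placeholder for search data */
} split_block;

static split_block *pool[MAX_NODES];    /* one separate pool per node */

/* Allocate one pool per node instead of one contiguous array.
   ASSUMPTION: real code would place pool[n] in node n's local memory
   (NUMA-aware allocator, or first touch by a thread pinned to node n). */
int init_pools(int nodes)
{
    if (nodes < 1 || nodes > MAX_NODES)
        return -1;
    for (int n = 0; n < nodes; n++) {
        pool[n] = calloc(BLOCKS_PER_NODE, sizeof(split_block));
        if (pool[n] == NULL)
            return -1;
        for (int i = 0; i < BLOCKS_PER_NODE; i++)
            pool[n][i].owner_node = n;
    }
    return 0;
}

/* Take a free block from the given node's pool, or NULL if none is left. */
split_block *acquire_block(int node)
{
    if (node < 0 || node >= MAX_NODES || pool[node] == NULL)
        return NULL;
    for (int i = 0; i < BLOCKS_PER_NODE; i++) {
        if (!pool[node][i].in_use) {
            pool[node][i].in_use = 1;
            return &pool[node][i];
        }
    }
    return NULL;
}

/* Hand work to a helper: pay for one remote copy up front, after which
   the helper searches out of a block in its own pool. */
split_block *give_work(const split_block *mine, int helper_node)
{
    split_block *his = acquire_block(helper_node);
    if (his == NULL)
        return NULL;
    memcpy(his->search_state, mine->search_state, sizeof his->search_state);
    return his;
}
```

The sketch omits locking and release of blocks; a parallel search would need
both.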
>
>Cache coherency is just as much a problem on SMP machines as on NUMA ones.


No, it isn't, for the same reasons that NUMA memory access is more problematic
than pure SMP access.  The cache controllers have the _same_ latency issue.  A
cache controller "way over there" takes much longer to "snoop/invalidate"
than one "right next door."

So you run into the _same_ issue again.  The "farther apart" two processors
are, the less you want to share in memory, because cache coherency takes
longer to maintain...
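
One common way to share less, shown here only as a sketch with hypothetical
names, is to give each thread its own counter on its own cache line and
combine the counters only when a total is needed.  The 64-byte line size is an
assumption; it varies by processor.

```c
#define CACHE_LINE  64   /* assumed cache-line size in bytes */
#define MAX_THREADS 4    /* hypothetical thread count        */

/* Each counter is aligned to a cache line (C11 _Alignas), so the struct
   occupies a whole line and no two threads write to the same line. */
typedef struct {
    _Alignas(CACHE_LINE) unsigned long long nodes;
} padded_counter;

static padded_counter nodes_searched[MAX_THREADS];

/* Called by a search thread: touches only that thread's own cache line. */
void count_node(int thread_id)
{
    nodes_searched[thread_id].nodes++;
}

/* Called rarely (for example when printing statistics): this is the only
   place where one thread reads the other threads' lines. */
unsigned long long total_nodes(void)
{
    unsigned long long sum = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        sum += nodes_searched[t].nodes;
    return sum;
}
```

A single shared counter would instead force the line holding it to move
between caches on every increment, which is exactly the traffic that gets
more expensive as the processors get farther apart.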



>
>>>And all of my friends' and my single-cpu boxes have 4-way interleaving also.
>>
>>It seems relatively pointless on a single-cpu machine, since cache is already
>>loaded in "burst mode".  And that's the point of SDRAM/DDRAM/etc, to provide
>>the next N blocks much more quickly than the first block.
>>
>>On a dual or quad, it makes a great deal more sense...
>
>I don't claim to know why they do it, but only that it exists.


