Computer Chess Club Archives



Subject: Re: The need to unmake move

Author: Robert Hyatt

Date: 10:06:34 09/03/03



On September 03, 2003 at 00:07:37, Jeremiah Penery wrote:

>On September 02, 2003 at 22:54:49, Robert Hyatt wrote:
>
>>Maybe.  But I use threads.  And on NUMA threads are _bad_.  One example,
>>do you _really_ want to _share_ all the attack bitmap stuff?  That means it
>>is in one processor's local memory, but will be slow for all others.  What
>>about the instructions?  Same thing.
>
>After some thinking, it seems to me that the *average* memory access speed will
>be the same no matter where the data is placed, for anything intended to be
>shared between all processors (in a small NUMA configuration).


The point for the "Crafty algorithm" is that I rarely share things among
_all_ processors, except for the transposition/refutation table and pawn
hash table.

Split blocks are shared, but the idea is not so easy to explain.  To try:

When a single processor is searching, and notices that there are idle
processors, it takes its own split block, and copies the data to N new
split blocks, one per processor.  For all normal searching, each processor
uses only its own split block, except at the position where the split
occurred.  There the parent split block is accessed by all threads to get
the next move to search.  That is not a very frequent access, so the
non-local penalty there is acceptable.  But for the _rest_ of the work
each processor does, I use a local split block for each so that they run
at max speed.  That was the main change...

Without that "fix" it ran very poorly.  There was so much non-local memory
traffic that performance was simply bad.  With the fix, things worked much
better.



>  The reason for
>this is because what is local to one processor will be non-local to all others.
>It doesn't matter if everything is local to the same processor or spread around,
>because the same percentage of total accesses will be non-local in any case
>(unless there is a disparity between the number of accesses each CPU is trying
>to accomplish).

That is _the_ point.  A single processor spends 99.99999% of its time accessing
its local "tree" structure.  It spends the other 0.00001% of the time accessing
the shared tree structure to get the next move at the split point.  If you make
_all_ the memory accesses non-local except for the one lucky processor that
has it all local, then things go bad quickly.  You have a hot spot, and the
"bandwidth" Vincent likes to ramble about won't help one bit.  If every
processor beats on one processor's local memory, bandwidth is _very_ low.


>
>The only problem is that one processor's memory banks might get hammered, but
>that _is_ the same with an (similarly small) SMP configuration - all accesses go
>serially through one memory controller.

Yes, but that memory controller has about 4x the bandwidth of a one-CPU
memory controller due to 4-way memory interleaving, so the loss is not
significant.  8-way boxes (say, the 8-way Dell Xeon server) have a
significant problem, as they don't try to recover by using 8-way
interleaving.

>
>As machine size increases, of course, NUMA can run into more problems.  But then
>SMP has its own problems as well (cost and complexity of memory sub-system,
>mostly).

This is _all_ about price and scalability.  NUMA scales well, both in terms
of price per processor and bandwidth per processor.  But it does have some
significant "issues" that make programming more complex and less efficient.
Pure SMP boxes don't have the performance issues, but they don't scale to
large numbers of processors, either physically or with respect to price.





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.