Computer Chess Club Archives


Subject: Re: Intel four-way 2.8 Ghz system is just Amazing ! - Not hardly

Author: Robert Hyatt

Date: 07:21:08 11/13/03



On November 13, 2003 at 07:56:11, Gerd Isenberg wrote:

>On November 12, 2003 at 15:18:26, Robert Hyatt wrote:
>
>>On November 12, 2003 at 14:50:32, Russell Reagan wrote:
>>
>>>I am not sure I understand how NUMA works compared to SMP. I have gotten the
>>>impression from previous discussions of NUMA that it is inferior to SMP, but
>>>from what I've been able to find on the net, it sounds like they can be
>>>complementary to one another, so I'm confused.
>>>
>>>Are they exclusively different or is NUMA an addition to SMP?
>>
>>NUMA == Non-Uniform Memory Access
>>
>>SMP == Symmetric MultiProcessing.
>>
>>In an SMP box, all processors are connected to memory using the same
>>datapath.  To avoid memory bottlenecks, most 4-way boxes use 4-way
>>memory interleaving, spreading consecutive 8-byte
>>chunks of memory across consecutive banks so that the reads can be
>>done in parallel.  IE a good 4-way box has 4x the potential memory
>>bandwidth of a 1-way box, assuming Intel or AMD prior to the Opteron.
>>
>>In a NUMA box, each processor has local memory, but each can see all
>>of the other memory via routers.  The problem comes to light when all
>>four processors want to access memory that is local to one CPU.  The
>>requests become serialized and that kills performance.  The issue is to
>>prevent this from happening.  That's what I had to do for the Compaq
>>NUMA-alpha box last year.  That's what Eugene and I re-did for Crafty
>>version 19.5 to make it work reasonably on the Opteron NUMA boxes.
>>
>>Opteron potentially has a higher memory bandwidth.  But, as always, potential
>>and practical don't mix.  When all processors try to beat on the memory attached
>>to the same physical processor, it produces a huge hot-spot with a very bad
>>delay.  Cache coherency has the same problem: when I have a line of memory
>>in one cache controller and it gets modified in another cache controller,
>>there is a _lot_ of cache-controller to cache-controller "noise" to handle
>>that.  On a 4-way box, CPU A can quickly address its own local memory.  It
>>can not-so-quickly address memory on the two CPUs it is directly connected to.
>>It is slower still addressing memory on the last processor, as it has to go
>>through one of the two it is directly connected with first, which adds an
>>extra hop.
>>The goal is to have everything a single processor needs reside in its local
>>memory.  Then you avoid the NUMA overhead for most accesses and it runs like
>>a bat out of Egypt.
>>
>>>
>>>In order for a process to take advantage of SMP, it must split the work into two
>>>threads. What changes must be made in order for a program to be NUMA aware?
>>
>>
>>See above.  Each processor needs important stuff kept in local memory rather
>>than in memory that is one (or more) hops away on another CPU.
>
>Hi Bob,
>
>As an SMP/NUMA novice I have some questions too.
>When you do multithreading on Opteron N-way boxes, each thread has its own
>local memory for automatic / local variables on its own stack close to the
>processor.

Correct.  This is something Eugene fixed.  He found a way to first say
"this thread must run only on this processor" and he does that before the
thread ever starts, so that local data (the stack in your example) gets
allocated on that processor's local memory.
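
Roughly, the idea looks like this (just a sketch using Linux pthreads
affinity, not Eugene's actual code; search_thread() is a made-up name):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

extern void *search_thread(void *arg);   /* hypothetical worker function */

int start_pinned_thread(pthread_t *tid, int cpu_id, void *arg)
{
    pthread_attr_t attr;
    cpu_set_t cpus;

    CPU_ZERO(&cpus);
    CPU_SET(cpu_id, &cpus);

    pthread_attr_init(&attr);
    /* set the affinity on the attribute, so the thread never runs
       anywhere else, not even for its very first instruction */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    return pthread_create(tid, &attr, search_thread, arg);
}

On Windows the equivalent would be setting the thread's affinity mask
(SetThreadAffinityMask()) before the thread touches any of its data.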


>
>What about (shared) global memory? Is it allocated in the local memory of the
>processor attached to the first (main) thread, and therefore other threads have
>to access it via hypertransport protocol with those relative huge latencies?

It is allocated (at least in Windows I believe) on the local memory of the
processor doing the malloc() (or shmget() or whatever you choose to use).
The trick (for Crafty) was to put off creating the "split blocks" until a
thread was spawned, letting that thread malloc() its own split blocks, which
puts them in that particular processor's local memory.  Of course setting
processor affinity is critical to keep the thread on that processor.
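
In code the trick is roughly this (illustrative names and sizes, not
Crafty's actual split-block code):

#include <stdlib.h>
#include <string.h>

#define SPLIT_BLOCKS 64

typedef struct split_block { char data[4096]; } split_block_t;

split_block_t *thread_blocks[64];        /* one slot per thread id */

/* called by each worker thread itself, after it is already pinned */
void allocate_split_blocks(int thread_id)
{
    split_block_t *blocks = malloc(SPLIT_BLOCKS * sizeof(*blocks));

    /* touch every page now, so the OS commits the pages on the node
       where this (pinned) thread is running ("first touch") */
    memset(blocks, 0, SPLIT_BLOCKS * sizeof(*blocks));

    thread_blocks[thread_id] = blocks;
}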






>
>Is there a way to allocate thread local memory, and to duplicate constant shared
>data like rotated lookup tables at startup time, to have it close to each thread
>using it?

Duplicating things automatically isn't doable.  Of course you could create
the initial rotated lookup tables as always, then have each thread malloc()
and copy the initial tables to local memory.  I talked this over with Eugene
and he felt that the L2 cache size on the Opterons and IA64 processors is
large enough that this is not necessary.  The same thing could be done
for the executable code, which makes starting separate instances of the
executable favorable for NUMA (as opposed to fork(), which does not duplicate
executable pages at all).  The idea is the same: get all the code into a
processor's local memory to keep instruction fetches off the routers.
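
If you did want to duplicate the constant tables, each pinned thread could
simply copy them into memory it allocated itself, something like this (the
table name and size are only illustrative):

#include <stdlib.h>
#include <string.h>

#define ATTACK_ENTRIES (64 * 256)

extern unsigned long long rook_attacks[ATTACK_ENTRIES];  /* built once at startup */

unsigned long long *make_local_attack_copy(void)
{
    unsigned long long *local = malloc(sizeof(rook_attacks));

    /* the copy touches every page, so first-touch puts the duplicate
       on the node this thread is pinned to */
    memcpy(local, rook_attacks, sizeof(rook_attacks));
    return local;
}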






>
>What about transposition tables and NUMA?

Yet another problem.  We are simply distributing the hash across all
processors equally.  There is probably a better approach, but this does
work.  And if you eliminate the other router traffic, hopefully the
transposition lookups won't be a bottleneck.
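
A rough sketch of the even-distribution idea (the types and sizes are
placeholders, not Crafty's actual hash layout):

#include <stdint.h>
#include <stdlib.h>

#define NUM_NODES        4
#define ENTRIES_PER_NODE (1u << 20)

typedef struct { uint64_t key; uint64_t data; } hash_entry_t;

hash_entry_t *hash_slice[NUM_NODES];

/* each pinned worker allocates one slice, so that slice's pages end
   up in its own local memory */
void allocate_hash_slice(int node)
{
    hash_slice[node] = calloc(ENTRIES_PER_NODE, sizeof(hash_entry_t));
}

/* part of the Zobrist key picks the node, the rest picks the entry,
   so probes spread evenly over all processors */
hash_entry_t *probe_hash(uint64_t zobrist_key)
{
    unsigned node  = (unsigned)(zobrist_key % NUM_NODES);
    unsigned index = (unsigned)((zobrist_key / NUM_NODES) % ENTRIES_PER_NODE);

    return &hash_slice[node][index];
}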



>May one consider a two-level table approach, e.g. one huge shared main
>transposition table, only accessed by search threads if the remaining search
>depth is greater than or equal to some threshold, plus several smaller local
>non-shared hash tables?
>

Anything is possible.  Perhaps a small table that you probe first, then
a bigger table you probe second is an idea.  Or a more intelligent way of
distributing the tree to force probes to fall into your local memory more
often.  Lots of room for playing around here.  I've hardly thought of all
the possible solutions.  Heck, I haven't even thought of all the problems,
yet.  :)
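
The small-table-first idea might look roughly like this (the threshold,
sizes and entry layout are all placeholders):

#include <stdint.h>

#define DEPTH_THRESHOLD 4

typedef struct { uint64_t key; int depth; int score; } tt_entry_t;

extern tt_entry_t  local_table[1 << 16];   /* small, per-thread, node-local */
extern tt_entry_t *shared_table;           /* huge, shared, possibly remote */
extern uint64_t    shared_table_mask;

int probe_two_level(uint64_t key, int depth, int *score)
{
    tt_entry_t *e = &local_table[key & 0xFFFF];

    if (e->key == key && e->depth >= depth) {   /* cheap local probe first */
        *score = e->score;
        return 1;
    }
    if (depth >= DEPTH_THRESHOLD) {             /* pay the remote latency only
                                                   when the depth justifies it */
        e = &shared_table[key & shared_table_mask];
        if (e->key == key && e->depth >= depth) {
            *score = e->score;
            return 1;
        }
    }
    return 0;
}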






>Thanks in advance,
>Gerd


