Computer Chess Club Archives


Subject: Re: Intel four-way 2.8 Ghz system is just Amazing ! - Not hardly

Author: Gerd Isenberg

Date: 04:56:11 11/13/03


On November 12, 2003 at 15:18:26, Robert Hyatt wrote:

>On November 12, 2003 at 14:50:32, Russell Reagan wrote:
>
>>I am not sure I understand how NUMA works compared to SMP. I have gotten the
>>impression from previous discussions of NUMA that it is inferior to SMP, but
>>from what I've been able to find on the net, it sounds like they can be
>>complementary to one another, so I'm confused.
>>
>>Are they exclusively different or is NUMA an addition to SMP?
>
>NUMA == Non-Uniform Memory Access
>
>SMP == Symmetric MultiProcessing.
>
>In an SMP box, all processors are connected to memory using the same
>datapath.  To help 4-way boxes avoid memory bottlenecks, most 4-way
>boxes use 4-way memory interleaving, spreading consecutive 8-byte
>chunks of memory across consecutive banks so that the reads can be
>done in parallel.  I.e., a good 4-way box has 4x the potential memory
>bandwidth of a 1-way box, assuming Intel or AMD prior to the Opteron.
>
>In a NUMA box, each processor has local memory, but each can see all
>of the other memory via routers.  The problem comes to light when all
>four processors want to access memory that is local to one CPU.  The
>requests become serialized and that kills performance.  The issue is to
>prevent this from happening.  That's what I had to do for the Compaq
>NUMA-alpha box last year.  That's what Eugene and I re-did for Crafty
>version 19.5 to make it work reasonably on the Opteron NUMA boxes.
>
>Opteron potentially has a higher memory bandwidth.  But, as always, potential
>and practical don't mix.  When all processors try to beat on the memory attached
>to the same physical processor, it produces a huge hot-spot with a very bad
>delay.  Cache coherency has the same problem: when one cache controller holds a
>line of memory and that line gets modified in another cache controller, there is
>a _lot_ of cache-controller to cache-controller "noise" to handle that.  On a
>4-way box, CPU A can quickly address its own local memory.  It can less quickly
>address memory on the two CPUs it is directly connected to.  It is slower still
>to address memory on the last processor, since it has to go through one of the
>two it is directly connected to first, which adds an extra hop.
>The goal is to have everything a single processor needs reside in its local
>memory.  Then you avoid the NUMA overhead for most accesses and it runs like
>a bat out of Egypt.
>
>>
>>In order for a process to take advantage of SMP, it must split the work into two
>>threads. What changes must be made in order for a program to be NUMA aware?
>
>
>See above.  Each processor needs important stuff kept in local memory rather
>than in memory that is one (or more) hops away to another CPU.

Hi Bob,

As an SMP/NUMA novice I have some questions too.
When you do multithreading on N-way Opteron boxes, each thread has its own
local memory for automatic/local variables on its own stack, close to the
processor it runs on.
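
(I assume the way to guarantee that is to pin each search thread to one
processor. The fragment below is only a rough sketch of what I mean; it assumes
the GNU extension pthread_setaffinity_np() is available, and the function name
is mine, nothing from Crafty.)

#define _GNU_SOURCE           /* for pthread_setaffinity_np() */
#include <pthread.h>
#include <sched.h>

void *search_thread(void *arg)
{
  int cpu = *(int *) arg;
  cpu_set_t set;

  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  /* bind this thread to "its" processor; the stack pages are committed on
     first touch, so once it runs pinned here they should end up in that
     processor's local memory */
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

  /* ... run the normal search loop here ... */
  return 0;
}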

What about (shared) global memory? Is it allocated in the local memory of the
processor running the first (main) thread, so that all the other threads have to
access it over the HyperTransport links, with their relatively large latencies?
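
(Or does one normally work around that with "first touch"? The sketch below is
what I imagine; it assumes the OS places a page on the node of the CPU that
first writes it, which I believe is the Linux default, and all the names are
made up.)

#include <stdlib.h>
#include <string.h>

#define NTHREADS 4

static char  *shared_block;   /* one big block seen by all threads */
static size_t slice;          /* bytes "owned" by each thread      */

void reserve_shared(size_t total_bytes)
{
  /* the main thread only reserves the address space; for a large
     allocation the pages are not placed on any node yet */
  shared_block = malloc(total_bytes);
  slice = total_bytes / NTHREADS;
}

void first_touch(int id)
{
  /* called by thread "id" after it is bound to its CPU: writing its slice
     first from here places those pages in this node's local memory */
  memset(shared_block + (size_t) id * slice, 0, slice);
}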

Is there a way to allocate thread-local memory, and to duplicate constant shared
data such as the rotated lookup tables at startup time, so that each thread has
its own copy close by?
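
(Something like this is what I have in mind; only a sketch, assuming Linux's
libnuma is available, where numa_alloc_local() allocates on the node of the
calling thread. The table and structure names are made up.)

#include <numa.h>       /* link with -lnuma */
#include <string.h>

/* hypothetical master copy of one rotated-attack table */
extern const unsigned long long rook_attacks_r90[64][256];

typedef struct {
  unsigned long long (*rook_r90)[256];   /* this thread's private copy */
} thread_tables_t;

void copy_tables_local(thread_tables_t *t)
{
  size_t bytes = sizeof(rook_attacks_r90);

  /* allocate on the node this thread is currently running on, then
     duplicate the constant data so lookups never cross HyperTransport */
  t->rook_r90 = numa_alloc_local(bytes);
  memcpy(t->rook_r90, rook_attacks_r90, bytes);
}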

What about transposition tables and NUMA?
Would a two-level table approach make sense, e.g. one huge shared main
transposition table, accessed by the search threads only if the remaining search
depth is greater than or equal to some threshold, plus several smaller,
non-shared local hash tables?
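
(Roughly like the probe below; just illustrative C to show the idea, with every
name and the threshold made up. My thought is that the shallow probes are by far
the most frequent, so most accesses would stay in local memory, while the deep
entries that matter most are still shared by all threads.)

#define SHARED_DEPTH_MIN 4    /* use the shared table only at depth >= 4 plies */

typedef struct {
  unsigned long long key;
  int depth, score, flags;
} hash_entry_t;

extern hash_entry_t   shared_tt[];      /* one huge table, seen by all threads */
extern unsigned long  shared_tt_mask;
extern hash_entry_t  *local_tt;         /* small per-thread, node-local table  */
extern unsigned long  local_tt_mask;

hash_entry_t *probe_tt(unsigned long long key, int depth)
{
  hash_entry_t *e;

  if (depth >= SHARED_DEPTH_MIN)
    e = &shared_tt[key & shared_tt_mask];   /* may cost a HyperTransport hop */
  else
    e = &local_tt[key & local_tt_mask];     /* always in this node's memory  */

  return (e->key == key) ? e : 0;
}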

Thanks in advance,
Gerd


