Computer Chess Club Archives



Subject: Re: Intel four-way 2.8 Ghz system is just Amazing ! - Not hardly

Author: Robert Hyatt

Date: 12:18:26 11/12/03



On November 12, 2003 at 14:50:32, Russell Reagan wrote:

>I am not sure I understand how NUMA works compared to SMP. I have gotten the
>impression from previous discussions of NUMA that it is inferior to SMP, but
>from what I've been able to find on the net, it sounds like they can be
>complementary to one another, so I'm confused.
>
>Are they exclusively different or is NUMA an addition to SMP?

NUMA == Non-Uniform Memory Access

SMP == Symmetric MultiProcessing.

In an SMP box, all processors are connected to memory using the same
datapath.  To help 4-way boxes avoid memory bottlenecks, most 4-way
boxes use 4-way memory interleaving, spreading consecutive 8-byte
chunks of memory across consecutive banks so that the reads can be
done in parallel.  I.e., a good 4-way box has 4x the potential memory
bandwidth of a 1-way box, assuming Intel or AMD prior to the Opteron.
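
As a rough illustration of the interleaving described above, here is a minimal C sketch that maps an address to a bank. The names (CHUNK_BYTES, NUM_BANKS, bank_of, banks_touched) are made up for this example, and real chipsets may pick the bank from different address bits; the 8-byte chunk and 4 banks simply follow the numbers in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative parameters taken from the description above. */
#define CHUNK_BYTES 8u   /* size of one interleaved chunk */
#define NUM_BANKS   4u   /* 4-way interleaving */

/* Return the bank that holds the chunk containing `addr`. */
static unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr / CHUNK_BYTES) % NUM_BANKS);
}

/* Count how many distinct banks a sequential read of `len` bytes
   starting at `addr` touches (never more than NUM_BANKS). */
static unsigned banks_touched(uint64_t addr, uint64_t len)
{
    unsigned seen[NUM_BANKS] = {0};
    unsigned count = 0;

    assert(len > 0);
    uint64_t first = addr / CHUNK_BYTES;
    uint64_t last  = (addr + len - 1) / CHUNK_BYTES;

    for (uint64_t c = first; c <= last && count < NUM_BANKS; c++) {
        unsigned b = (unsigned)(c % NUM_BANKS);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

With these parameters, a 32-byte sequential read starting on a chunk boundary lands on all four banks, which is why the four reads can overlap.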

In a NUMA box, each processor has local memory, but each can see all
of the other memory via routers.  The problem comes to light when all
four processors want to access memory that is local to one CPU.  The
requests become serialized and that kills performance.  The issue is to
prevent this from happening.  That's what I had to do for the Compaq
NUMA-alpha box last year.  That's what Eugene and I re-did for Crafty
version 19.5 to make it work reasonably on the Opteron NUMA boxes.

Opteron potentially has a higher memory bandwidth.  But, as always, potential
and practical don't mix.  When all processors try to beat on the memory attached
to the same physical processor, it produces a huge hot-spot with a very bad
delay.  Cache coherency has the same problem: when a line of memory is held
in one cache controller and gets modified in another cache controller,
there is a _lot_ of cache-controller to cache-controller "noise" to handle
that.  On a 4-way box, CPU A can quickly address its own local memory.  It
can not-so-quickly address memory on the two CPUs it is directly connected to.
It is slower still to address memory on the last processor, as the request has
to go through one of the two directly connected CPUs first, which adds an extra
hop.  The goal is to have everything a single processor needs reside in its
local memory.  Then you avoid the NUMA overhead for most accesses and it runs
like a bat out of Egypt.
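
The hop counts described above can be sketched in C as follows. This assumes the four CPUs form a ring numbered 0-1-2-3-0, which is only an illustrative numbering; hops and total_hops are hypothetical helper names. The model counts hops only and does not capture the serialization delay at a hot-spot.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_CPUS 4

/* Hops from `cpu` to memory attached to `home`, assuming the four
   CPUs form a ring 0-1-2-3-0 (an assumption for this sketch). */
static int hops(int cpu, int home)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    assert(home >= 0 && home < NUM_CPUS);
    int d = abs(cpu - home);
    return d < NUM_CPUS - d ? d : NUM_CPUS - d;
}

/* Total hops when every CPU makes one access to the memory that is
   attached to the CPU named in home_of[cpu]. */
static int total_hops(const int home_of[NUM_CPUS])
{
    int sum = 0;
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        sum += hops(cpu, home_of[cpu]);
    return sum;
}
```

If every CPU reads from CPU 0's memory, home_of is {0, 0, 0, 0} and the total is 0 + 1 + 2 + 1 hops; if each CPU reads its own memory, home_of is {0, 1, 2, 3} and no hops are needed at all.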

>
>In order for a process to take advantage of SMP, it must split the work into two
>threads. What changes must be made in order for a program to be NUMA aware?


See above.  Each processor needs important stuff kept in local memory rather
than in memory that is one (or more) hops away to another CPU.
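
One common way to get that effect is to have each thread allocate and initialize its own data, sketched below in C. This is not the code from Crafty; worker_state, worker_init_local and worker_free are hypothetical names, and the sketch assumes an operating system that places a page on the node of the thread that first touches it (the usual default on Linux) and that the thread has already been bound to its CPU.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread state; the fields are placeholders. */
typedef struct {
    int            cpu;      /* CPU this worker is meant to run on */
    size_t         nbytes;   /* size of the private buffer */
    unsigned char *scratch;  /* buffer used only by this worker */
} worker_state;

/* Meant to be called from the worker thread itself, after it has been
   bound to `cpu`.  Under a first-touch policy (an assumption about
   the OS), the memset below is what decides where the pages live.
   Returns 0 on success, -1 if the allocation fails. */
static int worker_init_local(worker_state *w, int cpu, size_t nbytes)
{
    assert(w != NULL && nbytes > 0);
    w->cpu = cpu;
    w->nbytes = nbytes;
    w->scratch = malloc(nbytes);
    if (w->scratch == NULL)
        return -1;
    memset(w->scratch, 0, nbytes);  /* first touch happens here */
    return 0;
}

/* Release the worker's private buffer. */
static void worker_free(worker_state *w)
{
    assert(w != NULL);
    free(w->scratch);
    w->scratch = NULL;
}
```

The thing to avoid is the opposite pattern, where the main thread allocates and zeroes every worker's buffer up front, since that tends to put all of the data in the memory of whichever CPU the main thread ran on.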




Last modified: Thu, 15 Apr 21 08:11:13 -0700
