Computer Chess Club Archives



Subject: Re: Intel four-way 2.8 Ghz system is just Amazing ! - Not hardly

Author: Russell Reagan

Date: 13:16:49 11/12/03



On November 12, 2003 at 15:18:26, Robert Hyatt wrote:

>In a SMP box, all processors are connected to memory using the same
>datapath.  To help 4-way boxes avoid memory bottlenecks, most 4-way
>boxes use 4-way memory interleaving, spreading consecutive 8-byte
>chunks of memory across consecutive banks so that the reads can be
>done in parallel.  IE a good 4-way box has 4x the potential memory
>bandwidth of a 1-way box, assuming Intel or AMD prior to the Opteron.


Is it the motherboard that determines whether an SMP box is a "good" SMP box
(i.e., a 4-way box having 4x the memory bandwidth)?

Let me see if I understand now. NUMA is used on SMP systems, and the two are not
mutually exclusive. I was under the impression from previous conversations here
that you could have an SMP box or a NUMA box, but not both, and that is not
true. The problem is that it is very expensive to build a true SMP box (Nx
bandwidth for N CPUs) when you have many CPUs. NUMA is just a cheaper way to
handle the problem of all CPUs sharing the main memory bus. Is this correct?



>In a NUMA box, each processor has local memory, but each can see all
>of the other memory via routers.  The problem comes to light when all
>four processors want to access memory that is local to one CPU.  The
>requests become serialized and that kills performance.  The issue is to
>prevent this from happening.  That's what I had to do for the Compaq
>NUMA-alpha box last year.  That's what Eugene and I re-did for Crafty
>version 19.5 to make it work reasonably on the Opteron NUMA boxes.


Are all Opteron systems with multiple CPUs NUMA, or are there some that are true
SMP?



>Opteron potentially has a higher memory bandwidth.  But, as always, potential
>and practical don't mix.  When all processors try to beat on the memory attached
>to the same physical processor, it produces a huge hot-spot with a very bad
>delay.  Cache coherency has the same problems, as when I have a line of memory
>in one cache controller, and that gets modified in another cache controller,
>then we have a _lot_ of cache-controller to cache-controller "noise" to handle
>that.  On a 4-way box, CPU A can quickly address its own local memory.  It
>can not-so-quickly address memory on the two CPUs it is directly connected to.
>It can even-slower address memory on the last processor as it has to go through
>one of the two it is directly connected with first, which adds an extra hop.
>The goal is to have everything a single processor needs reside in its local
>memory.  Then you avoid the NUMA overhead for most accesses and it runs like
>a bat out of Egypt.


Okay, that makes sense. But how can you ensure that everything a processor needs
will be in its local memory? If you start up multiple threads, I thought there
was no guarantee about which thread would run on which CPU. I thought that was
up to the OS, which could move them between CPUs as it saw fit.

Assuming you can keep threads on the same CPUs, does the task then become to
keep the data used by that thread small enough to stay in the CPU's local
memory? Or do you make a copy of the data for each CPU?
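One common way to get locality without explicit per-CPU copies is the OS's "first touch" page placement, which is the Linux default: a physical page is allocated on the node of the CPU that first writes it. A sketch of the idea (my own illustration; I don't know whether this is what Crafty 19.5 actually does):

```c
#include <stdlib.h>
#include <string.h>

/* Under a first-touch policy, malloc only reserves address space; the
 * physical page is placed on the node of the CPU that first writes it.
 * So each (pinned) thread should allocate AND zero its own working set,
 * rather than having one master thread initialize everything up front. */
void *alloc_and_touch(size_t bytes)
{
    void *p = malloc(bytes);
    if (p != NULL)
        memset(p, 0, bytes);    /* first write -> pages land locally */
    return p;
}
```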

Is the local CPU memory the same thing as the CPU cache?

Is there still only one main memory bus on a NUMA machine? Or is the main memory
split into the local CPU memory and accessed via the NUMA stuff?




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.