Computer Chess Club Archives



Subject: Re: Intel four-way 2.8 Ghz system is just Amazing ! - Not hardly

Author: Robert Hyatt

Date: 15:09:31 11/12/03



On November 12, 2003 at 16:16:49, Russell Reagan wrote:

>On November 12, 2003 at 15:18:26, Robert Hyatt wrote:
>
>>In a SMP box, all processors are connected to memory using the same
>>datapath.  To help 4-way boxes avoid memory bottlenecks, most 4-way
>>boxes use 4-way memory interleaving, spreading consecutive 8-byte
>>chunks of memory across consecutive banks so that the reads can be
>>done in parallel.  IE a good 4-way box has 4x the potential memory
>>bandwidth of a 1-way box, assuming Intel or AMD prior to the Opteron.
>
>
>Is it the motherboard that determines whether an SMP box is a "good" SMP box (IE
>4-way box having 4x memory bandwidth)?

Yes and no.  Actually it is the "chipset" that supports the SMP
configuration.  Intel has done several, as have a few others.  Cheap
duals don't do interleaving; good duals do, but they are also more
expensive.


>
>Let me see if I understand now. NUMA is used on SMP systems, and the two are not
>exclusive. I was under the impression from previous conversations here that you
>could have an SMP box, or a NUMA box, but not both, and this is not true. The
>problem is that it is very expensive to create a true SMP box (Nx bandwidth for
>N CPUs) when you have many CPUs. NUMA is just a cheaper way to handle the
>problem of all CPUs sharing the main memory bus. Is this correct?


It depends on how you use SMP.  Remember that the "S" stands for
"symmetric" which means everything is equal between the processors.  In
a NUMA box, this is not exactly true.  All processors might be able to
initiate I/O.  All might be able to field interrupts.  But memory latency
varies depending on the memory address you try to grab and how far it is
away from you through the routers.

I would not call a NUMA box "SMP" myself, since everything is not symmetric.
But if you ignore the memory latency problem, you could use the two terms
interchangeably, although it would confuse many (including myself).





>
>
>
>>In a NUMA box, each processor has local memory, but each can see all
>>of the other memory via routers.  The problem comes to light when all
>>four processors want to access memory that is local to one CPU.  The
>>requests become serialized and that kills performance.  The issue is to
>>prevent this from happening.  That's what I had to do for the Compaq
>>NUMA-alpha box last year.  That's what Eugene and I re-did for Crafty
>>version 19.5 to make it work reasonably on the Opteron NUMA boxes.
>
>
>Are all Opteron systems with multiple CPUs NUMA, or are there some that are true
>SMP?

They are all NUMA.  That is the idea behind the Opteron, in fact.  Any
multiple-cpu Opteron is going to be NUMA.




>
>
>
>>Opteron potentially has a higher memory bandwidth.  But, as always, potential
>>and practical don't mix.  When all processors try to beat on the memory attached
>>to the same physical processor, it produces a huge hot-spot with a very bad
>>delay.  Cache coherency has the same problems, as when I have a line of memory
>>in one cache controller, and that gets modified in another cache controller,
>>then we have a _lot_ of cache-controller to cache-controller "noise" to handle
>>that.  On a 4-way box, CPU A can quickly address its own local memory.  It
>>can not-so-quickly address memory on the two CPUs it is directly connected to.
>>It can even-slower address memory on the last processor as it has to go through
>>one of the two it is directly connected with first, which adds an extra hop.
>>The goal is to have everything a single processor needs reside in its local
>>memory.  Then you avoid the NUMA overhead for most accesses and it runs like
>>a bat out of Egypt.
>
>
>Okay, that makes sense. But how can you ensure that everything a processor needs
>will be in its local memory? If you start up multiple threads, I thought there
>was no guarantee about which thread would run on which CPU. I thought that was
>up to the OS, which could swap them out as it saw fit.
>

You have to jump through some hoops, and also hope that the O/S helps.  On
Windows, Eugene is using the set-processor-affinity mechanism to lock a
thread to a particular CPU.  Then that thread malloc's its stuff so that it
ends up in that CPU's local memory, even though it can still be shared with
others.  As you might guess, this makes programming a NUMA box different.
IE my quads don't have this issue, since I am running the same code in the
same virtual address space on all threads, and all CPUs can access any part
of memory equally efficiently.

But as the Dodge commercials used to say, "the rules have changed" when you
talk about NUMA platforms.  And suddenly you have to start worrying about
issues that ought to be abstracted away into some hardware support.  But
with NUMA that support has to be within your program for optimal results.





>Assuming you can keep threads on the same CPUs, does the task then become to
>keep the data used by that thread small enough to stay in the CPU's local
>memory? Or do you make a copy of the data for each CPU?

You could do either.  Remember that each cpu has 1/nth of the total memory.
Size is not really an issue.  A quad might have 4 gigs of RAM to start,
meaning each CPU has 1 gig that is local, 2 gigs that are slower than
local, and 1 gig that is 2x slower than the first non-local memory I
mentioned.

You just have to look at the algorithm and put the important data on that
local memory.  I do this with my split blocks in Crafty...





>
>Is the local CPU memory the same thing as the CPU cache?

No.  Each CPU still has a cache.  But the memory is distributed across
all processors.  For a dual opteron with 4 gigs of ram, each CPU has two
gigs of local memory.  It can see/access the other two gigs, but at an
increased latency.

For a quad, you have 1/4 local, 1/2 close, 1/4 way remote.

For an 8-way box, you have 4 latencies, local, 1 hop, 2 hops and 3 hops.
You better not put often-used stuff in the 2-3 hop memory.



>
>Is there still only one main memory bus on a NUMA machine? Or is the main memory
>split into the local CPU memory and accessed via the NUMA stuff?

The idea is that you have a CPU connected to a "router" which is
connected to the local memory for that CPU, plus the router is
connected to other routers for other processors.  Your local router can
access your local memory and give it to you quickly.  To access memory on
other processors requires that you ask your router for the memory, and it
has to ask a router it can reach to either give it the value or forward the
request on to a router that is closer, until the request finally arrives
at the router connected to that memory.  It's all those "hops" that
kill performance.  So you just have to understand that shortcoming of the
NUMA architecture and work around it.  The up-side of NUMA is that it is
very hard to scale a true SMP box beyond 4 cpus.  Intel did it with their
Profusion chipset a couple of years back, but their machine looks like two
4-way boxes coupled with a kludge, and it doesn't perform very well, as
memory is still 4-way interleaved but with 8 processors demanding data.
The NUMA approach scales better, cost-wise, but there is a performance
issue that must be addressed.





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.