Computer Chess Club Archives



Subject: Re: Crafty and NUMA

Author: Vincent Diepeveen

Date: 15:08:40 09/02/03



On September 02, 2003 at 00:14:02, Jeremiah Penery wrote:

>On September 01, 2003 at 23:23:18, Robert Hyatt wrote:
>
>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>
>>>Any large (multi-node) SMP machine will have the same problem as NUMA with
>>>respect to inter-node latency.  SMP doesn't magically make node-to-node
>>>communication any faster.
>>
>>Actually it does.  SMP means symmetric.
>>
>>NUMA is _not_ symmetric.
>
>Of course.  The acronym means "non uniform memory access".
>
>But if you think "symmetric" necessarily means "faster", maybe you'd better look
>in a dictionary.

You're wrong by a factor of 2 or so in latency, and by up to a factor of 5 for 128 CPUs.

16-processor Alpha/Sun: $10 million
64-processor Itanium2:   $1 million

Why would the price difference be that large?
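The gap is easier to see per CPU. A quick check of the arithmetic, using only the two prices quoted above (taken as stated in the post, not independently verified):

```python
# Prices as quoted in the post, in US dollars: (price, number of CPUs).
systems = {
    "16-processor Alpha/Sun": (10_000_000, 16),
    "64-processor Itanium2": (1_000_000, 64),
}

def cost_per_cpu(price, cpus):
    # Plain division: total system price spread evenly over its CPUs.
    return price / cpus

per_cpu = {name: cost_per_cpu(p, n) for name, (p, n) in systems.items()}
for name, cost in per_cpu.items():
    print(f"{name}: ${cost:,.0f} per CPU")
# 16-processor Alpha/Sun: $625,000 per CPU
# 64-processor Itanium2: $15,625 per CPU

ratio = per_cpu["16-processor Alpha/Sun"] / per_cpu["64-processor Itanium2"]
print(f"per-CPU price ratio: {ratio:.0f}x")  # per-CPU price ratio: 40x
```

So per processor, the shared-bus machine costs about 40 times as much as the cc-NUMA one.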

That 64-processor SGI Altix3000 has the best latency of any cc-NUMA machine:
2-3 us.

Here is an 8-processor latency run on the Altix3000, which I did very early
yesterday morning. VERY EARLY :)

With just 400 MB of hash per CPU:
 Average measured read read time at 8 processes = 1039.312012 ns

With just 400 MB of hash per CPU:
 Average measured read read time at 16 processes = 1207.127808 ns

That is still very good latency. Here SGI is simply superior to other vendors:
their cheap cc-NUMA machines have much better latency at low processor counts.
Latencies might have been slightly lower under IRIX instead of the Linux
2.4.20 kernel with SGI extensions enabled, though I'm not sure.

Still, you can see that even with the latest and most modern hardware, one
can't get latency under 1 us when using cc-NUMA.

Consider the hardware: each brick has 2 duals, and each dual is connected by a
direct link to the other dual on the same brick.

So you can think of a brick as a kind of quad.

SGI,  4 CPUs: latency =  280 ns (measured on TERAS, an Origin3800)
SGI,  8 CPUs: latency =    1 us (Altix3000)
SGI, 16 CPUs: latency =  1.2 us (Altix3000)

However, on a modern 8- or 16-CPU shared-bus machine the latency will never be
worse than 600 ns, whereas for cc-NUMA it keeps going up and up.
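To put the SGI figures next to the 600 ns shared-bus worst case, here is a small calculation using only numbers given in this post. Because 600 ns is an upper bound for the bus machine, each ratio is a lower bound on how much slower cc-NUMA is; note also that the 4-CPU figure comes from a different machine (the Origin3800) than the 8- and 16-CPU runs:

```python
# Latencies quoted in this post, in nanoseconds.
CC_NUMA_NS = {
    4: 280.0,         # TERAS, Origin3800
    8: 1039.312012,   # Altix3000, 8 processes
    16: 1207.127808,  # Altix3000, 16 processes
}
# The post gives 600 ns as the worst case for an 8- or 16-CPU shared bus.
SHARED_BUS_WORST_NS = 600.0

def penalty_vs_bus(latency_ns, bus_ns=SHARED_BUS_WORST_NS):
    # bus_ns is an upper bound on bus latency, so this ratio is a
    # lower bound on the true cc-NUMA / shared-bus latency ratio.
    return latency_ns / bus_ns

for cpus, ns in CC_NUMA_NS.items():
    print(f"{cpus:>2} CPUs: {ns / 1000:.2f} us, "
          f"{penalty_vs_bus(ns):.2f}x the bus worst case")
#  4 CPUs: 0.28 us, 0.47x the bus worst case
#  8 CPUs: 1.04 us, 1.73x the bus worst case
# 16 CPUs: 1.21 us, 2.01x the bus worst case
```

At 16 CPUs the cc-NUMA latency is already at least twice the bus worst case, which is the "factor 2 or so" claimed at the top.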

In fact, a 512-processor cc-NUMA machine has latency only 2 times lower than a
cluster. The advantage is that with a cluster you must use the MPI library,
whereas on the great SGI machine I don't use it: I simply allocate shared memory
and communicate through it with my own code. You could call it OpenMP, of
course, but it is simply low-level parallelism.
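A minimal sketch of the general idea of allocating shared memory and communicating through it directly instead of passing messages with MPI. It is written in Python purely for illustration and is not the author's code (which the post does not show); it assumes a POSIX system because it uses the "fork" start method, and the worker's computation is a hypothetical placeholder:

```python
import multiprocessing as mp

def worker(slots, index):
    # Hypothetical stand-in for real work. Each process writes only
    # its own slot in the shared array, so no lock is needed.
    slots[index] = index * index

def run(n_workers=4):
    # "fork" start method (POSIX only): children inherit the shared mapping.
    ctx = mp.get_context("fork")
    # lock=False returns a raw ctypes array of 64-bit ints allocated in
    # shared memory, zero-initialized; no message passing is involved.
    slots = ctx.Array("q", n_workers, lock=False)
    procs = [ctx.Process(target=worker, args=(slots, i)) for i in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # The parent reads the results straight out of the shared memory.
    return list(slots)

if __name__ == "__main__":
    print(run())  # [0, 1, 4, 9]
```

Because each process writes only its own slot, this sketch needs no locks; a table that several processes update concurrently would need some form of synchronization.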

The big advantage of cc-NUMA is that you can run jobs of, say, around 32
processors with a worst-case latency of just 2 us, provided the OS schedules
well.







Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.