Computer Chess Club Archives



Subject: Re: Crafty and NUMA

Author: Robert Hyatt

Date: 08:59:22 09/03/03



On September 03, 2003 at 09:21:34, Vincent Diepeveen wrote:

>On September 02, 2003 at 22:45:04, Robert Hyatt wrote:
>
>>On September 02, 2003 at 18:08:40, Vincent Diepeveen wrote:
>>
>>>On September 02, 2003 at 00:14:02, Jeremiah Penery wrote:
>>>
>>>>On September 01, 2003 at 23:23:18, Robert Hyatt wrote:
>>>>
>>>>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>>>>
>>>>>>Any large (multi-node) SMP machine will have the same problem as NUMA with
>>>>>>respect to inter-node latency.  SMP doesn't magically make node-to-node
>>>>>>communication any faster.
>>>>>
>>>>>Actually it does.  SMP means symmetric.
>>>>>
>>>>>NUMA is _not_ symmetric.
>>>>
>>>>Of course.  The acronym means "non uniform memory access".
>>>>
>>>>But if you think "symmetric" necessarily means "faster", maybe you'd better look
>>>>in a dictionary.
>>>
>>>You're wrong by a factor of 2 or so in latency, and by up to a factor of 5 for 128 CPUs.
>>>
>>>16 processor Alpha/Sun: $10 million
>>>64 processor Itanium2:   $1 million
>>>
>>>Why would the price difference be that large?
>>>
>>>That 64 processor SGI altix3000 thing has the best latency of any cc-NUMA
>>>machine. It's 2-3 us.
>>>
>>>Here is an 8 processor latency run on an 8 processor Altix3000, which I ran
>>>yesterday morning very early. VERY EARLY :)
>>>
>>>with just 400MB hash per CPU:
>>> Average measured read read time at 8 processes = 1039.312012 ns
>>>
>>>with just 400MB hash per CPU:
>>> Average measured read read time at 16 processes = 1207.127808 ns
>>>
>>>That is still a very good latency. SGI is simply superior to the other vendors
>>>here. Their cheap cc-NUMA machines have far better latency at low processor
>>>counts. Note that latencies might have been slightly better if IRIX were
>>>running on it instead of a Linux 2.4.20 kernel with the SGI extensions
>>>enabled. I'm not sure, though.
>>>
>>>But still, you see that even on the latest and most modern hardware one can't
>>>get under 1 us of latency with cc-NUMA.
>>>
>>>Please consider the hardware. Each brick has 2 duals. Each dual is connected
>>>with a direct link to the other dual on the brick.
>>>
>>>So you can see it kind of like a quad.
>>>
>>>At SGI 4 cpu's  latency = 280 ns (measured at TERAS - origin3800).
>>>At SGI 8 cpu's  latency =   1 us (Altix3000)
>>>At SGI 16 cpu's latency = 1.2 us (Altix3000)
>>>
>>>However, with an 8 cpu or 16 cpu shared bus the latency will never be worse
>>>than 600 ns on a modern machine, whereas for cc-NUMA it goes up and up.
>>
>>That's wrong.  16 cpus will run into _huge_ latency issues.  The BUS won't
>>be able to keep up.  That's why nobody uses a BUS on 16-way multiprocessors,
>>it just doesn't scale that far...  machines beyond 8 cpus generally are
>
>Look at SUN.

What about them?  We have some, including multiple CPU boxes.  They
perform poorly for parallel algorithms.


>
>>going to be NUMA, or they will be based on a _very_ expensive crossbar
>>to connect processors and memory.  Not a BUS.
>
>Of course. $10 million for such 16 processor machines from the past.
>$1 million for a 64 processor Itanium2 cc-NUMA.
>
>>
>>>
>>>A 512 processor cc-NUMA machine in fact has only 2 times better latency than
>>>a cluster has.
>>
>>
>>This is why discussions with you go nowhere.  You mix terms.  You redefine
>>terms.  You make up specification numbers.
>
>>There are shared memory machines.  And there are clusters.  Clusters are
>
>cc-NUMA is shared memory too.

I said that.  NUMA is _not_ a "cluster" however.

>
>You can allocate memory like:
>  a = malloc(100000000000);
>
>NO PROBLEM.
>
>Just if you by accident hit a byte that's on a far processor it's a bit slower
>:)

Again, I've already said that.  That is _the_ NUMA problem.


>
>>_not_ shared memory machines.  In a cluster, nobody talks about memory
>>latency.  Everybody talks about _network_ latency.  In a NUMA (or crossbar or
>
>Wrong.
>
>The one way pingpong test is used for all those machines at the same time :)

Nobody in their right mind does ping-pong tests on a NUMA machine.  Nor on a
pure SMP machine like a Cray.  They do it on message-passing machines _only_.
And message-passing machines are _not_ shared memory.

>
>The shared memory is only a feature the OS delivers, sometimes sped up by
>special hardware hubs :)
>
>That's why at the origin3800 the memory controller (idem for i/o controller) is
>called a hub and at the altix3000 the thing is on the paper 2 times faster and
>called shub :)
>
>>BUS) machine, memory latency is mentioned all the time.
>>But _not_ in a cluster.
>
>>> The advantage is that with a cluster you must use MPI library
>>
>>I have absolutely no idea what you are talking about.  I've been
>>programming clusters for 20 years, and I didn't "have to use MPI
>>library".  I did cluster stuff _before_ MPI existed.  Hint:  check
>>on sockets.  Not to mention PVM.  OpenMP.  UPC from Compaq.  Etc.
>
>Basically you must rewrite every memory access of Crafty into a function call,
>unless Linux makes it one big shared memory. You're too lazy to ever do that
>conversion.

Just like I was too lazy to write DTS in the first place?  _you_ had the
chance to read about it _first_ and then ask questions, and then implement
it.  I had to do it _all_ from scratch.

Talk about lazy...


>
>So unless the OS gives you the ability to do that huge malloc, I'm sure Crafty
>will never work efficiently on your 8-node quad Xeon.

I would _never_ write it that way.  Fortunately.


>
>>> and i'm not
>>>using it at the great SGI machine. I simply allocate shared memory and
>>>communicate through that with my own code. You can call it OpenMP of course,
>>>but it simply is low-level parallelism.
>>>
>>>The big advantage of cc-NUMA is that you can run jobs of, say, 32 processors
>>>with a worst-case latency of just 2 us, provided that the OS schedules well.
>>
>>NUMA scales well.  It doesn't perform that great.  NUMA is price-driven.
>
>NUMA scales well and performs well, you just must be a better programmer than
>you are. That's all.
>
>There's plenty who are.

Yes.  You aren't one of them, however...

>
>>_not_ performance-driven.
>
>Best regards,
>Vincent




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.