Computer Chess Club Archives



Subject: Re: Crafty and NUMA

Author: Robert Hyatt

Date: 10:31:45 09/03/03



On September 03, 2003 at 12:05:17, Vincent Diepeveen wrote:

>For questions on testing supercomputers and clusters you can email to
>
>           Professor Aad v/d Steen
>leading  : high performance computing group
>location : Utrecht, The Netherlands
>email    : steen@phys.uu.nl
>
>In his reports you'll find, for every machine, the one-way ping-pong test
>too. It is very important for *all* software.

How do you do a ping-pong test on a Cray T932?  I just store something in
memory and every other processor can see it _instantly_, where "instantly"
means about 120 nanoseconds of memory latency.
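
(For reference, the one-way ping-pong test mentioned above is a
message-passing benchmark.  A minimal sketch of one in MPI -- not anyone's
actual test code; the message size and iteration count are arbitrary
choices -- looks like this, run with two ranks, e.g. mpirun -np 2:

    /* Minimal MPI ping-pong sketch: rank 0 bounces a small message off
       rank 1; half the average round-trip approximates the one-way
       latency.  Hypothetical example only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        char buf[8] = {0};
        const int iters = 10000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency ~ %.3f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }

On a flat shared-memory machine like the T932 there is nothing to "ping":
the analogous number is just the latency of an ordinary load, i.e. the
~120 ns above.)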

>
>But don't say that you know a thing about those machines. The last time you ran
>Crafty on one of them you had to commit fraud to make your results look good,
>and that was a Cray with just 16 processors.

I didn't commit _any_ sort of fraud.  The only fraud committed here is by
you, as always...


>
>None of the above.
>
>If you say you have logins on such machines, show the Crafty outputs.

For "what machines"?  A Cray?  I've already done that.

For the Compaq NUMA box?  An NDA prevents that.

For 2-way and 4-way boxes the output is all over the place; everyone is
posting their results...


>
>Thank you,
>Vincent
>
>
>
>On September 03, 2003 at 11:59:22, Robert Hyatt wrote:
>
>>On September 03, 2003 at 09:21:34, Vincent Diepeveen wrote:
>>
>>>On September 02, 2003 at 22:45:04, Robert Hyatt wrote:
>>>
>>>>On September 02, 2003 at 18:08:40, Vincent Diepeveen wrote:
>>>>
>>>>>On September 02, 2003 at 00:14:02, Jeremiah Penery wrote:
>>>>>
>>>>>>On September 01, 2003 at 23:23:18, Robert Hyatt wrote:
>>>>>>
>>>>>>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>>>>>>
>>>>>>>>Any large (multi-node) SMP machine will have the same problem as NUMA with
>>>>>>>>respect to inter-node latency.  SMP doesn't magically make node-to-node
>>>>>>>>communication any faster.
>>>>>>>
>>>>>>>Actually it does.  SMP means symmetric.
>>>>>>>
>>>>>>>NUMA is _not_ symmetric.
>>>>>>
>>>>>>Of course.  The acronym means "non uniform memory access".
>>>>>>
>>>>>>But if you think "symmetric" necessarily means "faster", maybe you'd better look
>>>>>>in a dictionary.
>>>>>
>>>>>You're wrong by a factor of 2 or so in latency, and up to a factor of 5 for
>>>>>128 CPUs.
>>>>>
>>>>>16-processor Alpha/Sun : $10 million
>>>>>64-processor Itanium2  :  $1 million
>>>>>
>>>>>Why would the price difference be like that?
>>>>>
>>>>>That 64-processor SGI Altix3000 has the best latency of any cc-NUMA
>>>>>machine: 2-3 us.
>>>>>
>>>>>Here is a latency run on an 8-processor Altix3000 which I ran yesterday
>>>>>morning very early. VERY EARLY :)
>>>>>
>>>>>with just 400 MB of hash per CPU:
>>>>> Average measured read read time at 8 processes = 1039.312012 ns
>>>>>
>>>>>with just 400 MB of hash per CPU:
>>>>> Average measured read read time at 16 processes = 1207.127808 ns
>>>>>
>>>>>That is still very good latency. SGI is simply superior to the other
>>>>>vendors here. Their cheap cc-NUMA machines have far better latency at low
>>>>>processor counts. Note that latencies might have been slightly better under
>>>>>IRIX instead of a Linux 2.4.20 kernel with the SGI extensions enabled. I'm
>>>>>not sure, though.
>>>>>
>>>>>But still, you see that even on the latest and most modern hardware one
>>>>>can't get under 1 us of latency when using cc-NUMA.
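
(As an aside: a read-latency figure like the ones quoted above typically
comes from a dependent-load pointer chase, where every read must wait for
the previous one to complete.  A rough sketch of that kind of
microbenchmark -- not Vincent's actual code; buffer size and counts are
arbitrary:

    /* Build a pointer cycle through a large buffer and chase it, so
       each load stalls on the last; elapsed time / loads ~= read
       latency.  Hypothetical example only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SLOTS (1UL << 25)        /* 32M pointers = 256 MB */

    int main(void) {
        void **buf = malloc(SLOTS * sizeof(void *));
        if (!buf) return 1;

        /* Odd stride, so the cycle visits every slot; a random
           permutation would defeat prefetching even better. */
        const size_t stride = 4099;
        for (size_t i = 0; i < SLOTS; i++)
            buf[i] = &buf[(i + stride) % SLOTS];

        void **p = &buf[0];
        const long iters = 10 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            p = (void **)*p;          /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        /* printing p keeps the compiler from deleting the loop */
        printf("avg read latency = %.1f ns (%p)\n", ns / iters, (void *)p);
        free(buf);
        return 0;
    }

)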
>>>>>
>>>>>Please consider the hardware. Each brick has 2 duals. Each dual is connected
>>>>>with a direct link to the other dual on the brick.
>>>>>
>>>>>So you can view it somewhat like a quad.
>>>>>
>>>>>At SGI  4 CPUs latency = 280 ns (measured at TERAS - Origin3800).
>>>>>At SGI  8 CPUs latency =   1 us (Altix3000)
>>>>>At SGI 16 CPUs latency = 1.2 us (Altix3000)
>>>>>
>>>>>However, with an 8-CPU or 16-CPU shared bus the latency will never be worse
>>>>>than 600 ns on a modern machine, whereas for cc-NUMA it goes up and up.
>>>>
>>>>That's wrong.  16 CPUs will run into _huge_ latency issues.  The BUS won't
>>>>be able to keep up.  That's why nobody uses a BUS on 16-way multiprocessors;
>>>>it just doesn't scale that far...  machines beyond 8 CPUs generally are
>>>
>>>Look at SUN.
>>
>>What about them?  We have some, including multiple CPU boxes.  They
>>perform poorly for parallel algorithms.
>>
>>
>>>
>>>>going to be NUMA, or they will be based on a _very_ expensive crossbar
>>>>to connect processors and memory.  Not a BUS.
>>>
>>>Of course.  $10 million for such 16-processor machines in the past.
>>>$1 million for a 64-processor Itanium2 cc-NUMA.
>>>
>>>>
>>>>>
>>>>>A 512-processor cc-NUMA machine in fact has latency only 2 times better
>>>>>than a cluster has.
>>>>
>>>>
>>>>This is why discussions with you go nowhere.  You mix terms.  You redefine
>>>>terms.  You make up specification numbers.
>>>
>>>>There are shared memory machines.  And there are clusters.  Clusters are
>>>
>>>cc-NUMA is shared memory too.
>>
>>I said that.  NUMA is _not_ a "cluster" however.
>>
>>>
>>>You can allocate memory like:
>>>  a = malloc(100000000000);
>>>
>>>NO PROBLEM.
>>>
>>>It's just that if you happen to hit a byte that's on a far processor, it's a
>>>bit slower :)
>>
>>Again, I've already said that.  That is _the_ NUMA problem.
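
(That locality effect can be shown directly.  A minimal sketch, assuming a
Linux box with libnuma installed and linking with -lnuma -- which is not
what either program actually does:

    /* Hypothetical illustration of NUMA placement: one big allocation
       succeeds, but which node the pages land on determines latency. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }
        size_t size = 1UL << 30;      /* 1 GB */

        /* Pin the whole region to node 0.  A thread running on a far
           node pays the remote-access penalty on every miss into it. */
        char *mem = numa_alloc_onnode(size, 0);
        if (!mem) return 1;

        mem[0] = 1;                   /* touch it so pages are realized */
        numa_free(mem, size);
        return 0;
    }

)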
>>
>>
>>>
>>>>_not_ shared memory machines.  In a cluster, nobody talks about memory
>>>>latency.  Everybody talks about _network_ latency.  In a NUMA (or crossbar or
>>>
>>>Wrong.
>>>
>>>The one-way ping-pong test is used for all of those machines alike :)
>>
>>Nobody in their right mind does a ping-pong test on a NUMA machine.  Nor on a
>>pure SMP machine like a Cray.  They do it on message-passing machines _only_.
>>And message-passing machines are _not_ shared memory.
>>
>>>
>>>The shared memory is only a feature the OS delivers, sometimes sped up by
>>>special hardware hubs :)
>>>
>>>That's why on the Origin3800 the memory controller (likewise the I/O
>>>controller) is called a hub, and on the Altix3000 the thing is on paper 2
>>>times faster and called a shub :)
>>>
>>>>BUS) machine, memory latency is mentioned all the time.
>>>>But _not_ in a cluster.
>>>
>>>>> The advantage is that with a cluster you must use an MPI library
>>>>
>>>>I have absolutely no idea what you are talking about.  I've been
>>>>programming clusters for 20 years, and I didn't "have to use an MPI
>>>>library".  I did cluster stuff _before_ MPI existed.  Hint:  look at
>>>>sockets.  Not to mention PVM, OpenMP, and UPC from Compaq.  Etc.
>>>
>>>Basically you must rewrite every memory access in Crafty as a function call,
>>>unless Linux presents it as one big shared memory.  You're too lazy to ever
>>>do that conversion.
>>
>>Just like I was too lazy to write DTS in the first place?  _You_ had the
>>chance to read about it _first_, then ask questions, and then implement it.
>>I had to do it _all_ from scratch.
>>
>>Talk about lazy...
>>
>>
>>>
>>>So unless the OS gives you the ability to do that huge malloc, I'm sure
>>>Crafty will never work efficiently on your 8-node quad Xeon.
>>
>>I would _never_ write it that way.  Fortunately.
>>
>>
>>>
>>>>> and I'm not
>>>>>using it on the great SGI machine. I simply allocate shared memory and
>>>>>communicate through that with my own code. You can call it OpenMP of
>>>>>course, but it simply is low-level parallelism.
>>>>>
>>>>>The big advantage of cc-NUMA is that you can run jobs of, say, a processor
>>>>>or 32 with a worst-case latency of just 2 us, provided the OS schedules
>>>>>well.
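
(For reference, the low-level shared-memory approach described above can be
done with plain System V calls.  A minimal sketch -- not Vincent's actual
code; the segment size is arbitrary:

    /* Create a System V shared segment, attach it, and read/write it
       with ordinary loads and stores -- no message-passing library, no
       function call per memory access.  Hypothetical example only. */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void) {
        size_t size = 400UL * 1024 * 1024;   /* e.g. 400 MB of hash */

        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        char *mem = shmat(id, NULL, 0);      /* forked workers inherit this */
        if (mem == (void *)-1) { perror("shmat"); return 1; }

        /* an ordinary store, visible to every attached process */
        mem[0] = 42;

        shmdt(mem);
        shmctl(id, IPC_RMID, NULL);          /* mark segment for removal */
        return 0;
    }

Workers forked after the shmat() inherit the mapping, which is how an
engine can share its hash tables among processes without MPI.)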
>>>>
>>>>NUMA scales well.  It doesn't perform that great.  NUMA is price-driven.
>>>
>>>NUMA scales well and performs well; you just have to be a better programmer
>>>than you are.  That's all.
>>>
>>>There's plenty who are.
>>
>>Yes.  You aren't one of them, however...
>>
>>>
>>>>_not_ performance-driven.
>>>
>>>Best regards,
>>>Vincent


