Computer Chess Club Archives



Subject: Re: Crafty and NUMA

Author: Vincent Diepeveen

Date: 09:05:17 09/03/03



For questions on testing supercomputers and clusters, you can email:

           Professor Aad v/d Steen
role     : head of the high-performance computing group
location : Utrecht, The Netherlands
email    : steen@phys.uu.nl

In his reports you'll find, for every machine, the one-way ping-pong test as
well. It is very important for *all* software.
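
For readers who have never run one: a one-way ping-pong test is just two ranks
bouncing a tiny message back and forth and halving the round-trip time. Below is
a minimal MPI sketch of the idea; the buffer size and iteration count are
arbitrary choices here, and this is not van der Steen's own benchmark code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 100000;
        char buf[8];                 /* tiny message: measures latency, not bandwidth */
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)               /* one-way latency = half the average round trip */
            printf("one-way latency: %.2f us\n", (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

Two ranks placed on different nodes measure the interconnect; two ranks on the
same board mostly measure shared memory.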

But don't claim that you know anything about those machines. The last time you
ran Crafty on one of them you had to commit fraud to make your results look
good, and that was a Cray with just 16 processors.

You have shown none of the above.

If you claim to have logins on such machines, show the Crafty outputs.

Thank you,
Vincent



On September 03, 2003 at 11:59:22, Robert Hyatt wrote:

>On September 03, 2003 at 09:21:34, Vincent Diepeveen wrote:
>
>>On September 02, 2003 at 22:45:04, Robert Hyatt wrote:
>>
>>>On September 02, 2003 at 18:08:40, Vincent Diepeveen wrote:
>>>
>>>>On September 02, 2003 at 00:14:02, Jeremiah Penery wrote:
>>>>
>>>>>On September 01, 2003 at 23:23:18, Robert Hyatt wrote:
>>>>>
>>>>>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>>>>>
>>>>>>>Any large (multi-node) SMP machine will have the same problem as NUMA with
>>>>>>>respect to inter-node latency.  SMP doesn't magically make node-to-node
>>>>>>>communication any faster.
>>>>>>
>>>>>>Actually it does.  SMP means symmetric.
>>>>>>
>>>>>>NUMA is _not_ symmetric.
>>>>>
>>>>>Of course.  The acronym means "non uniform memory access".
>>>>>
>>>>>But if you think "symmetric" necessarily means "faster", maybe you'd better look
>>>>>in a dictionary.
>>>>
>>>>You're wrong by a factor of 2 or so in latency, and by up to a factor of 5
>>>>for 128 CPUs.
>>>>
>>>>16-processor Alpha/Sun  : $10 million
>>>>64-processor Itanium2   :  $1 million
>>>>
>>>>Why would the price difference be like that?
>>>>
>>>>That 64-processor SGI Altix3000 has the best latency of any cc-NUMA
>>>>machine: 2-3 us.
>>>>
>>>>Here is a latency run on an 8-processor Altix3000 which I ran yesterday
>>>>morning, very early. VERY EARLY :)
>>>>
>>>>with just 400MB hash per CPU:
>>>> Average measured read read time at 8 processes = 1039.312012 ns
>>>>
>>>>with just 400MB hash per CPU:
>>>> Average measured read read time at 16 processes = 1207.127808 ns
>>>>
>>>>That is still a very good latency. SGI is simply superior here to the other
>>>>vendors. Their cheap cc-NUMA machines have far better latency when using a
>>>>low number of processors. Note that latencies might have been slightly better
>>>>if IRIX were running on it instead of a Linux 2.4.20 kernel with the SGI
>>>>extensions enabled. I'm not sure though.
>>>>
>>>>But still, you see that even with the latest and most modern hardware one
>>>>cannot get latency under 1 us when using cc-NUMA.
>>>>
>>>>Please consider the hardware. Each brick has 2 duals. Each dual is connected
>>>>with a direct link to the other dual on the brick.
>>>>
>>>>So you can see it kind of like a quad.
>>>>
>>>>At SGI, 4 CPUs : latency = 280 ns (measured at TERAS - Origin3800)
>>>>At SGI, 8 CPUs : latency =   1 us (Altix3000)
>>>>At SGI, 16 CPUs: latency = 1.2 us (Altix3000)
>>>>
>>>>However, with an 8-CPU or 16-CPU shared bus, the latency will never be worse
>>>>than 600 ns on a modern machine, whereas for cc-NUMA it goes up and up.
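
For what it's worth, a "read read time" figure like the ones quoted above
typically comes from each process doing random reads into a big table and
averaging the time per read. The sketch below shows only the core loop for a
single process with a private malloc'd table; the table size, the little
pseudo-random generator and the read count are all made up, and this is not
Diep's actual tester.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ENTRIES (50UL * 1024 * 1024)   /* 8-byte entries, roughly 400MB like a hash table */
    #define READS   1000000UL

    int main(void)
    {
        unsigned long long *tab = malloc(ENTRIES * sizeof *tab);
        unsigned long long idx = 12345, sum = 0;
        struct timespec t0, t1;
        unsigned long i;

        for (i = 0; i < ENTRIES; i++)      /* touch every entry so the pages really exist */
            tab[i] = i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < READS; i++) {
            idx = idx * 6364136223846793005ULL + 1442695040888963407ULL;
            sum += tab[idx % ENTRIES];     /* random read: defeats the caches and prefetch */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / READS;
        printf("average read time = %.1f ns (checksum %llu)\n", ns, sum);
        return 0;
    }

The real test maps one table in shared memory and runs a loop like this on
every CPU at the same time, which is where the cc-NUMA latency shows up.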
>>>
>>>That's wrong.  16 CPUs will run into _huge_ latency issues.  The BUS won't
>>>be able to keep up.  That's why nobody uses a BUS on 16-way multiprocessors;
>>>it just doesn't scale that far...  machines beyond 8 CPUs generally are
>>
>>Look at SUN.
>
>What about them?  We have some, including multiple CPU boxes.  They
>perform poorly for parallel algorithms.
>
>
>>
>>>going to be NUMA, or they will be based on a _very_ expensive crossbar
>>>to connect processors and memory.  Not a BUS.
>>
>>Of course. $10 million for such machines from the past with 16 processors.
>>$1 million for a 64-processor Itanium2 cc-NUMA.
>>
>>>
>>>>
>>>>A 512-processor cc-NUMA machine in fact has only 2 times better latency than
>>>>a cluster has.
>>>
>>>
>>>This is why discussions with you go nowhere.  You mix terms.  You redefine
>>>terms.  You make up specification numbers.
>>
>>>There are shared memory machines.  And there are clusters.  Clusters are
>>
>>cc-NUMA is shared memory too.
>
>I said that.  NUMA is _not_ a "cluster" however.
>
>>
>>You can allocate memory like:
>>  a = malloc(100000000000);
>>
>>NO PROBLEM.
>>
>>It's just that if you by accident hit a byte that's on a far processor, it's a
>>bit slower :)
>
>Again, I've already said that.  That is _the_ NUMA problem.
>
>
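On Linux the usual answer to that problem is not one blind malloc but explicit
placement, for instance with libnuma, so that each thread's data sits on its own
node. A hedged sketch of the idea only; the sizes and node choices are invented,
and nothing like this is in Crafty.

    #include <numa.h>          /* link with -lnuma */
    #include <stdio.h>

    int main(void)
    {
        size_t size = 512UL * 1024 * 1024;
        char *local, *remote;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support in this kernel\n");
            return 1;
        }

        /* memory on the node this thread runs on: the fast case */
        local = numa_alloc_local(size);

        /* memory pinned to another node (the highest-numbered one here):
           every cache miss pays the interconnect latency */
        remote = numa_alloc_onnode(size, numa_max_node());

        /* ... time random reads on both to see the local/remote gap ... */

        numa_free(local, size);
        numa_free(remote, size);
        return 0;
    }

With the kernel's default first-touch policy the same effect falls out of plain
malloc as long as each thread initialises its own pages, which is the other
common approach.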
>>
>>>_not_ shared memory machines.  In a cluster, nobody talks about memory
>>>latency.  Everybody talks about _network_ latency.  In a NUMA (or crossbar or
>>
>>Wrong.
>>
>>The one-way ping-pong test is used for all those machines at the same time :)
>
>Nobody in their right mind does ping-pong on a NUMA cluster.  Nor on a pure
>SMP cluster like a Cray.  They do it on message-passing machines _only_.  And
>message-passing machines are _not_ shared memory.
>
>>
>>Shared memory is only a feature the OS delivers, sometimes sped up by special
>>hardware hubs :)
>>
>>That's why on the Origin3800 the memory controller (same for the I/O
>>controller) is called a hub, and on the Altix3000 the thing is on paper 2
>>times faster and called a shub :)
>>
>>>BUS) machine, memory latency is mentioned all the time.
>>>But _not_ in a cluster.
>>
>>>> The advantage is that with a cluster you must use MPI library
>>>
>>>I have absolutely no idea what you are talking about.  I've been
>>>programming clusters for 20 years, and I didn't "have to use MPI
>>>library".  I did cluster stuff _before_ MPI existed.  Hint:  check
>>>on sockets.  Not to mention PVM.  OpenMP.  UPC from Compaq.  Etc.
>>
>>Basically you must rewrite every memory access in Crafty as a function call,
>>unless Linux makes it one big shared memory. You're too lazy to ever do that
>>conversion.
>
>Just like I was too lazy to write DTS in the first place?  _you_ had the
>chance to read about it _first_ and then ask questions, and then implement
>it.  I had to do it _all_ from scratch.
>
>Talk about lazy...
>
>
>>
>>So unless the OS gives you the ability to do that huge malloc, I'm sure Crafty
>>will never work efficiently on your 8-node quad Xeon.
>
>I would _never_ write it that way.  Fortunately.
>
>
>>
>>>> and I'm not
>>>>using it on the great SGI machine. I simply allocate shared memory and
>>>>communicate through that with my own code. You can call it OpenMP of course,
>>>>but it simply is low-level parallelism.
>>>>
>>>>The big advantage of cc-NUMA is that you can run jobs of, say, 32 processors
>>>>or so with a worst-case latency of just 2 us, on the condition that the OS
>>>>schedules well.
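
"Allocate shared memory and communicate through that" usually amounts to one
named mapping that every process attaches to, with plain loads and stores
instead of MPI calls. Below is a minimal POSIX sketch of that pattern; the
segment name, mailbox layout and message are invented here, this is not Diep's
code.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct mailbox {
        volatile int flag;          /* 0 = empty, 1 = message waiting */
        char text[256];
    };

    int main(void)
    {
        /* every process opens the same named segment; link with -lrt */
        int fd = shm_open("/engine_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        ftruncate(fd, sizeof(struct mailbox));

        struct mailbox *box = mmap(NULL, sizeof *box, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (box == MAP_FAILED) { perror("mmap"); return 1; }

        /* one side writes, the other polls the flag: no message-passing library involved */
        strcpy(box->text, "split point ready");
        box->flag = 1;

        munmap(box, sizeof *box);
        close(fd);
        return 0;
    }

On IRIX or with SysV shmget the details differ, but the principle is the same:
the OS hands out one mapping and the hardware keeps it coherent.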
>>>
>>>NUMA scales well.  It doesn't perform that great.  NUMA is price-driven.
>>
>>NUMA scales well and performs well; you just must be a better programmer than
>>you are. That's all.
>>
>>There are plenty who are.
>
>Yes.  You aren't one of them, however...
>
>>
>>>_not_ performance-driven.
>>
>>Best regards,
>>Vincent


