Author: Robert Hyatt
Date: 10:31:45 09/03/03
On September 03, 2003 at 12:05:17, Vincent Diepeveen wrote:

>For questions on testing supercomputers and clusters you can email to
>
> Professor Aad v/d Steen
>leading : high performance computing group
>location : Utrecht, The Netherlands
>email : steen@phys.uu.nl
>
>At his reports you'll find for every machine of course the one way pingpong test
>too. it is very important for *all* software.

How do you do a ping-pong test on a Cray T932? I just store something in memory
and every other processor can see it _instantly_. Where "instantly" = about 120
nanoseconds for memory latency.

>
>But don't say that you know a thing from those machines. Last time you ran
>crafty on one of them you had to commit fraud to let your results look good,
>also that was a cray with just 16 processors.

I didn't commit _any_ sort of fraud. The only fraud committed here is from
yourself, as always...

>
>Not anything of the above.
>
>If you say you have logins at such machines. Show the crafty outputs.

For "what machines"? A Cray? I've already done that. For the Compaq NUMA box?
An NDA prevents that. For 2-way and 4-way boxes the output is all over the
place, everyone is posting their results...

>
>Thank you,
>Vincent
>
>
>
>On September 03, 2003 at 11:59:22, Robert Hyatt wrote:
>
>>On September 03, 2003 at 09:21:34, Vincent Diepeveen wrote:
>>
>>>On September 02, 2003 at 22:45:04, Robert Hyatt wrote:
>>>
>>>>On September 02, 2003 at 18:08:40, Vincent Diepeveen wrote:
>>>>
>>>>>On September 02, 2003 at 00:14:02, Jeremiah Penery wrote:
>>>>>
>>>>>>On September 01, 2003 at 23:23:18, Robert Hyatt wrote:
>>>>>>
>>>>>>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>>>>>>
>>>>>>>>Any large (multi-node) SMP machine will have the same problem as NUMA with
>>>>>>>>respect to inter-node latency. SMP doesn't magically make node-to-node
>>>>>>>>communication any faster.
>>>>>>>
>>>>>>>Actually it does. SMP means symmetric.
>>>>>>>
>>>>>>>NUMA is _not_ symmetric.
>>>>>>
>>>>>>Of course. The acronym means "non uniform memory access".
>>>>>>
>>>>>>But if you think "symmetric" necessarily means "faster", maybe you'd better look
>>>>>>in a dictionary.
>>>>>
>>>>>You're wrong by a factor 2 or so in latency and up to factor 5 for 128 cpu's.
>>>>>
>>>>>16 processor alpha/sun : 10 mln $
>>>>>64 processor itanium2 : 1 mln $
>>>>>
>>>>>Why would that price difference be like that?
>>>>>
>>>>>That 64 processor SGI altix3000 thing has the best latency of any cc-NUMA
>>>>>machine. It's 2-3 us.
>>>>>
>>>>>Here is a 8 processor latency run at 8 processor Altix3000 which i ran yesterday
>>>>>morning very early. VERY EARLY :)
>>>>>
>>>>>with just 400MB hash a cpu:
>>>>> Average measured read read time at 8 processes = 1039.312012 ns
>>>>>
>>>>>with just 400MB hash a cpu:
>>>>> Average measured read read time at 16 processes = 1207.127808 ns
>>>>>
>>>>>That is still a very good latency. SGI is superior simply here to other vendors.
>>>>>Their cheap cc-NUMA machines are very superior in latency when using low number
>>>>>of processors. Note that latencies might have been slightly faster when IRIX
>>>>>would run at it instead of linux 2.4.20-sgi extensions enabled kernel. I'm not
>>>>>sure though.
>>>>>
>>>>>But still you see the latest and most modern and newest hardware one can't even
>>>>>get under 1 us with latency when using cc-NUMA.
>>>>>
>>>>>Please consider the hardware. Each brick has 2 duals. Each dual is connected
>>>>>with a direct link to that other dual on the brick.
>>>>>
>>>>>So you can see it kind of like a quad.
>>>>>
>>>>>At SGI 4 cpu's latency = 280 ns (measured at TERAS - origin3800).
>>>>>At SGI 8 cpu's latency = 1 us (Altix3000)
>>>>>At SGI 16 cpu's latency = 1.2 us (Altix3000)
>>>>>
>>>>>However 8 cpu shared bus or 16 cpu shared bus the latency will be never worse
>>>>>than 600 ns at a modern machine, where for CC-NUMA it goes up and up.
>>>>
>>>>That's wrong. 16 cpus will run into _huge_ latency issues. The BUS won't
>>>>be able to keep up. That's why nobody uses a BUS on 16-way multiprocessors,
>>>>it just doesn't scale that far... machines beyond 8 cpus generally are
>>>
>>>Look at SUN.
>>
>>What about them? We have some, including multiple CPU boxes. They
>>perform poorly for parallel algorithms.
>>
>>
>>>
>>>>going to be NUMA, or they will be based on a _very_ expensive crossbar
>>>>to connect processors and memory. Not a BUS.
>>>
>>>Of course. $10 mln for such machines from the past at 16 processors.
>>>$1 mln for a 64 processor itanium2 cc-NUMA
>>>
>>>>
>>>>>
>>>>>A 512 processor cc-NUMA in fact is only 2 times faster latency than a cluster
>>>>>has.
>>>>
>>>>
>>>>This is why discussions with you go nowhere. You mix terms. You redefine
>>>>terms. You make up specification numbers.
>>>
>>>>There are shared memory machines. And there are clusters. Clusters are
>>>
>>>cc-NUMA is shared memory too.
>>
>>I said that. NUMA is _not_ a "cluster" however.
>>
>>>
>>>You can allocate memory like:
>>> a = malloc(100000000000);
>>>
>>>NO PROBLEM.
>>>
>>>Just if you by accident hit a byte that's on a far processor it's a bit slower
>>>:)
>>
>>Again, I've already said that. That is _the_ NUMA problem.
>>
>>
>>>
>>>>_not_ shared memory machines. In a cluster, nobody talks about memory
>>>>latency. Everybody talks about _network_ latency. In a NUMA (or crossbar or
>>>
>>>Wrong.
>>>
>>>The one way pingpong test is used for all those machines at the same time :)
>>
>>Nobody in their right mind would do ping-pong on a NUMA machine. Nor on a pure
>>SMP machine like a Cray. They do it on message-passing machines _only_. And
>>message-passing machines are _not_ shared memory.
>>
>>>
>>>The shared memory is only a feature the OS delivers, sometimes speeded up by
>>>special hardware hubs :)
>>>
>>>That's why at the origin3800 the memory controller (idem for i/o controller) is
>>>called a hub and at the altix3000 the thing is on the paper 2 times faster and
>>>called shub :)
>>>
>>>>BUS) machine, memory latency is mentioned all the time.
>>>>But _not_ in a cluster.
>>>
>>>>> The advantage is that with a cluster you must use MPI library
>>>>
>>>>I have absolutely no idea what you are talking about. I've been
>>>>programming clusters for 20 years, and I didn't "have to use MPI
>>>>library". I did cluster stuff _before_ MPI existed. Hint: check
>>>>on sockets. Not to mention PVM. OpenMP. UPC from Compaq. Etc.
>>>
>>>Basically you must rewrite every memory access of crafty to a function call,
>>>unless linux is making it one big shared memory. You're too lazy to ever do that
>>>converting.
>>
>>Just like I was too lazy to write DTS in the first place? _you_ had the
>>chance to read about it _first_ and then ask questions, and then implement
>>it. I had to do it _all_ from scratch.
>>
>>Talk about lazy...
>>
>>
>>>
>>>So unless the OS gives you the ability to do that huge malloc, i'm sure crafty
>>>will be never efficiently working at your 8 node quad xeon.
>>
>>I would _never_ write it that way. Fortunately.
>>
>>
>>>
>>>>> and i'm not
>>>>>using it at the great SGI machine. I simply allocate shared memory and
>>>>>communicate through that with my own code. You can call it openMP of course, but
>>>>>it simply is a low level parallellism.
>>>>>
>>>>>The big advantage of cc-NUMA is that you can run jobs of say a processor or 32
>>>>>with just worst case 2 us latency, under the condition that the OS schedules
>>>>>well.
>>>>
>>>>NUMA scales well. It doesn't perform that great. NUMA is price-driven.
>>>
>>>NUMA scales well and performs well, you just must be a better programmer than you
>>>are. That's all.
>>>
>>>There's plenty who are.
>>
>>Yes. You aren't one of them, however...
>>
>>>
>>>>_not_ performance-driven.
>>>
>>>Best regards,
>>>Vincent
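
For reference, the "one way pingpong test" argued about above is a
message-passing benchmark: two processes bounce a small message back and forth,
and half the round-trip time is reported as one-way latency. A minimal sketch,
assuming any MPI implementation is available (the 8-byte message and 10000
iterations are arbitrary choices, not the parameters used in van der Steen's
reports):

  /* Illustrative one-way ping-pong latency sketch; run with 2 MPI ranks.
     Not the benchmark from the reports mentioned above. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, i, iters = 10000;
      char buf[8] = {0};
      MPI_Status st;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Barrier(MPI_COMM_WORLD);

      double t0 = MPI_Wtime();
      for (i = 0; i < iters; i++) {
          if (rank == 0) {                       /* ping ... */
              MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
          } else if (rank == 1) {                /* ... pong */
              MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
              MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();

      if (rank == 0)   /* half the round trip approximates one-way latency */
          printf("one-way latency ~ %.3f us\n",
                 (t1 - t0) / iters / 2.0 * 1e6);
      MPI_Finalize();
      return 0;
  }

On a message-passing cluster this measures network latency. On a shared-memory
machine it only measures how fast the MPI library can push data through memory,
which is why the test says little about a Cray T932, where a plain load already
completes in the ~120 ns range quoted above.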
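The "Average measured read read time" figures quoted above are memory-latency
measurements, not message latency. A common way to measure that on SMP or
cc-NUMA hardware is pointer chasing: walk a randomly ordered cycle through a
large array so every load depends on the previous one and prefetching cannot
hide the latency. A rough single-process sketch (the 256 MB footprint, the
xorshift generator, and the clock_gettime timing are my own choices, not
whatever program produced the numbers above):

  /* Rough pointer-chasing read-latency sketch; not the program that
     produced the Altix numbers quoted above. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (32u * 1024 * 1024)          /* 32M entries * 8 bytes = 256 MB */

  static unsigned long long rng = 88172645463325252ULL;
  static unsigned long long xorshift64(void) {   /* tiny PRNG, avoids RAND_MAX limits */
      rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
      return rng;
  }

  int main(void) {
      size_t *next = malloc((size_t)N * sizeof *next);
      size_t i, j, tmp, p = 0;
      struct timespec t0, t1;

      /* Sattolo's algorithm: one N-long cycle, so the walk never revisits
         an entry early and defeats hardware prefetching. */
      for (i = 0; i < N; i++) next[i] = i;
      for (i = N - 1; i > 0; i--) {
          j = (size_t)(xorshift64() % i);
          tmp = next[i]; next[i] = next[j]; next[j] = tmp;
      }

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (i = 0; i < N; i++) p = next[p];       /* each load depends on the last */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      printf("avg read latency ~ %.1f ns (ignore: %zu)\n", ns / N, p);
      return 0;
  }

Running several copies at once, or binding the memory to a remote node with
something like numactl, is what pushes the average from the few-hundred-
nanosecond local figure toward the 1+ microsecond cc-NUMA figures above; that
is the "byte that's on a far processor" penalty being argued about.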
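"I simply allocate shared memory and communicate through that with my own code"
boils down to mapping one region that every process can read and write, then
coordinating through flags or locks stored inside it. A stripped-down sketch of
that style on Unix, assuming POSIX mmap with anonymous shared mappings (the
flag/value layout and the spin-wait are illustrative only, not Diep's or
Crafty's actual code):

  /* Illustrative only: two processes communicating through one shared
     mapping. Not the actual Diep/Crafty shared-memory layout. */
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  struct shared {
      volatile int flag;    /* 0 = empty, 1 = child stored a value */
      volatile int value;
  };

  int main(void) {
      /* Map the region before fork() so both processes share the same pages;
         System V shmget()/shmat() would work the same way. */
      struct shared *sh = mmap(NULL, sizeof *sh, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      if (sh == MAP_FAILED) { perror("mmap"); return 1; }
      sh->flag = 0;

      if (fork() == 0) {            /* child: write straight into shared memory */
          sh->value = 4242;
          sh->flag  = 1;
          _exit(0);
      }

      while (sh->flag == 0)         /* parent: spin until the value shows up;  */
          ;                         /* real code would use atomics or a lock   */
      printf("parent read %d from shared memory\n", sh->value);
      wait(NULL);
      return 0;
  }

Whether the box is a bus-based SMP, a crossbar Cray, or a cc-NUMA Altix, code
like this looks the same; what changes is how many nanoseconds the read of a
remote sh->value costs, which is exactly what the latency argument above is
about.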