Author: Robert Hyatt
Date: 08:59:22 09/03/03
On September 03, 2003 at 09:21:34, Vincent Diepeveen wrote:

>On September 02, 2003 at 22:45:04, Robert Hyatt wrote:
>
>>On September 02, 2003 at 18:08:40, Vincent Diepeveen wrote:
>>
>>>On September 02, 2003 at 00:14:02, Jeremiah Penery wrote:
>>>
>>>>On September 01, 2003 at 23:23:18, Robert Hyatt wrote:
>>>>
>>>>>On September 01, 2003 at 09:39:55, Jeremiah Penery wrote:
>>>>>
>>>>>>Any large (multi-node) SMP machine will have the same problem as NUMA
>>>>>>with respect to inter-node latency. SMP doesn't magically make
>>>>>>node-to-node communication any faster.
>>>>>
>>>>>Actually it does. SMP means symmetric.
>>>>>
>>>>>NUMA is _not_ symmetric.
>>>>
>>>>Of course. The acronym means "non-uniform memory access".
>>>>
>>>>But if you think "symmetric" necessarily means "faster", maybe you'd
>>>>better look in a dictionary.
>>>
>>>You're wrong by a factor of 2 or so in latency, and up to a factor of 5
>>>for 128 cpus.
>>>
>>>16 processor alpha/sun : 10 mln $
>>>64 processor itanium2  :  1 mln $
>>>
>>>Why would the price difference be like that?
>>>
>>>That 64 processor SGI Altix3000 thing has the best latency of any cc-NUMA
>>>machine. It's 2-3 us.
>>>
>>>Here is a latency run on an 8 processor Altix3000 which I ran yesterday
>>>morning very early. VERY EARLY :)
>>>
>>>with just 400MB hash per cpu:
>>>  Average measured read time at 8 processes  = 1039.312012 ns
>>>
>>>with just 400MB hash per cpu:
>>>  Average measured read time at 16 processes = 1207.127808 ns
>>>
>>>That is still a very good latency. SGI is simply superior to the other
>>>vendors here. Their cheap cc-NUMA machines have far better latency when
>>>using a low number of processors. Note that latencies might have been
>>>slightly better if IRIX were running on it instead of a Linux 2.4.20
>>>kernel with the SGI extensions enabled. I'm not sure though.
>>>
>>>But still you see that even on the newest, most modern hardware one can't
>>>get latency under 1 us when using cc-NUMA.
>>>
>>>Please consider the hardware. Each brick has 2 duals. Each dual is
>>>connected with a direct link to the other dual on the brick.
>>>
>>>So you can see it kind of like a quad.
>>>
>>>At SGI,  4 cpus: latency =  280 ns (measured at TERAS - Origin3800)
>>>At SGI,  8 cpus: latency =    1 us (Altix3000)
>>>At SGI, 16 cpus: latency =  1.2 us (Altix3000)
>>>
>>>However, with an 8 cpu or 16 cpu shared bus the latency will never be
>>>worse than 600 ns on a modern machine, whereas for cc-NUMA it goes up
>>>and up.
>>
>>That's wrong. 16 cpus will run into _huge_ latency issues. The BUS won't
>>be able to keep up. That's why nobody uses a BUS on 16-way multiprocessors,
>>it just doesn't scale that far... machines beyond 8 cpus generally are
>
>Look at SUN.

What about them? We have some, including multiple-CPU boxes. They perform
poorly for parallel algorithms.

>
>>going to be NUMA, or they will be based on a _very_ expensive crossbar
>>to connect processors and memory. Not a BUS.
>
>Of course. $10 mln for such machines from the past at 16 processors.
>$1 mln for a 64 processor itanium2 cc-NUMA.
>
>>
>>>
>>>A 512 processor cc-NUMA in fact has only about 2 times better latency
>>>than a cluster.
>>
>>This is why discussions with you go nowhere. You mix terms. You redefine
>>terms. You make up specification numbers.
>
>>There are shared memory machines. And there are clusters. Clusters are
>
>cc-NUMA is shared memory too.

I said that. NUMA is _not_ a "cluster" however.

>
>You can allocate memory like:
>  a = malloc(100000000000);
>
>NO PROBLEM.
>
>Just if you by accident hit a byte that's on a far processor, it's a bit
>slower :)

Again, I've already said that. That is _the_ NUMA problem.

>
>>_not_ shared memory machines. In a cluster, nobody talks about memory
>>latency. Everybody talks about _network_ latency. In a NUMA (or crossbar or
>
>Wrong.
>
>The one-way ping-pong test is used for all those machines at the same
>time :)

Nobody in their right mind has done ping-pong on a NUMA machine. Nor on a
pure SMP machine like a Cray. They do it on message-passing machines _only_.
And message-passing machines are _not_ shared memory.

>
>The shared memory is only a feature the OS delivers, sometimes sped up by
>special hardware hubs :)
>
>That's why at the Origin3800 the memory controller (idem for the i/o
>controller) is called a hub, and at the Altix3000 the thing is on paper
>2 times faster and is called a shub :)
>
>>BUS) machine, memory latency is mentioned all the time.
>>But _not_ in a cluster.
>
>>> The advantage is that with a cluster you must use the MPI library
>>
>>I have absolutely no idea what you are talking about. I've been
>>programming clusters for 20 years, and I didn't "have to use MPI
>>library". I did cluster stuff _before_ MPI existed. Hint: check
>>on sockets. Not to mention PVM. OpenMP. UPC from Compaq. Etc.
>
>Basically you must rewrite every memory access of Crafty into a function
>call, unless Linux makes it one big shared memory. You're too lazy to
>ever do that conversion.

Just like I was too lazy to write DTS in the first place? _you_ had the
chance to read about it _first_, then ask questions, and then implement it.
I had to do it _all_ from scratch. Talk about lazy...

>
>So unless the OS gives you the ability to do that huge malloc, I'm sure
>Crafty will never work efficiently on your 8 node quad xeon.

I would _never_ write it that way. Fortunately.

>
>>> and I'm not
>>>using it at the great SGI machine. I simply allocate shared memory and
>>>communicate through that with my own code. You can call it OpenMP of
>>>course, but it simply is low-level parallelism.
>>>
>>>The big advantage of cc-NUMA is that you can run jobs of, say, 32
>>>processors with a worst case latency of just 2 us, provided the OS
>>>schedules well.
>>
>>NUMA scales well. It doesn't perform that great. NUMA is price-driven,
>
>NUMA scales well and performs well; you just must be a better programmer
>than you are. That's all.
>
>There's plenty who are.

Yes. You aren't one of them, however...

>
>>_not_ performance-driven.
>
>Best regards,
>Vincent
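
The "allocate shared memory and communicate through it" approach described
above, and the huge malloc the thread jokes about, can be sketched in a few
lines of POSIX C. This is a minimal illustration only, assuming an mmap()
region plus a fork()ed worker; the 400MB size is arbitrary and none of this
is Diep's or Crafty's actual code:

/*
 * Minimal sketch of shared-memory parallelism via mmap() + fork().
 * The region size and the single worker are illustrative assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t bytes = 400UL * 1024 * 1024;          /* e.g. a 400MB hash */

    /* MAP_SHARED | MAP_ANONYMOUS memory is visible to every process
     * forked after this call -- no MPI, no sockets. */
    unsigned char *hash = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (hash == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {        /* child: one "search" process */
        hash[0] = 42;         /* an ordinary store, not a function call */
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", hash[0]);         /* prints 42 */

    /* On a cc-NUMA box that store may land in another node's memory;
     * it still works, it is just slower -- the point of the thread. */
    munmap(hash, bytes);
    return 0;
}

Every process forked after the mmap() sees the same bytes at the same
address, which is why a hash probe stays an ordinary load rather than a
message.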
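The "average measured read time" figures quoted above are the output of a
memory-latency microbenchmark. The exact test is not shown in the thread,
but the usual technique is a dependent pointer chase through a buffer much
larger than cache, so no read can overlap the next. A rough single-process
sketch, with buffer size, read count, and permutation scheme chosen
arbitrarily:

/*
 * Read-latency microbenchmark sketch (not the test from the thread).
 * Each read depends on the previous one, so the time per iteration
 * approximates memory latency for a cache-hostile access pattern.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS (100u * 1024u * 1024u / 4u)   /* 100MB of 4-byte slots */
#define READS 10000000u

int main(void) {
    unsigned i, idx, *buf = malloc(SLOTS * sizeof *buf);
    if (!buf) return 1;

    /* Sattolo's algorithm: turn buf into one random cycle, so the
     * chase below visits slots in a random order with no short loops. */
    for (i = 0; i < SLOTS; i++) buf[i] = i;
    srand(12345);
    for (i = SLOTS - 1; i > 0; i--) {
        unsigned j = (unsigned)rand() % i, t = buf[i];
        buf[i] = buf[j]; buf[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (idx = 0, i = 0; i < READS; i++)
        idx = buf[idx];                      /* one dependent read */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print idx so the compiler cannot discard the chase. */
    printf("average read time = %.1f ns (idx=%u)\n", ns / READS, idx);
    free(buf);
    return 0;
}

Run with one copy per cpu, as in the 8- and 16-process numbers above, the
same chase also exercises remote-node accesses on a cc-NUMA machine, which
is why the averages climb with the process count.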