Author: Robert Hyatt
Date: 08:59:11 02/03/05
On February 03, 2005 at 11:17:31, Vincent Diepeveen wrote:

>On February 02, 2005 at 11:54:46, Robert Hyatt wrote:
>
>>On February 02, 2005 at 09:53:03, Vincent Diepeveen wrote:
>>
>>>On February 01, 2005 at 21:51:19, Robert Hyatt wrote:
>>>
>>>>On February 01, 2005 at 17:19:26, Vincent Diepeveen wrote:
>>>>
>>>>>On February 01, 2005 at 16:28:22, Robert Hyatt wrote:
>>>>>
>>>>>Still didn't read the subject title?
>>>>>
>>>>>[snip]
>>>>>
>>>>>>Because a cluster can't offer 1/100th the total memory bandwidth of a big Cray
>>>>>>vector box.
>>>>>
>>>>>Actually todays clusters deliver a factor 1000 more or so.
>>>>>
>>>>>Total bandwidth a cluster can deliver is measured nowadays in Terabytes per
>>>>>second, with Cray it was measured in gigabytes per second.
>>>>
>>>>Let's see. The last Cray I ran on with a chess program was a T932. Processor
>>>>could read 4 words and write two words per cycle, cycle time was 2ns. So 6
>>>>words, 48 bytes per cycle, x 500M cycles per second is about 2.5 gigabytes per
>>>>second, x 32 processors is getting dangerously close to 100 gigabytes per
>>>
>>>Bandwidth a cpu at the old MIPS was 3.2 GB/s from memory (origin3000 series)
>>>and bandwidth at altix3000 using network4 a cpu is 8.2 gigabyte per second from
>>>memory.
>>>
>>>So what Cray streamed there was impressive for its days, but it delivered to
>>>just a few cpu's, that was the entire main problem. This for massive power
>>>consumption.
>>>
>>>What we speak of now is that you get effectively the same bandwidth from memory
>>>to each cpu now, but systems go up to 130000+ processors.
>>>
>>>>second. A "cluster" can have more theoretical bandwidth, but rarely as much
>>>>_real_ bandwidth. This is on a shared memory machine that can do real codes
>>>>quite well.
>>>
>>>>>
>>>>>Note it's the same network that gets used for huge Cray T3E's, but a newer and
>>>>>bigger version, that's all.
>>>>
>>>>T3E isn't a vector computer.
>>>
>>>The processor used (alpha) was out of order, yet achieves the same main
>>>objective, that's executing more than 1 instruction a cycle effectively.
>>>
>>>Itanium2 is objectively seen is a vector processor as it executes 2 bundles at
>>>once. Though they call that IPF nowadays.
>>
>>That's not a vector architecture. A vector machine executes _one_ instruction
>>and produces a large number of results. For example, current cray vector boxes
>>can produce 128 results by executing a single instruction. That is why MIPS was
>>dropped and FLOPS became the measure for high-performance computing.
>
>IPF executes 2 bundles per cycle.
>
>1 bundle = 3 instructions in IPF
>
>You can see that as a vector.

Maybe _you_ can see that as a vector. No person familiar with computer architecture sees that as a vector. No architecture textbook calls that a vector. I stick to common definitions of words, not your privately twisted definitions that nobody can communicate with. Pick up a copy of Hennessy/Patterson's architecture book and look up "vector operations" in the index. VLIW is _not_ vector. "bundles" are _not_ vector.

>
>>
>>>
>>>All x86-64 which are taking over now are doing 3 instructions a cycle now and
>>>deliver up to 2 flops a cycle.
>>>
>>>>>Crays had usually when in vector like what was is 4 cpu's or so? Sometimes up to
>>>>>128. Above that it was T3E which had alpha's.
>>>>>
>>>>>that one used quadrics usually :)
>>>>>
>>>>>However look to France now. New great supercomputer. 8192 processors or so.
>>>>>Say 2048 nodes. You're looking at 3.6 TB per second bandwidth :)
>>>
>>>>For a synthetic benchmark, not a real code, that's the problem with clusters so
>>>>far...
>>>
>>>It's the speed the memory delivers to the cpu's.
>>>
>>>Nothing synthetic.
>>
>>Didn't think you would understand that. It is about "theoretical peak" vs
>>"sustained peak for real applications". The numbers are not the same.
>
>I know you will act innocent here.
>But you just have no idea of course.
>
>A 'cheapo' highend network card can get 800Mb/s.

Not if two machines try to talk to the same node it can't. Not if there is congestion in the router it can't. Not if there are multiple router hops between the two points it can't. Etc...

>
>>>
>>>>>Those Crays you remember were 100Mhz ones. Network could deliver of course
>>>>>exactly what cpu could calculate.
>>>>
>>>>There was no "network" and the crays were 500mhz although on a fully pipelined
>>>>vector machine that can do 5-10 operations per cycle that is not exactly a good
>>>>measure of performance.
>>>
>>>Operations doesn't count. Flops do.
>>>
>>>There is actually 1Ghz Cray here with 256KB cache.
>>
>>Don't know what cray you have there, but the cache is not normal cache. It is
>>only for "scalar" operations. Vector operations don't go through the scalar
>>cache on a Cray, because all memory reads and writes are pipelined and deliver
>>values every clock cycle after the latency delay for the first word.
>
>www.cray.com

What should I look for? I have the manuals for every vector machine they have ever produced/shipped in my office.

>
>>>
>>>>>Not so great if you look to the total number of Gflop it delivered. Nowadays the
>>>>>big clusters, as all big supercomputers nowadays are clusters, are measured in
>>>>>Tflop and one already in Pflop :)
>>>>>
>>>>>There is a 0.36 Pflop one now under construction :)
>>>>>
>>>>>Vincent
>>>
>>>>Different computer for different applications. Ask a real programmer which he
>>>>would rather write code for...
>>>
>>>They'll all pick the fastest machine, which is the one delivering most flops.
>>>
>>>Vincent
>>
>>No. Otherwise there would be no machines like the Cray, Fujitsu, Hitachi, etc
>>left. Using a large number of processors in a message-passing architecture is
>>not as easy as using a small number of faster processors in a shared-memory
>>architecture, for many applications.
>
>You confuse network cards with network cards.
>There is also network cards that
>allow DSM. See www.quadrics.com for details. They have RAM on chip. No need for
>message passing. You can read straight over the network remote from that RAM
>without needing a message in the remote cpu to be handled.

Please get into the world of standard definitions. Does not matter whether the remote CPU has to see the "message" or not. It is still "message passing". The VIA architecture used by our cLAN stuff supports that type of shared memory, but the two cards still send messages over the network router. Shared memory systems don't send "messages".

>
>Note all those highends are in serious financial trouble now.
>Probably several bankrupt soon. SGI right now can keep alive a bit thanks to
>intel giving them itanium2 cpu's for near free. Cray already has major problems.
>They are pretty expensive anyway. 50k dollar for 1 node is what i call
>expensive. 1 node has 12 cpu's (opterons).
>
>Just like about everything above 4 cpu's will die. (Highend) network cards take
>over.
>
>Programmers simply are cheaper than hardware, they will have to adapt to
>networks.
>
>Vincent

That last statement is so far beyond false it takes sunlight six years from the time it reaches "false" until it reaches that statement.

_any_ good CS book today _always_ contains the quote "In the 60's, the cost of developing software was _far_ exceeded by the cost of the hardware it was run on, so efficiency of the programming itself was paramount. Today the cost of developing the software far exceeds the cost of the system it runs on, so controlling the development cost is what software engineering is all about."

That was not just a wrong statement, it was a _grossly_ wrong statement. Pick up any good software engineering / software development text book, and learn something.

Who out there besides Vincent thinks hardware costs exceed software development costs in today's computing world???
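
As a sanity check on the T932 figures quoted earlier in the thread (4 reads plus 2 writes per cycle, 48 bytes per cycle, 2 ns cycle time, 32 processors), here is the arithmetic as a small Python sketch. The function and variable names are made up for illustration, and the inputs are the quoted figures taken at face value, not measurements. Multiplied out, those inputs give 24 GB/s per processor and 768 GB/s across 32 processors, which is higher than the "about 2.5" and "close to 100" figures in the quoted text.

```python
# Inputs are the figures quoted in the thread, taken at face value.
WORDS_PER_CYCLE = 4 + 2   # 4 words read + 2 words written per cycle
BYTES_PER_WORD = 8        # 6 words = 48 bytes, as stated in the thread
CYCLE_TIME_NS = 2         # 2 ns cycle time, i.e. 500M cycles per second
PROCESSORS = 32


def peak_bandwidth_gb_per_s(words_per_cycle, bytes_per_word,
                            cycle_time_ns, processors=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    Hypothetical helper: this is peak arithmetic only and says nothing
    about what a real application would sustain.
    """
    cycles_per_second = 1e9 / cycle_time_ns
    bytes_per_second = (words_per_cycle * bytes_per_word
                        * cycles_per_second * processors)
    return bytes_per_second / 1e9


per_cpu = peak_bandwidth_gb_per_s(WORDS_PER_CYCLE, BYTES_PER_WORD, CYCLE_TIME_NS)
total = peak_bandwidth_gb_per_s(WORDS_PER_CYCLE, BYTES_PER_WORD,
                                CYCLE_TIME_NS, PROCESSORS)

print(f"per processor: {per_cpu:.0f} GB/s")
print(f"{PROCESSORS} processors: {total:.0f} GB/s")
```

Note this is a peak number in the "theoretical peak" sense argued about above; it does not model congestion, router hops, or anything else that separates peak from sustained bandwidth.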
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.