Author: Vincent Diepeveen
Date: 08:17:31 02/03/05
On February 02, 2005 at 11:54:46, Robert Hyatt wrote:

>On February 02, 2005 at 09:53:03, Vincent Diepeveen wrote:
>
>>On February 01, 2005 at 21:51:19, Robert Hyatt wrote:
>>
>>>On February 01, 2005 at 17:19:26, Vincent Diepeveen wrote:
>>>
>>>>On February 01, 2005 at 16:28:22, Robert Hyatt wrote:
>>>>
>>>>Still didn't read the subject title?
>>>>
>>>>[snip]
>>>>
>>>>>Because a cluster can't offer 1/100th the total memory bandwidth of a big Cray
>>>>>vector box.
>>>>
>>>>Actually todays clusters deliver a factor 1000 more or so.
>>>>
>>>>Total bandwidth a cluster can deliver is measured nowadays in Terabytes per
>>>>second, with Cray it was measured in gigabytes per second.
>>>
>>>Let's see. The last Cray I ran on with a chess program was a T932. Processor
>>>could read 4 words and write two words per cycle, cycle time was 2ns. So 6
>>>words, 48 bytes per cycle, x 500M cycles per second is about 2.5 gigabytes per
>>>second, x 32 processors is getting dangerously close to 100 gigabytes per
>>
>>Bandwidth a cpu at the old MIPS was 3.2 GB/s from memory (origin3000 series)
>>and bandwidth at altix3000 using network4 a cpu is 8.2 gigabyte per second from
>>memory.
>>
>>So what Cray streamed there was impressive for its days, but it delivered to
>>just a few cpu's, that was the entire main problem. This for massive power
>>consumption.
>>
>>What we speak of now is that you get effectively the same bandwidth from memory
>>to each cpu now, but systems go up to 130000+ processors.
>>
>>>second. A "cluster" can have more theoretical bandwidth, but rarely as much
>>>_real_ bandwidth. This is on a shared memory machine that can do real codes
>>>quite well.
>>
>>>>
>>>>Note it's the same network that gets used for huge Cray T3E's, but a newer and
>>>>bigger version, that's all.
>>>
>>>T3E isn't a vector computer.
>>
>>The processor used (alpha) was out of order, yet achieves the same main
>>objective, that's executing more than 1 instruction a cycle effectively.
>>
>>Itanium2 is objectively seen is a vector processor as it executes 2 bundles at
>>once. Though they call that IPF nowadays.
>
>That's not a vector architecture. A vector machine executes _one_ instruction
>and produces a large number of results. For example, current cray vector boxes
>can produce 128 results by executing a single instruction. That is why MIPS was
>dropped and FLOPS became the measure for high-performance computing.

IPF executes 2 bundles per cycle.

1 bundle = 3 instructions in IPF.

You can see that as a vector.

>
>>
>>All x86-64 which are taking over now are doing 3 instructions a cycle now and
>>deliver up to 2 flops a cycle.
>>
>>>>Crays had usually when in vector like what was is 4 cpu's or so? Sometimes up to
>>>>128. Above that it was T3E which had alpha's.
>>>>
>>>>that one used quadrics usually :)
>>>>
>>>>However look to France now. New great supercomputer. 8192 processors or so.
>>>>Say 2048 nodes. You're looking at 3.6 TB per second bandwidth :)
>>
>>>For a synthetic benchmark, not a real code, that's the problem with clusters so
>>>far...
>>
>>It's the speed the memory delivers to the cpu's.
>>
>>Nothing synthetic.
>
>Didn't think you would understand that. It is about "theoretical peak" vs
>"sustained peak for real applications". The numbers are not the same.

I know you will act innocent here. But you just have no idea of course.

A 'cheapo' highend network card can get 800Mb/s.

>>
>>>>Those Crays you remember were 100Mhz ones. Network could deliver of course
>>>>exactly what cpu could calculate.
>>>
>>>There was no "network" and the crays were 500mhz although on a fully pipelined
>>>vector machine that can do 5-10 operations per cycle that is not exactly a good
>>>measure of performance.
>>
>>Operations doesn't count. Flops do.
>>
>>There is actually 1Ghz Cray here with 256KB cache.
>
>DOn't know what cray you have there, but the cache is not normal cache. It is
>only for "scalar" operations. Vector operations don't go through the scalar
>cache on a Cray, because all memory reads and writes are pipelined and deliver
>values every clock cycle after the latency delay for the first word.

www.cray.com

>>
>>>>Not so great if you look to the total number of Gflop it delivered. Nowadays the
>>>>big clusters, as all big supercomputers nowadays are clusters, are measured in
>>>>Tflop and one already in Pflop :)
>>>>
>>>>There is a 0.36 Pflop one now under construction :)
>>>>
>>>>Vincent
>>
>>>Different computer for different applications. Ask a real programmer which he
>>>would rather write code for...
>>
>>They'll all pick the fastest machine, which is the one delivering most flops.
>>
>>Vincent
>
>No. Otherwise there would be no machines like the Cray, Fujitsu, Hitachi, etc
>left. Using a large number of processors in a message-passing architecture is
>not as easy as using a small number of faster processors in a shared-memory
>architecture, for many applications.

You confuse network cards with network cards.

There are also network cards that allow DSM. See www.quadrics.com for details.

They have RAM on chip. No need for message passing. You can read straight over
the network from that remote RAM, without the remote cpu needing to handle a
message.

Note all those highends are in serious financial trouble now. Probably several
will be bankrupt soon.

SGI right now can stay alive a bit thanks to Intel giving them Itanium2 cpu's
for near free.

Cray already has major problems. They are pretty expensive anyway. 50k dollars
for 1 node is what I call expensive. 1 node has 12 cpu's (Opterons).

Just about everything above 4 cpu's will die. (Highend) network cards take
over.

Programmers simply are cheaper than hardware; they will have to adapt to
networks.

Vincent
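The bandwidth figures argued over in this thread are all peak numbers of the form words per cycle x bytes per word x clock rate. A rough sketch of that arithmetic is below, using only the inputs as quoted in the posts (6 words of 8 bytes per cycle at 500M cycles per second, 32 processors; 3.6 TB/s over 8192 processors or 2048 nodes). The function and variable names are made up for illustration. With those inputs the T932 product works out to 24 GB/s per processor, which is higher than the "about 2.5 gigabytes" in the quoted text, so treat every number here as a theoretical peak and not as a sustained rate.

```python
# Back-of-envelope peak bandwidth, using the figures quoted in the thread.
# Assumption: 8-byte words, as implied by "6 words, 48 bytes per cycle".

def peak_bytes_per_second(words_per_cycle, bytes_per_word, clock_hz):
    """Theoretical peak: every cycle moves the full number of words."""
    return words_per_cycle * bytes_per_word * clock_hz

# Cray T932 as described: 4 reads + 2 writes per cycle, 2 ns cycle (500 MHz).
t932_per_cpu = peak_bytes_per_second(4 + 2, 8, 500_000_000)
t932_total = t932_per_cpu * 32  # 32 processors

# French cluster as described: 3.6 TB/s aggregate, 8192 cpus, 2048 nodes.
cluster_total = 3_600_000_000_000
cluster_per_cpu = cluster_total / 8192
cluster_per_node = cluster_total / 2048

for label, value in [
    ("T932 per cpu", t932_per_cpu),
    ("T932 total (32 cpus)", t932_total),
    ("cluster per cpu", cluster_per_cpu),
    ("cluster per node", cluster_per_node),
]:
    print(f"{label}: {value / 1e9:.2f} GB/s")
```

Aggregate and per-processor numbers answer different questions: the cluster total is larger, while the per-processor share is smaller, and neither says anything about what a real application sustains.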
Last modified: Thu, 15 Apr 21 08:11:13 -0700