Computer Chess Club Archives



Subject: Re: Cray

Author: Robert Hyatt

Date: 13:36:50 07/10/03



On July 09, 2003 at 19:10:01, Vincent Diepeveen wrote:

>On July 09, 2003 at 15:57:36, Robert Hyatt wrote:
>
>>On July 09, 2003 at 00:09:03, Vincent Diepeveen wrote:
>>
>>>On July 08, 2003 at 19:37:48, Jeremiah Penery wrote:
>>>
>>>>On July 08, 2003 at 08:37:49, Vincent Diepeveen wrote:
>>>>
>>>>>On July 08, 2003 at 00:33:09, Jeremiah Penery wrote:
>>>>>
>>>>>>NEC Earth Simulator has 5120 NEC SX-7(?) vector processors.  Total cost was less
>>>>>>than $400m.
>>>>>
>>>>>It cost around $680M.
>>>>
>>>>Provide a reference for that $680m number, and I might believe you.  I don't
>>>>accept random numbers without reference.
>>>>
>>>>Less than $400m is quoted at these sites:
>>>>http://www.mindfully.org/Technology/Supercomputer-Japanese23jul02.htm
>>>>http://www.siliconvalley.com/mld/siliconvalley/news/editorial/3709294.htm
>>>>http://www.time.com/time/2002/inventions/rob_earth.html
>>>>http://www-zeuthen.desy.de/~schoene/unter_texte/texte/sc2002/tsld004.htm
>>>>http://www.iht.com/articles/98820.html
>>>>http://cospa.phys.ntu.edu.tw/aapps/v12n2/v12-2n1.pdf
>>>>etc., etc.
>>>>
>>>>The highest price I've seen is around $500m, nowhere near your number.
>>>>
>>>>>>Here is a blurb about the chip, from the webpage:
>>>>>>
>>>>>>"Each AP consists of a 4-way super-scalar unit (SU), a vector unit (VU), and
>>>>>>main memory access control unit on a single LSI chip. The AP operates at a clock
>>>>>>frequency of 500MHz with some circuits operating at 1GHz. Each SU is a
>>>>>>super-scalar processor with 64KB instruction caches, 64KB data caches, and 128
>>>>>>general-purpose scalar registers. Branch prediction, data prefetching and
>>>>>>out-of-order instruction execution are all employed. Each VU has 72 vector
>>>>>>registers, each of which has 256 vector elements, along with 8 sets of six
>>>>>>different types of vector pipelines: addition/shifting, multiplication,
>>>>>>division, logical operations, masking, and load/store. The same type of vector
>>>>>>pipelines works together by a single vector instruction and pipelines of
>>>>>>different types can operate concurrently."
>>>>>>
>>>>>>Each chip consumes only about 140W, rather than Vincent's assertion of 150KW.
>>>>>
>>>>>the 125KW is for Cray 'processors', not the NEC processors that are in the
>>>>>Earth Simulator.
>>>>>
>>>>>Ask Bob; I remember he quoted 500 kilowatts for a 4-processor Cray, so I
>>>>>divided that by 4.
>>>>
>>>>That 500KW was probably for the entire machine.  Each processor probably
>>>
>>>Yes, a 4-processor Cray.
>>>
>>>Just for your own understanding of what a Cray is: it is NOT a processor.
>>>It is a big block of electronics put together, so no wonder it eats quite a
>>>bit more than the average CPU.
>>>
>>>That's why I say those power-consuming Crays are history. They are just too
>>>expensive in power, imho. If you consider that they run at 1GHz and can do
>>>like 29 instructions a clock with 256 KB cache, it is trivial why those
>>>matrix wonders are no longer a wonder.
>>>
>>>Opterons, Itaniums: you might call them expensive in power, but it is
>>>trivial that they are very fast compared to a Cray when you compare power
>>>consumption.
>>>
>>>A special water-cooling plant was typically used to cool those vector
>>>Crays. Bob can tell more about that; he had one at his university.
>>>
>>>>consumes a very small amount of that.  The Earth Simulator uses some 7MW of
>>>>power in total, though only about 10% comes from the processors.
>>>
>>>The typical supercomputer has fast i/o and big routers. Those always eat
>>>far more power than the CPUs.
>>>
>>>7 MW is nevertheless a hell of a lot.
>>>
>>>From a chess viewpoint the only interesting thing is the one-way ping-pong
>>>latency of the Earth Simulator on the big partitions, whether with MPI or
>>>OpenMP; it doesn't matter which. And of course not between processors near
>>>each other, but with some routers in between them ;)
>>>
>>>Another major difference with Cray machines (using Cray processor blocks)
>>>is that they typically do not use too many processors, because all
>>>processors are cross-connected with very fast links. No clever routing
>>>system at all. Brute force.
>>
>>Pure cross-bar, the best routing there is.
>>
>>
>>>
>>>If you want to make a supercomputer with big partitions of CPUs, you need
>>>somewhere a compression point where n CPUs funnel into a single bottleneck,
>>>connected by some kind of router or the specially designed NUMAflex (that's
>>>the very fast SGI interconnect they use to connect boxes of 64 processors
>>>to each other).
>>>
>>>Cray never accepted such bottlenecks. It was just raw vector power. If you
>>>consider *when* those machines were constructed, it was really a genius
>>>thing.
>>>
>>>It's only now that CPUs are so well designed and so highly clocked, with
>>>many instructions per clock, that those vector blocks can safely be
>>>replaced.
>>>
>>>Note I bet they still get used, because most scientists know shit about
>>>programming, and you can't blame them.
>>
>>Sorry, but a Cray will blow the doors off of _any_ microcomputer you care to
>>march up.  It can sustain a ridiculous number of operations per cycle.  I.e. it
>
>Gotta love your comparisons :)
>
>You show up with a Cray supercomputer and I may only bring something my
>hands can carry :)

Feel free to do so.  I'll take a T932 over _anything_ you can carry by
hand, no questions asked.


>
>I would prefer to show up with the nowadays 1440-processor, 3-gflops Teras
>though :)
>
>>is _easy_ on a single CPU to add two 64-bit floats, multiply the sum by
>>another 64-bit float, and add that to another 64-bit float. And I can do all
>>of that, two results per clock cycle, _forever_.
>>
>>You have to understand vector processing first, to understand the power of a
>>Cray.  Until you grasp that, you are talking nonsense.
>
>>>
>>>Today I spoke with someone who runs jobs a lot. What he calls a small job
>>>is a calculation on 24 processors that runs for 20 hours just doing
>>>floating point calculations.
>>>
>>>His software has already run for like 20 years on supercomputers.
>>>
>>>There are however some major differences between today and back then;
>>>that's why we spoke. I had promised to help him speed it up.
>>>
>>>What he is doing is that each processor reads its data from huge
>>>3-dimensional arrays.
>>>
>>>Those arrays are, however, all allocated by the first thread that starts.
>>>
>>>So imagine that 1 poor node is serving all that bandwidth for the machine,
>>>and that each cache line from there takes like 5 microseconds or so to
>>>arrive.
>>>
>>>Then he can do 16 calculations per line (cache line length of 128 bytes
>>>divided by a double size of 8 bytes). That's sick expensive.
>>>
>>>His software can be sped up *quite* a lot.
>>>
>>>Of course in the past he also ran this software on Crays (nowadays it's in
>>>C; previously it was in Fortran).
>>>
>>>They just do not know the bottlenecks of today's supercomputers.
>>>
>>>That's why the Cray was a great thing for them, and they will always
>>>remember it for that.
>>>
>>>Because if you have 16 or so processors with shared memory, and for every
>>>processor a lookup in that memory is equally fast, then it is trivial that
>>>this program, which definitely is a good example of how many programs still
>>>are, can be sped up like 20 times easily on this SGI supercomputer.
>>>
>>>Yet the brute force of the Cray doesn't make that distinction. So the Cray
>>>is even greater when you consider the average guy who has to do
>>>calculations on those machines.
>>>
>>>Up till recently more than 50% of the total system time went to
>>>researchers doing physics (if that's the right English word): calculation
>>>of models, oil simulations, and bunches of known algorithms and unknown new
>>>ones that get tried on big matrices.
>>
>>False.  They are used to design other microprocessors.  Apple owns several.
>>They are used for weather forecasting.  Simulations.  _anything_ that requires
>>incredibly high operations per second on large data arrays.  NUMA just doesn't
>>cut it for many such applications, and message-passing is worse.
>>_that_ is the "world of the Crays" and they are untouched there.
>
>I'm not sure about the microprocessor designs; we can ask AMD and Intel
>about it. Apple doesn't produce microprocessors at all. They use IBM
>processors nowadays, and before IBM they used Motorola.

Apple produces _machines_.  They do circuit layout and testing on a Cray.

>
>However, about the weather forecasting: guess why the 1024-processor machine
>was overloaded with weather guys from December 2002 till the end of Gulf War
>II :)
>
>It was like this: on average 400 CPUs got used, up until December. Then
>suddenly a bang on the machine. When I checked out which dudes were
>preventing me from doing a few tests, I knew it was going to be war soon.
>
>Weather guys LOVE memory. For them vector processing isn't as important as a
>huge memory.

They are related.  Vector processing lets you _use_ "huge memory" efficiently.


>
>I remember a self-employed weather guy from some 7 years ago who managed to
>lay his hands on an outdated Sun machine with 2 processors. He was in
>seventh heaven. I asked him why he was so happy with those dusty CPUs, and
>he explained that he didn't care about the CPUs but about the 2 GB of memory
>inside :)

Crays don't come with 2 gigs of memory.  The T90 typically has 16-32 gigs.

>
>>>
>>>In this case it was field calculations. Most of the researchers are
>>>already so happy that they can run in parallel on a machine that we'll
>>>forgive them for doing some stuff wrong.
>>>
>>>In all cases they draw the conclusion that the CPU is eating up the system
>>>time, because even if your program is 99% busy fetching cache lines from
>>>some remote node, 'top' shows the processes busy 99.xx% of the system
>>>time.
>>>
>>>let's quote Seymour Cray:
>>>  "If you were plowing a field, which would you rather use?
>>>   Two strong oxen or 1024 chickens?"
>>>
>>>It's trivial that only the best programmers on the planet can go for those
>>>1024 chickens.
>>>
>>
>>And for a good programmer, those two oxen are going to win the race.
>>
>>
>>>>>Obviously, Cray machines using Opterons will consume less than that.
>>>>>Note that the CPU cost is nothing compared to what the routers etc. eat.
>>>>
>>>>Of course.





Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.