Computer Chess Club Archives



Subject: Re: Cray

Author: Vincent Diepeveen

Date: 16:10:01 07/09/03

On July 09, 2003 at 15:57:36, Robert Hyatt wrote:

>On July 09, 2003 at 00:09:03, Vincent Diepeveen wrote:
>
>>On July 08, 2003 at 19:37:48, Jeremiah Penery wrote:
>>
>>>On July 08, 2003 at 08:37:49, Vincent Diepeveen wrote:
>>>
>>>>On July 08, 2003 at 00:33:09, Jeremiah Penery wrote:
>>>>
>>>>>NEC Earth Simulator has 5120 NEC SX-7(?) vector processors.  Total cost was less
>>>>>than $400m.
>>>>
>>>>It cost around $680M.
>>>
>>>Provide a reference for that $680m number, and I might believe you.  I don't
>>>accept random numbers without reference.
>>>
>>>Less than $400m is quoted at these sites:
>>>http://www.mindfully.org/Technology/Supercomputer-Japanese23jul02.htm
>>>http://www.siliconvalley.com/mld/siliconvalley/news/editorial/3709294.htm
>>>http://www.time.com/time/2002/inventions/rob_earth.html
>>>http://www-zeuthen.desy.de/~schoene/unter_texte/texte/sc2002/tsld004.htm
>>>http://www.iht.com/articles/98820.html
>>>http://cospa.phys.ntu.edu.tw/aapps/v12n2/v12-2n1.pdf
>>>etc., etc.
>>>
>>>The highest price I've seen is around $500m, nowhere near your number.
>>>
>>>>>Here is a blurb about the chip, from the webpage:
>>>>>
>>>>>"Each AP consists of a 4-way super-scalar unit (SU), a vector unit (VU), and
>>>>>main memory access control unit on a single LSI chip. The AP operates at a clock
>>>>>frequency of 500MHz with some circuits operating at 1GHz. Each SU is a
>>>>>super-scalar processor with 64KB instruction caches, 64KB data caches, and 128
>>>>>general-purpose scalar registers. Branch prediction, data prefetching and
>>>>>out-of-order instruction execution are all employed. Each VU has 72 vector
>>>>>registers, each of which has 256 vector elements, along with 8 sets of six
>>>>>different types of vector pipelines: addition/shifting, multiplication,
>>>>>division, logical operations, masking, and load/store. The same type of vector
>>>>>pipelines works together by a single vector instruction and pipelines of
>>>>>different types can operate concurrently."
>>>>>
>>>>>Each chip consumes only about 140W, rather than Vincent's assertion of 150KW.
>>>>
>>>>The 125KW figure is for Cray 'processors', not for the NEC processors that
>>>>are in the Earth Simulator.
>>>>
>>>>Ask Bob; I remember he quoted 500 kilowatts for a 4-processor Cray, so I
>>>>divided that by 4.
>>>
>>>That 500KW was probably for the entire machine.  Each processor probably
>>
>>Yes, a 4-processor Cray.
>>
>>Just for your own understanding of what a Cray is: it is NOT a single
>>processor. It is a big block of electronics put together, so no wonder it
>>eats quite a bit more power than the average CPU.
>>
>>That's why I say those power-hungry Crays are history: they are just too
>>expensive in power, IMHO. If you then consider that they run at 1GHz and can
>>do something like 29 instructions per clock with 256KB of cache, it is
>>obvious why those matrix wonders are no longer a wonder.
>>
>>Opterons, Itaniums: you might call those power-hungry too, but compared to a
>>Cray they are very fast for the power they consume.
>>
>>A special water-cooling plant was typically needed to cool those vector
>>Crays. Bob can tell you more about that; he had one at his university.
>>
>>>consumes a very small amount of that.  The Earth Simulator uses some 7MW of
>>>power in total, though only about 10% comes from the processors.
>>
>>The typical supercomputer has fast I/O and big routers, and those always eat
>>considerably more power than the CPUs.
>>
>>Still, 7 MW is a hell of a lot.
>>
>>From a chess viewpoint the only interesting number is the one-way ping-pong
>>latency of the Earth Simulator on the big partitions, whether they work with
>>MPI or OpenMP; it doesn't matter which, of course. And of course not between
>>processors near each other, but with some routers in between them ;)
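
(For reference, this is the kind of ping-pong microbenchmark I mean: a minimal
sketch with an assumed 8-byte payload, not Earth Simulator code. Run the two
ranks on nodes far apart so the message has to cross routers.)

    /* MPI ping-pong: half the averaged round-trip time is the one-way
       latency. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, i, iters = 10000;
        char buf[8] = {0};                      /* assumed tiny payload */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }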
>>
>>Another major difference with Cray machines (built from Cray processor
>>blocks) is that they typically don't use too many processors, because all
>>the processors are cross-connected with very fast links. No clever routing
>>system at all. Brute force.
>
>Pure cross-bar, the best routing there is.
>
>
>>
>>If you want to build a supercomputer that has big partitions of CPUs, you
>>need a concentration point somewhere, where n CPUs funnel into a single
>>bottleneck, joined by some kind of router or the specially designed NUMAflex
>>(that's the very fast SGI interconnect they use to connect boxes of 64
>>processors to each other).
>>
>>Cray never accepted such bottlenecks. It was just raw vector power. If you
>>consider *when* those machines were built, it was really the work of a
>>genius.
>>
>>It's only now that CPUs are so well designed and so highly clocked, with
>>many instructions per clock, that those vector blocks can be replaced
>>safely.
>>
>>Note that I bet they still get used, because most scientists don't know shit
>>about programming, and you can't blame them.
>
>Sorry, but a Cray will blow the doors off of _any_ microcomputer you care to
>march up.  It can sustain a ridiculous number of operations per cycle.  IE it

Gotta love your comparisons :)

You show up with a Cray supercomputer and I may only bring something my hands
can carry :)

I would prefer to show up with the present-day 1440-processor, 3-gflop Teras,
though :)

>is _easy_ on a single CPU to add two 64 bit floats, multiply the sum by
>another 64 bit float, add that to another 64 bit float.  And I can do all of
>that, two results per clock cycle, _forever_.
>
>You have to understand vector processing first, to understand the power of a
>Cray.  Until you grasp that, you are talking nonsense.
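
The kind of loop Bob means looks something like this (a rough sketch with
made-up array names, not actual Cray code). On a vector machine the add,
multiply and add pipelines chain, so once they are full a result streams out
every clock:

    /* Chained vector kernel: r[i] = (a[i] + b[i]) * c[i] + d[i].
       A vectorizing compiler maps this loop onto chained pipelines. */
    #include <stdio.h>

    #define N 1024                              /* assumed vector length */

    int main(void) {
        static double a[N], b[N], c[N], d[N], r[N];
        for (int i = 0; i < N; i++) {           /* sample data */
            a[i] = i; b[i] = 2.0 * i; c[i] = 0.5; d[i] = 1.0;
        }
        for (int i = 0; i < N; i++)             /* the vectorizable loop */
            r[i] = (a[i] + b[i]) * c[i] + d[i];
        printf("r[10] = %f\n", r[10]);
        return 0;
    }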

>>
>>Today I spoke with someone who runs a lot of jobs. What he calls a small job
>>is a calculation on 24 processors that runs for 20 hours doing nothing but
>>floating-point work.
>>
>>His software has been running on supercomputers for something like 20 years.
>>
>>There are, however, some major differences between now and back then; that's
>>why we spoke. I had promised to help him speed it up.
>>
>>What he does is work with huge 3-dimensional arrays that every processor
>>gets its data from.
>>
>>Those, however, are all allocated by the first thread that starts, so the
>>memory ends up on that one thread's node.
>>
>>So imagine that one poor node has to serve all that traffic for the whole
>>machine, and that each cache line fetched from it takes something like 5
>>microseconds to arrive.
>>
>>For each such line he gets 16 values to compute with (cache line length of
>>128 bytes divided by the 8-byte size of a double = 16). That's insanely
>>expensive.
>>
>>His software can be sped up *quite* a lot.
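
(The obvious fix, and roughly what I promised to help with: initialize the
arrays in parallel so that, with first-touch placement, each page lands on the
node that will later use it. A minimal sketch with an assumed array size and
OpenMP, not his actual code:)

    /* First-touch placement: each thread touches the slice it will later
       compute on, so those pages get allocated on its own node. */
    #include <stdlib.h>

    #define N (1L << 24)                        /* assumed array size */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        if (!a) return 1;
    #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;                 /* first touch: page goes local */
    #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = a[i] * 2.0 + 1.0;    /* compute now hits local memory */
        free(a);
        return 0;
    }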
>>
>>Naturally he also ran this software on Crays in the past (nowadays it's in
>>C; previously it was in Fortran).
>>
>>People like him just don't know the bottlenecks of today's supercomputers.
>>
>>That's why the Cray was a great thing for them, and they will always
>>remember it for that.
>>
>>Because if you have 16 or so processors with shared memory, where a lookup
>>in that memory is equally fast for every processor, then it's obvious that
>>this program, which is definitely a good example of how many programs still
>>look, can easily be sped up some 20 times on this SGI supercomputer.
>>
>>Yet the brute force of the Cray makes no such distinction: its memory is
>>uniformly fast for everyone. So the Cray is even greater when you consider
>>the average guy who has to do calculations on those machines.
>>
>>Until recently more than 50% of the total system time went to researchers
>>doing physics (if that's the right English word): calculation of models, oil
>>simulations, and bunches of known algorithms plus unknown new ones, all
>>tried out on huge matrices.
>
>False.  They are used to design other microprocessors.  Apple owns several.
>They are used for weather forecasting.  Simulations.  _anything_ that requires
>incredibly high operations per second on large data arrays.  NUMA just doesn't
>cut it for many such applications, and message-passing is worse.
>_that_ is the "world of the Crays" and they are untouched there.

I'm not sure about the microprocessor designs; we can ask AMD and Intel about
that. Apple doesn't produce microprocessors at all: they use IBM processors
nowadays, and before IBM they used Motorola.

As for the weather forecasting: guess why the 1024-processor machine was
overloaded with weather guys from December 2002 until the end of Gulf War II :)

It was like this: on average, 400 CPUs were in use up until December. Then
suddenly the machine got slammed. When I checked out which dudes were keeping
me from running a few tests, I knew it was going to be war soon.

Weather guys LOVE memory. For them vector processing isn't as important as a
huge memory.

I remember a weather guy some 7 years ago who, self-employed, managed to lay
his hands on an outdated Sun machine with 2 processors. He was over the moon.
When I asked him why he was so happy with those dusty CPUs, he explained that
he didn't care about the CPUs but about the 2GB of memory inside :)

>>
>>In this case it was field calculations. Most researchers are already so
>>happy that they can run in parallel on a machine at all that we'll forgive
>>them for doing some things wrong.
>>
>>In all cases they conclude that the CPU is eating up the system time,
>>because even if your program spends 99% of its time fetching cache lines
>>from some remote node, 'top' still shows the processes busy for 99.xx% of
>>the system time.
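
You can see this for yourself with a pointer-chasing loop (a sketch with
made-up sizes): nearly every cycle is spent waiting on memory, yet 'top'
reports the process as about 100% busy.

    /* Pointer chase: every load depends on the previous one, so the CPU
       stalls on memory latency while looking 100% busy in 'top'. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 22)        /* assumed working set, far beyond cache */

    int main(void) {
        size_t *next = malloc(N * sizeof(size_t));
        if (!next) return 1;
        for (size_t i = 0; i < N; i++)
            next[i] = i;
        /* Sattolo shuffle: one big cycle, which also beats the prefetcher.
           (rand() is a crude index source, but fine for a sketch.) */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t p = 0;
        for (long k = 0; k < 100000000L; k++)
            p = next[p];            /* one dependent load per step */
        printf("%zu\n", p);         /* keeps the loop from being removed */
        free(next);
        return 0;
    }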
>>
>>Let's quote Seymour Cray:
>>  "If you were plowing a field, which would you rather use?
>>   Two strong oxen or 1024 chickens?"
>>
>>Obviously only the best programmers on the planet can get real work out of
>>those 1024 chickens.
>>
>
>And for a good programmer, those two oxen are going to win the race.
>
>
>>>>Obviously, Cray machines using Opterons will consume less than that. Note
>>>>that the CPUs' power cost is nothing compared to what the routers etc. eat.
>>>
>>>Of course.


