Computer Chess Club Archives



Subject: Cray

Author: Vincent Diepeveen

Date: 21:09:03 07/08/03

On July 08, 2003 at 19:37:48, Jeremiah Penery wrote:

>On July 08, 2003 at 08:37:49, Vincent Diepeveen wrote:
>
>>On July 08, 2003 at 00:33:09, Jeremiah Penery wrote:
>>
>>>NEC Earth Simulator has 5120 NEC SX-7(?) vector processors.  Total cost was less
>>>than $400m.
>>
>>around $680M it cost.
>
>Provide a reference for that $680m number, and I might believe you.  I don't
>accept random numbers without reference.
>
>Less than $400m is quoted at these sites:
>http://www.mindfully.org/Technology/Supercomputer-Japanese23jul02.htm
>http://www.siliconvalley.com/mld/siliconvalley/news/editorial/3709294.htm
>http://www.time.com/time/2002/inventions/rob_earth.html
>http://www-zeuthen.desy.de/~schoene/unter_texte/texte/sc2002/tsld004.htm
>http://www.iht.com/articles/98820.html
>http://cospa.phys.ntu.edu.tw/aapps/v12n2/v12-2n1.pdf
>etc., etc.
>
>The highest price I've seen is around $500m, nowhere near your number.
>
>>>Here is a blurb about the chip, from the webpage:
>>>
>>>"Each AP consists of a 4-way super-scalar unit (SU), a vector unit (VU), and
>>>main memory access control unit on a single LSI chip. The AP operates at a clock
>>>frequency of 500MHz with some circuits operating at 1GHz. Each SU is a
>>>super-scalar processor with 64KB instruction caches, 64KB data caches, and 128
>>>general-purpose scalar registers. Branch prediction, data prefetching and
>>>out-of-order instruction execution are all employed. Each VU has 72 vector
>>registers, each of which can hold 256 vector elements, along with 8 sets of six
>>>different types of vector pipelines: addition/shifting, multiplication,
>>>division, logical operations, masking, and load/store. The same type of vector
>>>pipelines works together by a single vector instruction and pipelines of
>>>different types can operate concurrently."
>>>
>>>Each chip consumes only about 140W, rather than Vincent's assertion of 150KW.
>>
>>the 125KW is for Cray 'processors' not fujitsu processors that are in the NEC
>>machine.
>>
>>Ask bob i remember he quoted 500 kilowatt for a 4 processor Cray. So i divided
>>that by 4.
>
>That 500KW was probably for the entire machine.  Each processor probably

Yes, a 4-processor Cray.

Just for your own understanding of what a Cray is: it is NOT a single processor.
It is a big block of electronics put together, so it is no wonder it eats quite
a bit more power than the average CPU.

That's why I say those power-consuming Crays are history. They are just too
expensive in power, IMHO. If we then consider that they run at 1GHz and can do
something like 29 instructions, with 256KB of cache, it is obvious why those
matrix wonders are no longer a wonder.

Opterons, Itaniums: you might call them expensive in power too, but for the
power they consume they are very fast compared to a Cray.

A special water-cooling plant was typically used to cool those vector Crays.
Bob can tell you more about that; he had one at his university.

>consumes a very small amount of that.  The Earth Simulator uses some 7MW of
>power in total, though only about 10% comes from the processors.

The typical supercomputer has fast I/O and big routers. Those always consume
considerably more power than the CPUs.

7MW is nevertheless a hell of a lot.

From a chess viewpoint, the only interesting thing is the one-way ping-pong
latency of the Earth Simulator on the big partitions, measured with either MPI
or OpenMP; which of the two doesn't matter, of course. And of course not
between processors near each other, but with some routers in between them ;)
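
To give an idea, here is a minimal sketch of how such a one-way ping-pong
measurement is typically done with MPI (the iteration count and the choice of
rank 0 and the last rank as the two endpoints are my own illustration; on the
real machine you would place those two ranks so that routers sit between them):

  #include <mpi.h>
  #include <stdio.h>

  /* One-way ping-pong latency between rank 0 and the last rank.
     Run with at least 2 ranks; all other ranks just idle. */
  int main(int argc, char **argv)
  {
      int rank, size, i, iters = 10000;
      char byte = 0;
      double t0, t1;
      MPI_Status st;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int peer = size - 1;

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < iters; i++) {
          if (rank == 0) {
              MPI_Send(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
              MPI_Recv(&byte, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &st);
          } else if (rank == peer) {
              MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
              MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();

      if (rank == 0)  /* half the round trip = one-way latency */
          printf("one-way latency: %.2f microseconds\n",
                 (t1 - t0) / iters / 2.0 * 1e6);

      MPI_Finalize();
      return 0;
  }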

Another major difference with Cray machines (using Cray processor blocks) is
that they typically do not use too many processors, because all processors are
cross-connected with very fast links. No clever routing system at all; brute
force.

If you want to make a supercomputer with big partitions of CPUs, you need
somewhere a compression point where n CPUs funnel into a single bottleneck,
connected onward by some kind of router or by the specially designed NUMAflex
(that's the very fast SGI interconnect they use to connect boxes of 64
processors to each other).

Cray never accepted such bottlenecks. It was just raw vector power. If you
consider *when* those machines were constructed, it was really a work of
genius.

It's only now that CPUs are so well designed and so highly clocked, with many
instructions per clock, that those vector blocks can be replaced safely.

Note that I bet they still get used, because most scientists know next to
nothing about programming, and you can't blame them.

Today I spoke with someone who runs jobs a lot. What he calls a small job is a
calculation on 24 processors that runs for 20 hours just doing floating-point
work.

His software has already been running on supercomputers for something like 20 years.

There are, however, some major differences between today and back then; that's
why we spoke. I had promised to help him speed it up.

What he is doing is that each processor pulls its data out of huge
3-dimensional arrays.

Those are, however, all allocated by the first thread that starts.

So imagine: the memory of that one poor node has to serve the bandwidth demand
of the whole machine, and each cache line fetched from it takes something like
5 microseconds to arrive.

For that price he gets 16 values to calculate with (cache line length of 128
bytes divided by the 8-byte size of a double = 16 doubles). That's insanely
expensive.

His software can be sped up *quite* a lot.
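
To give an idea of the fix, here is a minimal sketch, assuming a NUMA machine
with a first-touch page placement policy and OpenMP (the array size is my own
illustration, not his real dimensions):

  #include <omp.h>
  #include <stdlib.h>

  #define N 256  /* illustrative dimension: 256^3 doubles = 128MB */

  /* Bad: thread 0 touches every page first, so under first-touch
     placement the whole array lands in one node's local memory,
     and every other node must fetch its cache lines remotely. */
  double *alloc_on_one_node(void)
  {
      long i, total = (long)N * N * N;
      double *a = malloc(total * sizeof(double));
      for (i = 0; i < total; i++)
          a[i] = 0.0;
      return a;
  }

  /* Better: each thread initializes the slice it will later
     compute on, so those pages are placed in its local memory. */
  double *alloc_distributed(void)
  {
      long i, total = (long)N * N * N;
      double *a = malloc(total * sizeof(double));
      #pragma omp parallel for schedule(static)
      for (i = 0; i < total; i++)
          a[i] = 0.0;
      return a;
  }

The compute loops then have to use the same static schedule, so that each
thread keeps working on the pages it placed.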

Naturally, in the past he also ran this software on Crays (nowadays it's in C;
previously it was in Fortran).

They just do not know the bottlenecks of today's supercomputers.

That's why the Cray was a great thing for them, and they will always remember
it for that.

Because if you have 16 processors or so with shared memory, and for every
processor a lookup in that memory is equally fast, then it is obvious that this
program, which is definitely a good example of how many programs still are
written, can easily be sped up some 20 times on this SGI supercomputer.

Yet the brute force of the Cray doesn't make such distinctions. So the Cray is
even greater when you consider the average guy who has to do calculations on
those machines.

Up until recently, more than 50% of the total system time went to researchers
doing physics: calculation of models, oil simulations, and bunches of known
algorithms and unknown new ones that get tried on big matrices.

In this case it was field calculations. Most of the researchers are already so
happy that they can run in parallel on a machine at all that we'll forgive them
for doing some things wrong.

In all cases they draw the conclusion that the CPU is eating up the system
time, because even if your program spends 99% of its time fetching cache lines
from some remote node, 'top' still shows the processes as busy 99.xx% of the
system time.
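
You can see this for yourself with a minimal sketch like the following (the
sizes are my own illustration): a pointer chase that misses the cache on nearly
every step still shows up in 'top' as ~100% CPU, because memory stalls are
counted as CPU time.

  #include <stdio.h>
  #include <stdlib.h>

  #define N (1 << 24)        /* 16M pointers = 128MB, way past any cache */
  #define STEPS 200000000UL

  /* Chase pointers through one big random cycle: nearly every
     step is a cache miss, so the CPU is stalled on memory most
     of the time, yet 'top' reports this process at ~100% CPU. */
  int main(void)
  {
      size_t i, j, tmp;
      size_t *next = malloc((size_t)N * sizeof(size_t));

      for (i = 0; i < N; i++)
          next[i] = i;
      srand(1);
      for (i = N - 1; i > 0; i--) {   /* Sattolo shuffle: one single cycle */
          j = (size_t)rand() % i;
          tmp = next[i]; next[i] = next[j]; next[j] = tmp;
      }

      for (j = 0, i = 0; i < STEPS; i++)
          j = next[j];

      printf("%zu\n", j);             /* keep the result live */
      free(next);
      return 0;
  }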

Let's quote Seymour Cray:
  "If you were plowing a field, which would you rather use?
   Two strong oxen or 1024 chickens?"

It's obvious that only the best programmers on the planet can manage those 1024
chickens.



>>Trivially Cray machines using the opterons will be consuming less than that.
>>Note that the cpu costs is nothing compared to what the routers etc eat.
>
>Of course.




