Computer Chess Club Archives



Subject: Re: Cray

Author: Vincent Diepeveen

Date: 12:14:22 07/12/03


On July 10, 2003 at 21:34:04, Jeremiah Penery wrote:

>On July 10, 2003 at 15:59:42, Vincent Diepeveen wrote:
>
>>On July 09, 2003 at 19:21:52, Jeremiah Penery wrote:
>>
>>>On July 09, 2003 at 08:25:39, Vincent Diepeveen wrote:
>>>
>>>>Nevertheless this machine is record-breaking and will always be remembered
>>>>for that. Assuming it is designed for big vectors, its latency is probably
>>>>quite a bit worse, because if you optimize for huge transfers at once, a
>>>>single small transfer is probably very pricey.
>>>>
>>>>So let's ignore the latency question; it simply wasn't designed for that.
>>>
>>>Instead of just guessing, why don't you go look it up? The information is
>>>widely available.
>>>
>>>Here - http://www.sc.doe.gov/ascr/dongarra.pdf - the MPI_PUT latency is listed
>>>as 6.63us.  Everywhere else I've seen lists under 10us, with most being much
>>>closer to 5us.
>>
>>When everything goes through some central bottleneck you can never avoid such
>>huge latencies. A supercomputer like the Earth machine that can do OpenMP up
>>to 2048 processors (if I interpret the data in that pdf correctly), running at
>>500MHz with 16 instructions a clock (therefore called a vector processor),
>
>No, it's called a vector processor because the vector unit uses 72 vector
>registers, each holding 256 64-bit values, with multiple sets of vector units
>(each with 6 instruction pipelines) designed to operate on them.  Being called a
>'vector processor' has absolutely nothing to do with the ability to do 16
>instructions/clock.
>
>BTW, the vector parts of the chip operate at 1GHz, from what I can tell.  The
>scalar part is 500MHz.
>
>>which it actually achieves, at 8 gflops, is really a great machine for the
>>matrix guys. Most likely that 6.63us latency is for huge blocks of data, as
>>they achieve 12.xxGB a second with it.
>>
>>Note that MPI_PUT is a one-way function. It isn't *waiting* for data to come
>>back.
>
>There are a *lot* of other PDF, PPT, and HTML documents that give slightly
>different figures.
>
>How about this one:  Inter-node MPI communication - Latency  8.6us

That's probably the one-way latency. If so, you need to multiply it by 2 to get
the time needed to fetch a position for a chess program.

Say 17.2us. Sounds about right.

Note that IBM clusters as delivered nowadays have 5-7us one-way latency; those
are clusters optimized for latency. So that's around 10us per cache line. Very
good latency!

See http://wwwbode.cs.tum.edu/~gerndt/home/Research/PADC2002/Talks/Kerbyson.pdf
and several other sites.
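
To make the one-way versus round-trip point concrete, here is a minimal MPI
sketch of probing one remote transposition-table entry (my own illustration in
generic MPI-2 one-sided calls; the HashEntry layout and the probe_remote name
are made up, this is not code from DIEP or from the Earth Simulator):

  #include <mpi.h>
  #include <stdint.h>

  /* A made-up 16-byte transposition-table entry. */
  typedef struct { uint64_t key; int32_t score; int32_t depth; } HashEntry;

  /* Fetch entry 'index' from the table that node 'owner' exposes via 'win'.
     The lock/get/unlock sequence is a full round trip: it cannot return
     before the remote data arrives, so it costs ~2x the one-way latency. */
  HashEntry probe_remote(MPI_Win win, int owner, MPI_Aint index)
  {
      HashEntry e;
      MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
      MPI_Get(&e, (int)sizeof e, MPI_BYTE, owner,
              index * (MPI_Aint)sizeof(HashEntry),
              (int)sizeof e, MPI_BYTE, win);
      MPI_Win_unlock(owner, win);  /* blocks until the data has landed */
      return e;
  }

An MPI_Put on the same window, by contrast, can complete locally long before
the data arrives on the other side, which is why quoted put latencies always
look flattering compared to what a hashtable probe really costs.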
>
>I see "bi-directional MPI communication latency" listed at 8us here:
>http://www.lanl.gov/orgs/ccn/salishan2003/pdf/kerbyson.pdf
>
>Here:
>http://camelback-comparch.com/Scalable%20MicroSupercomputers%20Presentation.pdf
>I see MPI latency listed at 5.6us.



>>However, considering the circumstances and the design of the stuff out there,
>>that really isn't interesting. What is interesting is that they can get 12.xx
>>GB bandwidth with MPI_PUT.
>>
>>This stuff is not designed for chess programs.
>
>No, but neither was the Pentium4.
>
>>So if we are busy just fetching random cache lines of say 128 bytes at most,
>>then the latency will be more like 20us on this machine. That's not nice to
>>say, however, as it is not designed for this.
>
>You're making up numbers again.
>
>>It is designed to put 12.8GB a second through the central router with MPI_PUT,
>>and that is an incredible achievement for node-to-node transfers.
>
>I'm not claiming that this thing has the lowest latency of anything in the
>world.  I'm only saying that it is very low latency, relative to other very
>large machines.

The latency of the Earth machine, considering its number of processors and the
fact that it is a vector machine, is very good.
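
To put the bandwidth-versus-latency point in numbers, here is a tiny
back-of-the-envelope model (my own sketch; the 8.6us and 12.8GB/s figures are
simply the ones quoted above in this thread):

  #include <stdio.h>

  /* Cost of fetching one random 128-byte cache line remotely:
     a round trip of 2x the one-way latency plus the wire time. */
  int main(void)
  {
      double one_way_us  = 8.6;      /* quoted inter-node MPI latency */
      double bytes_per_s = 12.8e9;   /* quoted node-to-node bandwidth */
      double line_bytes  = 128.0;    /* one hashtable probe           */
      printf("round trip: %.1f us\n", 2.0 * one_way_us);               /* 17.2 */
      printf("wire time : %.2f us\n", line_bytes / bytes_per_s * 1e6); /* 0.01 */
      return 0;
  }

So for a chess program probing random entries, practically the whole cost is
the ~17us round trip; the huge bandwidth contributes about 0.01us per probe
and is basically irrelevant.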

>I don't think that a properly vectorized chess program would
>scale all that badly, even up to the maximum number of processors, because you

Cray Blitz: 16 vector processors (Cray) clocked at 100MHz.

In total 500k nps, i.e. only about 31k nps per processor.

Was that a very good vectorizing job by Bob, in your opinion?
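
For reference, a vector unit only earns its keep on long, branch-free,
independent loops, something like this generic sketch (my own illustration,
not code from Cray Blitz or DIEP):

  /* The kind of loop a vectorizing compiler can stream through long
     vector registers: one long pass, no branches, and every iteration
     independent of the previous one. */
  void saxpy(int n, float a, const float *x, float *y)
  {
      int i;
      for (i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }

A chess tree search is the opposite: short, branchy, data-dependent code, so
there is very little for the vector pipes to chew on.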

>can load a lot of memory into the vector registers and use longer loops to hide
>remote memory access.  I'd guess 10% efficiency would be attainable.  But of
>course, that is only a guess, and impossible to prove right or wrong.

Obviously I am still trying to improve the speedup efficiency of DIEP.
Mentioning a percentage now would mean that number keeps getting quoted even
after I achieve a better efficiency. Also note that current tests run on
partitions which are not entirely reserved for me; there are usually another
150 users on them.

Some software running on the machine, which I will be busy helping again after
the holidays to run better, is eating up all the bandwidth.

On the other partition a 350,000-hour, 10TB database is being created to
predict the extreme peaks in the weather for the coming x years.

So obviously nearly all the bandwidth is gone, and memory from your nodes gets
used for the weather guys ;)

So if you'll excuse me now, I will be more detailed in about four months.







