Computer Chess Club Archives



Subject: Re: Cray

Author: Robert Hyatt

Date: 10:14:28 07/11/03


On July 10, 2003 at 21:34:04, Jeremiah Penery wrote:

>On July 10, 2003 at 15:59:42, Vincent Diepeveen wrote:
>
>>On July 09, 2003 at 19:21:52, Jeremiah Penery wrote:
>>
>>>On July 09, 2003 at 08:25:39, Vincent Diepeveen wrote:
>>>
>>>>Nevertheless this machine is record-breaking and will always be remembered for
>>>>that. Assuming it is designed for big vectors, its latency is probably quite a
>>>>bit worse, because if you optimize for huge transfers at once, then a single
>>>>transfer is probably very pricey.
>>>>
>>>>So let's ignore the latency question; it simply wasn't designed for that.
>>>
>>>Instead of just guessing, why don't you go look it up?  Information is widely
>>>available.
>>>
>>>Here - http://www.sc.doe.gov/ascr/dongarra.pdf - the MPI_PUT latency is listed
>>>as 6.63us.  Everywhere else I've seen lists under 10us, with most being much
>>>closer to 5us.
>>
>>When it goes through some central bottleneck you can never avoid such huge
>>latencies. For a supercomputer that can do OpenMP up to 2048 processors, like the
>>Earth machine (if I interpret the data in that pdf correctly), at 500MHz with 16
>>instructions a clock (therefore called a vector processor), which can actually be
>
>No, it's called a vector processor because the vector unit uses 72 vector
>registers, each holding 256 64-bit values, with multiple sets of vector units
>(each with 6 instruction pipelines) designed to operate on them.  Being called a
>'vector processor' has absolutely nothing to do with the ability to do 16
>instructions/clock.
>
>BTW, the vector parts of the chip operate at 1GHz, from what I can tell.  The
>scalar part is 500MHz.
>
>>achieved at 8 GFLOPS, it is really a great machine for the matrix guys.
>>Most likely that 6.63us latency is for huge lines of data, as they achieve
>>12.xx GB a second with it.
>>
>>Note that MPI_PUT is a one-way function. It isn't *waiting* for data to get
>>back.
>
>There are a *lot* of other PDF, PPT, and HTML documents that give slightly
>different figures.
>
>How about this one:  Inter-node MPI communication - Latency  8.6us
>
>http://wwwbode.cs.tum.edu/~gerndt/home/Research/PADC2002/Talks/Kerbyson.pdf and
>at several other sites.
>
>I see "bi-directional MPI communication latency" listed at 8us here:
>http://www.lanl.gov/orgs/ccn/salishan2003/pdf/kerbyson.pdf
>
>Here:
>http://camelback-comparch.com/Scalable%20MicroSupercomputers%20Presentation.pdf
>I see MPI latency listed at 5.6us.
>
>>However, if we consider the circumstances and the design of the stuff out there,
>>that really isn't interesting. What is interesting is that they can get 12.xx GB
>>of bandwidth with MPI_PUT.
>>
>>This stuff is not designed for chess programs.
>
>No, but neither was the Pentium4.
>
>>So if we are busy with just getting random cache lines of, say, 128 bytes at most,
>>then the latency will be more like 20us on this machine. That's not nice to say,
>>however, as it is not designed for this.
>
>You're making up numbers again.
>
>>It is designed to put 12.8GB a second through the central router with MPI_PUT,
>>and that is an incredible achievement for node-to-node communication.
>
>I'm not claiming that this thing has the lowest latency of anything in the
>world.  I'm only saying that it is very low latency, relative to other very
>large machines.  I don't think that a properly vectorized chess program would
>scale all that badly, even up to the maximum number of processors, because you
>can load a lot of memory into the vector registers and use longer loops to hide
>remote memory access.  I'd guess 10% efficiency would be attainable.  But of
>course, that is only a guess, and impossible to prove right or wrong.

You are absolutely wasting your breath.  I've tried to explain vector processing
to Vincent _many_ times.  For example, he said "you can't do good mobility on a
Cray."  I explained to him how I did it with the vector mask / vector merge
instructions, so that I could compute mobility as a weighted sum over the squares
attacked, where each square has a different "mobility value" depending on how
useful the square is.
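
A minimal sketch of that idea in C, assuming a hypothetical attacked[] flag array
and weight[] table (neither is from Cray Blitz itself): the branch-free
merge-and-sum form below is the shape that vector mask / vector merge hardware,
or a vectorizing compiler, turns into a couple of short vector operations.

  /* Hypothetical illustration, not the Cray Blitz code: mobility as a weighted
     sum over attacked squares.  attacked[i] is 1 if square i is attacked and 0
     otherwise; weight[i] is that square's "mobility value".  The conditional
     merge and the sum reduction both vectorize cleanly. */
  int mobility(const int attacked[64], const int weight[64])
  {
      int i, score = 0;
      for (i = 0; i < 64; i++)
          score += attacked[i] ? weight[i] : 0;  /* merge under mask, then sum */
      return score;
  }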

It is doable.  I did it.  But until he understands what vectors are all about,
trying to convince him of anything relative to vector processing is hopeless.

He thinks that vectors are simply a short-cut way to do this:

for (i = 0; i < n; i++)
  a[i] = b[i] * c[i];

In reality they can do so much more, but they require re-thinking everything,
just as bitboards do.  Harry and I vectorized Cray Blitz over a 10+ year period,
continually finding new ways to do things that were slower on traditional
hardware, but which absolutely screamed on vector hardware.  The hash table
probe was one such thing.  I could do 8 probes (and I did) in no more time than
it took to do one, but I had to think about it for a while to produce a good
algorithm that would vectorize properly.
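
As a rough sketch of how such an 8-wide probe can vectorize (a hypothetical
HashEntry layout and an eight-adjacent-slot indexing scheme, not the actual
Cray Blitz format): the fixed-length loop below has no loop-carried dependence,
so the eight gathers and key compares can issue as short vector operations
instead of eight scalar memory round trips.

  #include <stdint.h>

  /* Hypothetical illustration only. */
  typedef struct { uint64_t key; int32_t score; } HashEntry;

  int probe8(const HashEntry *table, uint64_t mask, uint64_t key, int *score)
  {
      int hit[8], val[8], i;
      for (i = 0; i < 8; i++) {                   /* fixed trip count: vectorizes */
          uint64_t idx = (key + (uint64_t)i) & mask;
          hit[i] = (table[idx].key == key);       /* compares done under a mask   */
          val[i] = table[idx].score;              /* gather all stored scores     */
      }
      for (i = 0; i < 8; i++)                     /* cheap scalar scan of 8 flags */
          if (hit[i]) { *score = val[i]; return 1; }
      return 0;
  }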

You may as well give up on the subject of "vector computing".  IMHO.

Vectors have just as much room for "trickiness" as bitboards.  That says a lot.
And since, as you have noticed, Vincent has no problem telling all of us what
bitboards can't do, you see the problem. :)


