Computer Chess Club Archives



Subject: But that PC with crafty would beat Cray Blitz though

Author: Vincent Diepeveen

Date: 21:26:15 02/12/03



On February 12, 2003 at 00:37:13, Robert Hyatt wrote:

>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>
>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>
>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>wrong.  Visit Cray Research, search for their CFT compiler (or their C
>>>compiler) and see if you can find some papers on their optimizing.
>>>They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
>>>way back up in the instruction stream so that values are available when needed,
>>>which is _exactly_ what your OOO approach is doing in the hardware.
>>
>>They must be doing this according to static branch prediction, which is maybe
>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just
>>a stream of instructions.
>>
>No, no, no.  They do much of this with 100% accuracy, because they make sure
>that the critical instructions are executed on _every_ path that reaches a
>critical point in the data-flow analysis of the program (the dependency graph,
>for gcc users)...
>
>BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
>manuals here at home) micro-ops in the reorder buffer.  No way to do any
>OOOE beyond that very narrow peephole, while the compiler can see _much_
>more, as much as it wants (and has the compile time) to look at...
>
>Someone posted an example of such code (I think it was Matt) showing
>Vincent how to dump branches.  That is the idea here.  The advantage of
>OOO execution is still there, but it is not as significant.  What remains is
>that the _real_ code doesn't get bigger, whereas when the compiler is busy
>doing these same kinds of optimizations it replicates various instructions
>to be sure they are completed by the time the DG says the result will be
>needed.  So there is a bit of memory savings when the processor does the OOO
>stuff, and there is the advantage of exposing more registers when the real
>instructions get turned into micro-ops...  but the latter is more a result of
>a horrible architecture (8 registers) than proof that OOO execution is a huge
>boon for other architectures that are not so register-challenged...
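
For readers who didn't see Matt's post, here is a minimal sketch of what
"dumping" a branch looks like (my own illustration, not Matt's actual code;
the function names and parameters are made up):

  /* branchy version: the hardware has to predict the comparison */
  int eval_branchy(int score, int margin, int bonus)
  {
      if (score > margin)
          return score + bonus;
      return score;
  }

  /* "dumped" version: the add is executed on every path, which is exactly
     the replication described above, and the ?: select maps naturally onto
     a conditional move instead of a branch */
  int eval_branchless(int score, int margin, int bonus)
  {
      int boosted = score + bonus;               /* hoisted above the decision */
      return (score > margin) ? boosted : score;
  }
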
>
>
>
>>>I would not say that either is particularly "better".  They are "different"
>>>with different approaches to the same problem.  The advantage of a ia64-type
>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>gets harder and harder to do it in an OOO architecture.  You end up with more
>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>that do the real computation.
>>
>>Is that causing a problem other than offending some people's sensibilities? The
>>EV8 was going to have 8 int ALUs and it would have been perfectly viable with
>>today's processes.
>
>Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
>int alus and an instruction package large enough to feed them all, I would
>consider the latter seriously...
>
>
>
>>
>>>Perhaps.  However the non-OOO Cray has always been right at the top of the
>>>overall performance heap, so that approach can fly as well and it has certainly
>>
>>I don't know much about Crays but a friend of mine told me that he ran some
>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any
>>Crays been built in the last several years using actual Cray processors?
>
>Your friend was nuts.  The first Cray-1 at a 12.5ns clock would blow off any
>486 ever made.  That machine could do 80 MIPS, which would not be bad for
>a 486.  But it could do 6-8 _operations_ per clock cycle, and those are
>64-bit floating point operations.  The 486 has no chance.

16-processor 100 MHz Cray running Cray Blitz ==> 500k nps
(I remember you posting here that it could do 29 integer instructions a clock;
now you post 6-8 operations a clock. Still good compared to the 1 or 2
the 486 can do per clock, but your Cray Blitz didn't use them at all.)

16-processor 486 at 100 MHz == 1.6 GHz aggregate

1.6 GHz K7 running Crafty ==> 1.2 million nodes a second.

So let's compare the totals again:

1.6 GHz of Cray processing power, with very fast RAM and doing 6-8 operations a
clock according to your latest quote ==> 500k nps

1.6 GHz K7, doing at most 3 instructions a clock and with something like 300
cycles of latency to fetch a 64-byte cache line ==> 1.2 million nps
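
Normalizing both by aggregate clock, using only the figures above (my own
arithmetic, not anything Bob posted):

  Cray:   500,000 nps / 1.6 GHz aggregate ~= 312,500 nodes per GHz
  K7:   1,200,000 nps / 1.6 GHz            = 750,000 nodes per GHz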

Chess programs have a lot of branches, and those are *unavoidable*. The latest
Cray processor released is clocked at 1 GHz, so a 1 GHz McKinley beats it for
chess programs hands down.

Even old 18.xx versions of Crafty get something like 1.5 million nodes a
second, if I remember what you posted here correctly.

You just gotta know how to use the Cray right, Bob?

Best regards,
Vincent

>The Cray T932 was the last 64-bit machine they built that I used.  And it
>can produce a FLOP count that no PC on the planet can come within a factor of
>10 of, and that is being very generous.  2ns clock, 32 CPUs, each CPU can read
>four words and write two words to memory per clock cycle, and with vector
>chaining it can do at _least_ eight floating point operations per cycle per
>CPU.
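
Taking those numbers at face value, the peak rate works out to roughly (my own
back-of-the-envelope, not a figure Bob posted):

  2ns clock = 500 MHz;  32 CPUs x 8 flops/CPU/clock x 500 MHz = 128 GFLOPS peak
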
>
>
>>
>>>>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
>>>>using them. That's why they're not generated.
>>>Under what circumstance?  It is possible to have totally unpredictable
>>
>>Under the circumstance of running on a P6, like I said. The P6 has no real
>>support for cmovs; it microcodes them down to several uops and the whole process
>>takes longer than if you just did a branch.
>
>
>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>It used a cmov, and it wasn't slower than the one with a branch.  If the
>branch is accurately predictable, they should probably break even.  If the
>branch is not very predictable, then the cmov should be faster...  I don't
>have the details handy, but I can't imagine it turning into "several" uops.
>Two or three, perhaps, assuming we are not talking about a memory reference
>thrown in, which would add one.
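
For what it's worth, a C-level sketch of a branchless FirstOne() along those
lines (my own guess at the shape, not Bob's actual asm; the bit-numbering
convention and the return value for an empty bitboard are arbitrary here):

  #include <stdint.h>

  /* index of the highest set bit, counted from the least significant end;
     returns 64 for an empty bitboard */
  static int FirstOne(uint64_t bb)
  {
      /* bb | 1 keeps the gcc builtin __builtin_clzll defined when bb == 0,
         without changing the answer for any nonzero bb */
      int idx = 63 - __builtin_clzll(bb | 1);
      return bb ? idx : 64;   /* this select is the cmov candidate */
  }
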
>
>
>
>
>>
>>>And I believe the VC _will_ produce CMOV instructions, but you have to
>>
>>Eugene can speak to this better than I can, but I don't think VC ever produces
>>cmovs regardless of how you configure it.
>>
>>-Tom
>
>
>I'm not sure why, if that is true.  The PIV has an even longer pipeline
>with a larger branch misprediction penalty...


