Author: Robert Hyatt
Date: 09:04:38 02/13/03
On February 13, 2003 at 00:26:15, Vincent Diepeveen wrote:

>On February 12, 2003 at 00:37:13, Robert Hyatt wrote:
>
>>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>>
>>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>>
>>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>>wrong. Visit Cray Research, search for their CFT compiler (or their C
>>>>compiler) and see if you can find some papers on their optimizing.
>>>>They _do_ exactly what you describe. They "lift" (or "hoist") instructions
>>>>way back up in the instruction stream so that values are available when
>>>>needed, which is _exactly_ what your OOO approach is doing in the hardware.
>>>
>>>They must be doing this according to static branch prediction, which is maybe
>>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's
>>>just a stream of instructions.
>>
>>No No No. They do much of this with 100% accuracy. Because they make sure
>>that the critical instructions are executed in _every_ path that reaches a
>>critical point in the data-flow analysis of the program (the dependency graph
>>for gcc users)...
>>
>>BTW OOOE has a huge limit. Something like 40-60 (I don't have my P6/etc
>>manuals here at home) micro-ops in the reorder buffer. No way to do any
>>OOOE beyond that very narrow peephole, while the compiler can see _much_
>>more, as much as it wants (and has the compile time) to look at...
>>
>>Someone posted an example of such code (I think it was Matt) showing
>>Vincent how to dump branches. That is the idea here. The advantage of
>>OOO execution is still around, but it is not as significant.
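[Editor's note: the "dump branches" transformation being discussed can be sketched in a few lines of C. This is an illustration, not Matt's actual example (which is not reproduced in this thread), and the function names are hypothetical. Both functions compute the same score, but the second executes the add on every path and selects with a mask, so there is no branch for the predictor, or the compiler's static scheduler, to get wrong:]

```c
/* Branchy version: the bonus is added on only one path, so the
 * compiler cannot schedule the add until the branch resolves. */
int score_branchy(int passed, int base, int bonus) {
    if (passed)
        return base + bonus;
    return base;
}

/* Branch-free version: do the work unconditionally and select the
 * result. This is the effect of a compiler hoisting the instruction
 * into every path (or of a hardware cmov). */
int score_branchless(int passed, int base, int bonus) {
    int mask = -(passed != 0);   /* all ones if passed, else zero */
    return base + (bonus & mask);
}
```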
>>This being the fact that the _real_ code doesn't get bigger, while when
>>the compiler is busy doing these same kinds of optimizations, it is
>>replicating various instructions to be sure they are completed by the time
>>the DG says the result is going to be needed. So there is a bit of memory
>>savings when the processor does the OOO stuff, and there is the advantage
>>of exposing more registers when the real instructions get turned into
>>micro-ops... but at least the latter is more a result of a horrible
>>architecture (8 registers) as opposed to the fact the OOO execution is a
>>huge boon for other architectures that are not so register-challenged...
>>
>>>>I would not say that either is particularly "better". They are "different"
>>>>with different approaches to the same problem. The advantage of a ia64-type
>>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>>gets harder and harder to do it in an OOO architecture. You end up with more
>>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>>that do the real computation.
>>>
>>>Is that causing a problem other than offending some people's sensibilities?
>>>The EV8 was going to have 8 int ALUs and it would have been perfectly viable
>>>with today's processes.
>>
>>Sure. But given the choice of OOOE with 8 int alus, or no OOOE with 16
>>int alus and an instruction package large enough to feed them all, I would
>>consider the latter seriously...
>>
>>>>Perhaps. However the non-OOO Cray has always been right at the top of the
>>>>overall performance heap, so that approach can fly as well and it has
>>>>certainly
>>>
>>>I don't know much about Crays but a friend of mine told me that he ran some
>>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have
>>>any Crays been built in the last several years using actual Cray processors?
>>
>>Your friend was nuts. The first Cray-1 at 12.5ns clock would blow off any
>>486 ever made.
>>That machine could do 80 mips, which would not be bad for a 486. But it
>>could do 6-8 _operations_ per clock cycle, and those are 64 bit floating
>>point operations. The 486 has no chance.
>
>16 processor 100Mhz Cray with cray blitz ==> 500k nps
>(i remember you posting here it could do 29 integer instructions a clock.
>now you post 6-8 operations a clock. still good compared to the 1 or 2
>the 486 can do a second. but your cray blitz didn't use them at all).

First, you have a _real_ problem paying attention. The DTS article was
written using a Cray C90, with 16 processors running at 4.1ns per clock, or
about 250MHz. No idea where your "100MHz x 16" comes from, but it is wrong.
The machine I mentioned as the fastest Cray I ever personally ran on was a
32-processor T90, running at 2ns per clock, or 500MHz.

I have no idea where you got "29 integer instructions per clock," as that is
not a number I have _ever_ used. And I believe that most understand vector
machines and vector chaining, and realize that in a single clock a single
CPU can do at least four simultaneous operations; since the T90 does an
operation on a pair of values rather than on just one, that doubles to at
least 8 operations per clock cycle, per CPU. Which has _nothing_ to do with
instructions per second or anything else, since this is a _vector_
architecture.

>16 processor 486 100Mhz == 1.6ghz
>
>1.6Ghz K7 crafty ==> 1.2 MLN nodes a second.
>
>So let's compare the totals again:
>
>1.6Ghz of Cray processing power with very fast RAM and doing 6-8 operations
>a clock according to your latest quote ==> 500k nps

Using the C90 numbers, that is correct. Using the T90 numbers, that scales
to about 7M nodes per second, as you well know, because I played a match
against a T90 using my quad 700, and the quad 700 got smashed.
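[Editor's note: for readers unfamiliar with vector chaining, the canonical loop that chains well can be sketched in ordinary C. This is an illustration, not Cray code; a vectorizing compiler such as Cray's CFT would turn the loop body into vector multiply and vector add instructions, with the multiply unit's result stream feeding the add unit directly:]

```c
/* axpy: y[i] = a*x[i] + y[i]. Two floating point operations per
 * element; on a vector machine the multiply pipe's output "chains"
 * into the add pipe, so once the pipelines fill, both functional
 * units retire a result every clock. */
void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```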
>1.6Ghz K7 doing at most 3 instructions a clock and like 300 cycles latency
>to get a 64 bytes cache line ==> 1.2MLN a second
>
>For chessprograms which have a lot of branches and those are *unavoidable*,
>Latest Cray processor released is clocked at 1Ghz, so a 1 Ghz McKinley beats
>that for chess programs hands down.

That simply exposes your ignorance of what vector machines are all about,
and I don't believe I can correct that ignorance with a short post here, so
I won't try. Find any good architecture book and read it. Then you will see
why you don't make comparisons like the one above and have people laughing
at the comments.

>Even old 18.xx crafties get like 1.5MLN nodes a second or so if i remember
>well what you posted here.

On what machine? 18.xx got 1M on my quad 700, up to a max of about 1.5M.

>You just gotta know how to use the cray right Bob?

That's the point. It's all about vector operations.

>Best regards,
>Vincent

>>The Cray T932 was the last 64 bit machine they built that I used. And it
>>can produce a FLOP count that no PC on the planet can come within a factor
>>of 10 of, and that is being very generous. 2ns clock, 32 cpus, each cpu can
>>read four words and write two words to memory per clock cycle, and with
>>vector chaining, it can do at _least_ eight floating point operations per
>>cycle per CPU.
>>
>>>>>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower
>>>>>than not using them. That's why they're not generated.
>>>>
>>>>Under what circumstance? It is possible to have totally unpredictable
>>>
>>>Under the circumstance of running on a P6, like I said. The P6 has no real
>>>support for cmovs; it microcodes them down to several uops and the whole
>>>process takes longer than if you just did a branch.
>>
>>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>>It used a cmov, and it wasn't slower than the one with a branch. If the
>>branch is accurately predictable, they should probably break even.
>>If the branch is not very predictable, then the cmov should be faster... I
>>don't have the details handy, but I can't imagine it turning into "several"
>>uops. Two or three, perhaps, assuming we are not talking about a memory
>>reference thrown in, which would add one.
>>
>>>>And I believe the VC _will_ produce CMOV instructions, but you have to
>>>
>>>Eugene can speak to this better than I can, but I don't think VC ever
>>>produces cmovs regardless of how you configure it.
>>>
>>>-Tom
>>
>>I'm not sure why, if that is true. The PIV has an even longer pipeline
>>with a larger branch misprediction penalty...
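[Editor's note: Hyatt's branchless FirstOne() asm is not reproduced in the thread, but the idea can be sketched in portable C. The version below is a standard branchless binary search for the highest set bit of a 64-bit bitboard; the name `first_one` and the MSB-index convention are assumptions, not Hyatt's actual code. Each comparison produces 0 or 1 without a conditional branch, which compilers typically lower to setcc/cmov-style sequences:]

```c
/* Branchless highest-set-bit scan for a 64-bit bitboard.
 * Each step asks "is the value wider than the remaining half?";
 * the comparison yields 0 or 1 with no branch, and that bit steers
 * the shift amount. Returns 0..63; caller must ensure bb != 0. */
int first_one(unsigned long long bb) {
    int r = 0, s;
    s = (bb > 0xFFFFFFFFULL) << 5; bb >>= s; r |= s;
    s = (bb > 0xFFFFULL)     << 4; bb >>= s; r |= s;
    s = (bb > 0xFFULL)       << 3; bb >>= s; r |= s;
    s = (bb > 0xFULL)        << 2; bb >>= s; r |= s;
    s = (bb > 0x3ULL)        << 1; bb >>= s; r |= s;
    return r | (int)(bb >> 1);
}
```

Whether this beats a branchy scan depends, exactly as the thread says, on how predictable the branch would have been.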
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.