Computer Chess Club Archives


Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Robert Hyatt

Date: 21:37:13 02/11/03



On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:

>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>
>>Your explanation was not bad, but your "no compiler can do this" is dead
>>wrong.  Visit Cray Research, search for their CFT compiler (or their C
>>compiler) and see if you can find some papers on their optimizing.
>>They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
>>way back up in the instruction stream so that values are available when needed,
>>which is _exactly_ what your OOO approach is doing in the hardware.
>
>They must be doing this according to static branch prediction, which is maybe
>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just
>a stream of instructions.
>
No, no, no.  They do much of this with 100% accuracy, because they make sure
that the critical instructions are executed on _every_ path that reaches a
critical point in the data-flow analysis of the program (the dependency graph,
for gcc users)...

BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
manuals here at home) micro-ops in the reorder buffer.  No way to do any
OOOE beyond that very narrow peephole, while the compiler can see _much_
more, as much as it wants (and has the compile time) to look at...

Someone posted an example of such code (I think it was Matt) showing
Vincent how to dump branches.  That is the idea here.  OOO execution still
has an advantage, but it is not as significant: the _real_ code doesn't get
bigger, while when the compiler does these same kinds of optimizations it
replicates various instructions to be sure they are completed by the time
the dependency graph says the result will be needed.  So there is a bit of
memory savings when the processor does the OOO stuff, and there is the
advantage of exposing more registers when the real instructions get turned
into micro-ops...  but the latter is more a result of a horrible
architecture (8 registers) than proof that OOO execution is a huge boon for
other architectures that are not so register-challenged...



>>I would not say that either is particularly "better".  They are "different"
>>with different approaches to the same problem.  The advantage of a ia64-type
>>approach is that you can stretch the VLIW approach quite a ways, while it
>>gets harder and harder to do it in an OOO architecture.  You end up with more
>>hardware in the reorder buffer logic than you have in the actual pipelines
>>that do the real computation.
>
>Is that causing a problem other than offending some people's sensibilities? The
>EV8 was going to have 8 int ALUs and it would have been perfectly viable with
>today's processes.

Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
int alus and an instruction package large enough to feed them all, I would
consider the latter seriously...



>
>>Perhaps.  However the non-OOO Cray has always been right at the top of the
>>overall performance heap, so that approach can fly as well and it has certainly
>
>I don't know much about Crays but a friend of mine told me that he ran some
>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any
>Crays been built in the last several years using actual Cray processors?

Your friend was nuts.  The first Cray-1, at a 12.5ns clock, would blow off
any 486 ever made.  That machine could do 80 MIPS, which would not be bad
for a 486, but it could do 6-8 _operations_ per clock cycle, and those are
64-bit floating-point operations.  The 486 has no chance.

The Cray T932 was the last 64-bit machine they built that I used.  And it
can produce a FLOP count that no PC on the planet can come within a factor
of 10 of, and that is being very generous.  2ns clock, 32 CPUs; each CPU can
read four words and write two words to memory per clock cycle, and with
vector chaining it can do at _least_ eight floating-point operations per
cycle per CPU.


>
>>>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
>>>using them. That's why they're not generated.
>>Under what circumstance?  It is possible to have totally unpredictable
>
>Under the circumstance of running on a P6, like I said. The P6 has no real
>support for cmovs; it microcodes them down to several uops and the whole process
>takes longer than if you just did a branch.


I did a branchless FirstOne() in asm a few weeks back here, just to test.
It used a cmov, and it wasn't slower than the one with a branch.  If the
branch is accurately predictable, they should probably break even.  If the
branch is not very predictable, then the cmov should be faster...  I don't
have the details handy, but I can't imagine it turning into "several" uops.
Two or three, perhaps, assuming we are not talking about a memory reference
thrown in, which would add one.




>
>>And I believe the VC _will_ produce CMOV instructions, but you have to
>
>Eugene can speak to this better than I can, but I don't think VC ever produces
>cmovs regardless of how you configure it.
>
>-Tom


I'm not sure why, if that is true.  The PIV has an even longer pipeline
with a larger branch misprediction penalty...


