Computer Chess Club Archives


Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Robert Hyatt

Date: 19:39:48 02/11/03


On February 11, 2003 at 18:55:53, Tom Kerrigan wrote:

>I saw some of your replies last night but can't see them anymore. I'll just
>write a post with what I think are key points. That's probably for the best, as
>I don't have enough time to reply to everything.
>
>There seems to be some confusion about out of order execution. It is not a fix
>for poor compiler optimization (as Hyatt suggested). What happens is the chip
>predicts the instruction stream according to very accurate (> 90%) dynamic
>branch prediction, and then reorders the instructions to maximize ILP. That
>means you can grab, say, an instruction 3 branches in the future and execute it
>with the "current" instruction if you have an available ALU. Of course, no
>compiler can do this. It's why OOOE is credited with a 30% performance gain.
>


Your explanation was not bad, but your "no compiler can do this" is dead
wrong.  Visit Cray Research, search for their CFT compiler (or their C
compiler) and see if you can find some papers on their optimizing.

They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
way back up in the instruction stream so that values are available when needed,
which is _exactly_ what your OOO approach is doing in the hardware.
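
A minimal sketch of the hoisting idea in C, written by hand rather than taken from any Cray compiler output. The function names are hypothetical, and the hoisted version assumes both pointers are always valid to dereference; that assumption is what lets the loads move above the test.

```c
/* As written: each load sits below the test, so it cannot start
 * until the direction of the branch is known. */
int pick_naive(int cond, const int *a, const int *b)
{
    if (cond)
        return *a + 1;
    else
        return *b + 1;
}

/* Hand-hoisted: both loads are lifted above the test, so their
 * values are already available when the choice is made.
 * ASSUMPTION: a and b are both valid pointers, otherwise the
 * load on the path not taken could fault. */
int pick_hoisted(int cond, const int *a, const int *b)
{
    int va = *a;   /* hoisted load */
    int vb = *b;   /* hoisted load */
    return (cond ? va : vb) + 1;
}
```

Both functions return the same result for the same inputs. The difference is only in where the loads sit relative to the test, which is the kind of movement a scheduling compiler performs when it can prove the move is safe.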

The only thing OOO can do that a compiler can't is take advantage of renaming
to discover parallelism that doesn't exist from the compiler's point of view,
since the compiler can't see the extra rename registers.  I'd suspect that
without renaming, the OOO part of the Pentium would be essentially worthless.


>IA-64's answer to this is predication, which allows you to start executing both
>code paths following a "branch" (test) before the branch is resolved. So the
>effect is similar to reordering but you waste resources following the wrong code
>path, so you want to avoid predication with branches that are likely to be
>predicted correctly, and when you avoid predication, you lose the benefit of
>pseudo-reordering.
>
>To see all of this explanation in action, just look at existing processors.
>First, the US3 is the only in-order "high performance" RISC chip and its
>performance is miserable compared to similarly clocked chips. That's why the
>next US will be OOO. Incidentally, the US3's performance is much worse if the
>compiler doesn't schedule the instructions for parallel execution, so that's not
>the problem. Second, the ARM is extremely fast compared to similar chips, and
>people credit that to predication, even though predication is used for only
>about 5% of branches on average. (To avoid wasting resources, as explained
>above.) McKinley, an in-order chip with predication and massive execution
>resources performs similarly to out-of-order chips with fewer resources.
>
>My point is that IA-64 does not offer better performance than regular OOO
>designs in theory or in practice, but it does increase code/compiler complexity,
>limits clock speed (with the huge register set), and prevents SMT.

I would not say that either is particularly "better".  They are "different"
with different approaches to the same problem.  The advantage of an IA-64-type
approach is that you can stretch the VLIW approach quite a ways, while it
gets harder and harder to do it in an OOO architecture.  You end up with more
hardware in the reorder buffer logic than you have in the actual pipelines
that do the real computation.



>
>Ideally, we'd have a normal RISC instruction set with predication, speculative
>loads, and branch hinting. All of the benefits of IA-64 and none of the
>disadvantages. Alpha was going in that direction. x86-64 is halfway there (cmovs
>are predication of sorts) but the instruction set unfortunately isn't RISC.


I would not disagree at all.  The best solution obviously lies somewhere
between the two extreme solutions.



>
>I predict that future IA-64 processors will break up instruction words and
>execute the 40-bit instructions OOO to increase performance. I found one
>research paper on the web that discusses that already.

Perhaps.  However the non-OOO Cray has always been right at the top of the
overall performance heap, so that approach can fly as well and it has certainly
passed "the test of time".  It isn't a VLIW machine in any form, but it also
isn't an OOO machine either.  Yet it is fast as hell, mainly because the
compiler can do the same sort of things that you might do with predication,
as one example.  When you think about it, the compiler _must_ be able to
do the same things that are done with predication, along the same lines as the
classic proof that the four-tape Turing machine offers no additional
computational capability that is not available in a single-tape machine, if you
are willing to do the "work"...
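
One way to picture "doing the work" in software is if-conversion: compute both candidate values and select one with a mask instead of a jump. The sketch below is only an illustration of that idea, not what any particular compiler emits, and the function names are hypothetical.

```c
#include <stdint.h>

/* Branching form: control flow decides which value is returned. */
uint32_t select_branch(int cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* If-converted form: both values are already in hand and a mask
 * picks one, so no conditional jump is needed in the source. */
uint32_t select_mask(int cond, uint32_t a, uint32_t b)
{
    /* mask is all ones when cond is nonzero, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

The two functions agree on every input. The masked form trades a possible misprediction for a little extra arithmetic on every call, which is the same trade-off predication makes in hardware.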





>
>As for McKinley performing well for Crafty, sure. Take a program that relies
>heavily on 64-bit operations and has lots of unpredictable branches and stresses
>way more than 8k of data and it's obviously going to run badly on a 32-bit chip
>with a long pipeline and an 8k L1 data cache. Let's see how well Crafty runs on
>Hammer; 64-bit, shorter pipeline, 64k L1...
>
>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
>using them. That's why they're not generated.


Under what circumstance?  It is possible to have totally unpredictable
branches, and for those cases, cmov will blow the doors off the non-cmov
code.  The PIV has a very classy prediction scheme, better than anything yet
done that I am aware of.  But it _still_ suffers from mis-prediction, and for
things like  x=(wtm) ? 0 : 1; to invert wtm, cmov is _very_ effective without
resorting to a branch...
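
For reference, here are a few equivalent ways to write that inversion in C. This is a sketch with hypothetical function names. Whether a given compiler turns any of these into a cmov, a setcc, or a real branch depends on the compiler and the target flags, so check the assembly output rather than assuming. The XOR form additionally assumes wtm is only ever 0 or 1.

```c
/* Explicit branch, as a naive compiler might translate it. */
int invert_branch(int wtm)
{
    if (wtm)
        return 0;
    return 1;
}

/* The ternary form from the post; a candidate for a conditional move. */
int invert_ternary(int wtm)
{
    return wtm ? 0 : 1;
}

/* Logical negation: yields 1 for zero, 0 for any nonzero value. */
int invert_logical(int wtm)
{
    return !wtm;
}

/* Pure arithmetic, no condition at all.
 * ASSUMPTION: wtm is strictly 0 or 1. */
int invert_xor(int wtm)
{
    return wtm ^ 1;
}
```

All four agree when wtm is 0 or 1. Only the first three agree for other nonzero values.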

And I believe VC _will_ produce CMOV instructions, but you have to
specifically tell it to produce P6 code only, not something that will run
on all architectures.  Intel's compiler produces them, or at least it did the
last time I spent any effort looking at .s output...  GCC also does this and
has caused some problems for me on early AMDs that didn't support CMOV...

BTW, in case you didn't see it, some posts were somehow dropped into /dev/null
by some sort of glitch.  Whether they have come back or not I have not yet
checked...


>
>-Tom



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.