Computer Chess Club Archives


Subject: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Tom Kerrigan

Date: 15:55:53 02/11/03


I saw some of your replies last night but can't see them anymore. I'll just
write a post with what I think are key points. That's probably for the best, as
I don't have enough time to reply to everything.

There seems to be some confusion about out-of-order execution (OOOE). It is not
a fix for poor compiler optimization (as Hyatt suggested). What happens is the
chip predicts the instruction stream using very accurate (> 90%) dynamic branch
prediction, and then reorders the instructions to maximize ILP. That means you
can grab, say, an instruction 3 branches in the future and execute it alongside
the "current" instruction if you have an available ALU. Of course, no compiler
can do this. It's why OOOE is credited with a 30% performance gain.

IA-64's answer to this is predication, which allows you to start executing both
code paths following a "branch" (test) before the branch is resolved. The
effect is similar to reordering, but you waste resources following the wrong
code path. So you want to avoid predication on branches that are likely to be
predicted correctly, and when you avoid predication, you lose the benefit of
pseudo-reordering.
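
To make the trade-off concrete, here is a rough C-level sketch of the same
test written as a branch and as a branch-free select. This is only an analogy,
not IA-64 code: the function names are made up for illustration, and whether a
compiler turns either form into a branch, a cmov, or predicated instructions
depends on the compiler and target.

```c
#include <stdint.h>

/* Branching form: the hardware has to predict which way the test goes,
 * and only one of the two results is ever produced. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* If-converted form: both candidate results are kept available and a mask
 * picks one. There is no branch to mispredict, but the work for the
 * unused result is spent regardless. */
int32_t max_select(int32_t a, int32_t b)
{
    int32_t take_a = -(int32_t)(a > b);   /* all ones if a > b, else zero */
    return (a & take_a) | (b & ~take_a);
}
```

Both functions return the same value for every input. The second form only
pays off when the test is hard to predict, which is the point made above about
avoiding predication on well-predicted branches.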

To see all of this explanation in action, just look at existing processors.
First, the US3 is the only in-order "high performance" RISC chip and its
performance is miserable compared to similarly clocked chips. That's why the
next US will be OOO. Incidentally, the US3's performance is much worse if the
compiler doesn't schedule the instructions for parallel execution, so a lack of
compiler scheduling is not the problem. Second, the ARM is extremely fast
compared to similar chips, and people credit that to predication, even though
predication is used for only about 5% of branches on average. (To avoid wasting
resources, as explained above.) Third, McKinley, an in-order chip with
predication and massive execution resources, performs similarly to out-of-order
chips with fewer resources.

My point is that IA-64 does not offer better performance than regular OOO
designs in theory or in practice, but it does increase code/compiler complexity,
limit clock speed (with the huge register set), and prevent SMT.

Ideally, we'd have a normal RISC instruction set with predication, speculative
loads, and branch hinting. All of the benefits of IA-64 and none of the
disadvantages. Alpha was going in that direction. x86-64 is halfway there (cmovs
are predication of sorts) but the instruction set unfortunately isn't RISC.

I predict that future IA-64 processors will break up the 128-bit instruction
bundles and execute the individual 41-bit instructions OOO to increase
performance. I found one research paper on the web that discusses that already.
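
For reference, "breaking up instruction words" starts with pulling the three
instruction slots out of a bundle. The sketch below assumes the standard IA-64
bundle layout (a 5-bit template in bits 0-4 followed by three 41-bit slots);
the type and function names are hypothetical, and it only extracts the raw
slot bits. It doesn't decode instructions or model the dependency information
carried by the template.

```c
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves: lo = bits 0-63, hi = bits 64-127. */
typedef struct { uint64_t lo, hi; } ia64_bundle;

/* The template field plus the three raw 41-bit instruction slots. */
typedef struct { uint8_t template_field; uint64_t slot[3]; } ia64_slots;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

ia64_slots split_bundle(ia64_bundle b)
{
    ia64_slots s;
    s.template_field = (uint8_t)(b.lo & 0x1F);              /* bits 0-4    */
    s.slot[0] = (b.lo >> 5) & SLOT_MASK;                    /* bits 5-45   */
    s.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* bits 46-86  */
    s.slot[2] = (b.hi >> 23) & SLOT_MASK;                   /* bits 87-127 */
    return s;
}
```

Slot 1 straddles the two halves, so it is assembled from the top 18 bits of
the low half and the bottom 23 bits of the high half.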

As for McKinley performing well for Crafty, sure. Take a program that relies
heavily on 64-bit operations and has lots of unpredictable branches and stresses
way more than 8k of data and it's obviously going to run badly on a 32-bit chip
with a long pipeline and an 8k L1 data cache. Let's see how well Crafty runs on
Hammer; 64-bit, shorter pipeline, 64k L1...
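
To give a feel for the 64-bit part of that, here is a sketch of a single
64-bit shift done with only 32-bit operations, next to the native version. The
names are made up for illustration and this is not taken from Crafty; it is
just the kind of extra work a 32-bit chip takes on for every 64-bit operation.

```c
#include <stdint.h>

/* A 64-bit value split into two 32-bit halves, as a 32-bit chip sees it. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* Shift left by n, valid for 0 < n < 32, using only 32-bit operations:
 * two shifts for the high half, an OR to carry bits across, and one
 * shift for the low half. */
u64_pair shl_pair(u64_pair x, unsigned n)
{
    u64_pair r;
    r.hi = (x.hi << n) | (x.lo >> (32 - n));
    r.lo = x.lo << n;
    return r;
}

/* The same operation where 64-bit registers are available: one shift. */
uint64_t shl_native(uint64_t x, unsigned n)
{
    return x << n;
}
```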

As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
using them. That's why they're not generated.

-Tom




Last modified: Thu, 15 Apr 21 08:11:13 -0700
