Computer Chess Club Archives

Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Matt Taylor

Date: 22:58:36 02/11/03

On February 11, 2003 at 18:55:53, Tom Kerrigan wrote:

>I saw some of your replies last night but can't see them anymore. I'll just
>write a post with what I think are key points. That's probably for the best, as
>I don't have enough time to reply to everything.
>
>There seems to be some confusion about out of order execution. It is not a fix
>for poor compiler optimization (as Hyatt suggested). What happens is the chip
>predicts the instruction stream according to very accurate (> 90%) dynamic
>branch prediction, and then reorders the instructions to maximize ILP. That
>means you can grab, say, an instruction 3 branches in the future and execute it
>with the "current" instruction if you have an available ALU. Of course, no
>compiler can do this. It's why OOOE is credited with a 30% performance gain.

Branch prediction is mostly unrelated to OOOE. The IA-64 does have branch
prediction. It does not have OOOE.

There is no reason the compiler cannot do the same thing that OOOE hardware
does. If the performance gain of OOOE comes from the compiler being unable to
schedule, then a compiler that can schedule eliminates the need for OOOE.
Compilers already do scheduling -- most just don't try to maximize IPC the way
OOOE hardware does; they try to avoid data dependencies. One example is moving
loop invariants outside of loops, something GCC does routinely. Another is
preloading registers with function pointers. The list goes on and on.

OOOE is more flexible than compiler-scheduled code because the architecture is
free to change without breaking legacy code or requiring compiler rewrites.

On a final note, the Athlon has a 72-entry integer scheduler and IIRC a 36-entry
FP scheduler. The Athlon can therefore see up to 72 instructions ahead.
(Remember -- one DirectPath instruction translates 1:1 to a macro-op, or that is
the impression I get from the docs anyway.) The compiler can still see further.

>IA-64's answer to this is predication, which allows you to start executing both
>code paths following a "branch" (test) before the branch is resolved. So the
>effect is similar to reordering but you waste resources following the wrong code
>path, so you want to avoid predication with branches that are likely to be
>predicted correctly, and when you avoid predication, you lose the benefit of
>pseudo-reordering.

No. Predication is the IA-64's answer to branch prediction. Predication is
completely unrelated to OOOE.

>To see all of this explanation in action, just look at existing processors.
>First, the US3 is the only in-order "high performance" RISC chip and its
>performance is miserable compared to similarly clocked chips. That's why the
>next US will be OOO. Incidentally, the US3's performance is much worse if the
>compiler doesn't schedule the instructions for parallel execution, so that's not
>the problem. Second, the ARM is extremely fast compared to similar chips, and
>people credit that to predication, even though predication is used for only
>about 5% of branches on average. (To avoid wasting resources, as explained
>above.) McKinley, an in-order chip with predication and massive execution
>resources performs similarly to out-of-order chips with fewer resources.
>
>My point is that IA-64 does not offer better performance than regular OOO
>designs in theory or in practice, but it does increase code/compiler complexity,
>limits clock speed (with the huge register set), and prevents SMT.

You have not discussed theory at all. The ISA design is very nice for promoting
IPC.

Dr. Hyatt's figures "in practice" still show a 1 GHz McKinley running 4 times
faster clock-for-clock than a Pentium 4.

Scheduling code is not particularly complex to write, and good compiler code is
very modular. It should not be a problem to add scheduling.

Clock speed limitations are beyond my discussion as the original question did
not ask about the future of the architectures. I would have to defer the
question anyway as I am more interested in the software side.

HT is still possible (unless you mean to say that HT is difficult because of
register file size), but for practicality I would again defer.

The last two questions are more an answer to the question, "How will the IA-64
scale?" The question I am answering (and the question originally asked) is, "How
fast is the IA-64 currently compared to the IA-32/AA-64 in Chess?"

>Ideally, we'd have a normal RISC instruction set with predication, speculative
>loads, and branch hinting. All of the benefits of IA-64 and none of the
>disadvantages. Alpha was going in that direction. x86-64 is halfway there (cmovs
>are predication of sorts) but the instruction set unfortunately isn't RISC.

Yes, the ISA is CISC. The chip itself is RISC; older CISC instructions are
emulated with a microcode assist. It can issue 3 ALU ops per cycle, 2 FP ops per
cycle, or 4 MMX (SSE too?) ops per cycle.

>I predict that future IA-64 processors will break up instruction words and
>execute the 40-bit instructions OOO to increase performance. I found one
>research paper on the web that discusses that already.

Perhaps. The original discussion was not about the future.

>As for McKinley performing well for Crafty, sure. Take a program that relies
>heavily on 64-bit operations and has lots of unpredictable branches and stresses
>way more than 8k of data and it's obviously going to run badly on a 32-bit chip
>with a long pipeline and an 8k L1 data cache. Let's see how well Crafty runs on
>Hammer; 64-bit, shorter pipeline, 64k L1...

Athlon has a 64K L1 data cache. Overclocked Athlons still do not hit 2 MN/sec
like McKinley. Aaron Gordon has compiled a fair-sized list of data for Crafty
including overclocked chips. The fastest Athlon sold is 2.25 GHz -- the AthlonXP
2800 (Thoroughbred-B). The fastest Pentium 4 sold is the 3.06 GHz w/HT.

http://speedycpu.dyndns.org/crafty/bench.html

The x86-64 will change a lot of things, and it will be difficult to say which
benefits Chess the most -- the reduced memory access latency, the extra
registers, or the 64-bitness (for bitboard engines like Crafty)?

>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
>using them. That's why they're not generated.
>
>-Tom

Mispredicted branches are slower than cmov on the P6. According to the P6
optimization manual, cmov takes 2-3 micro-ops. (An extra load micro-op is
generated when cmov references memory.)

I believe the real reason VC does not generate cmov, even when optimizing for
P6, is that it maintains 386 compatibility. Unfortunately there is no flag that
says, "I want you to use P6 instructions." GCC does have one, as I was corrected
earlier: GCC will generate cmov when you select the Athlon architecture. I have
not tried others, but I presume it does the same for the P6. The default is
likewise 386 compatibility, I believe.

-Matt


