Author: Tom Kerrigan
Date: 15:55:53 02/11/03
I saw some of your replies last night but can't see them anymore, so I'll just write a post with what I think are the key points. That's probably for the best, as I don't have enough time to reply to everything.

There seems to be some confusion about out-of-order execution. It is not a fix for poor compiler optimization (as Hyatt suggested). What happens is that the chip predicts the instruction stream using very accurate (>90%) dynamic branch prediction, then reorders the instructions to maximize ILP. That means it can grab, say, an instruction three branches in the future and execute it alongside the "current" instruction if an ALU is available. Of course, no compiler can do this; it's why OOOE is credited with a 30% performance gain.

IA-64's answer to this is predication, which lets you start executing both code paths following a "branch" (test) before the branch is resolved. The effect is similar to reordering, but you waste resources following the wrong code path, so you want to avoid predication for branches that are likely to be predicted correctly; and when you avoid predication, you lose the benefit of this pseudo-reordering.

To see all of this in action, just look at existing processors. First, the US3 is the only in-order "high performance" RISC chip, and its performance is miserable compared to similarly clocked chips; that's why the next US will be OOO. Incidentally, the US3's performance is much worse if the compiler doesn't schedule the instructions for parallel execution, so compiler scheduling isn't the problem. Second, the ARM is extremely fast compared to similar chips, and people credit that to predication, even though predication is used for only about 5% of branches on average (to avoid wasting resources, as explained above). Third, McKinley, an in-order chip with predication and massive execution resources, only performs similarly to out-of-order chips with fewer resources.
My point is that IA-64 does not offer better performance than regular OOO designs, in theory or in practice, but it does increase code/compiler complexity, limits clock speed (because of the huge register set), and prevents SMT. Ideally, we'd have a normal RISC instruction set with predication, speculative loads, and branch hinting: all of the benefits of IA-64 and none of the disadvantages. Alpha was going in that direction. x86-64 is halfway there (cmovs are predication of sorts), but the instruction set unfortunately isn't RISC. I predict that future IA-64 processors will break up instruction words and execute the 40-bit instructions OOO to increase performance; I found one research paper on the web that already discusses this.

As for McKinley performing well for Crafty: sure. Take a program that relies heavily on 64-bit operations, has lots of unpredictable branches, and stresses way more than 8k of data, and it's obviously going to run badly on a 32-bit chip with a long pipeline and an 8k L1 data cache. Let's see how well Crafty runs on Hammer: 64-bit, shorter pipeline, 64k L1...

As for VC not generating cmovs, Matt: using cmovs on the P6 is slower than not using them. That's why they're not generated.

-Tom