Author: Robert Hyatt
Date: 19:39:48 02/11/03
On February 11, 2003 at 18:55:53, Tom Kerrigan wrote:

>I saw some of your replies last night but can't see them anymore. I'll just
>write a post with what I think are key points. That's probably for the best, as
>I don't have enough time to reply to everything.
>
>There seems to be some confusion about out of order execution. It is not a fix
>for poor compiler optimization (as Hyatt suggested). What happens is the chip
>predicts the instruction stream according to very accurate (> 90%) dynamic
>branch prediction, and then reorders the instructions to maximize ILP. That
>means you can grab, say, an instruction 3 branches in the future and execute it
>with the "current" instruction if you have an available ALU. Of course, no
>compiler can do this. It's why OOOE is credited with a 30% performance gain.
>

Your explanation was not bad, but your "no compiler can do this" is dead wrong.
Visit Cray Research, search for their CFT compiler (or their C compiler) and see
if you can find some papers on their optimizer. They _do_ exactly what you
describe. They "lift" (or "hoist") instructions way back up in the instruction
stream so that values are available when needed, which is _exactly_ what your
OOO approach is doing in the hardware.

The only thing OOO can do that a compiler can't is take advantage of renaming to
discover inherent parallelism that doesn't exist from the compiler's point of
view, since it can't see the extra rename registers. I'd suspect that without
renaming, the OOO part of the Pentium would be essentially worthless.

>IA-64's answer to this is predication, which allows you to start executing both
>code paths following a "branch" (test) before the branch is resolved. So the
>effect is similar to reordering but you waste resources following the wrong code
>path, so you want to avoid predication with branches that are likely to be
>predicted correctly, and when you avoid predication, you lose the benefit of
>pseudo-reordering.
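
The predication idea in the quoted paragraph can be sketched at the source level. This is only an analogy in C, assuming a simple two-armed test; it is not IA-64 predicate-register code, and the function names `pick_branch` and `pick_if_converted` are made up for illustration. The second function computes both arms unconditionally and then keeps one, which is roughly the trade described above: no branch to mispredict, but the work on the losing path is wasted.

```c
#include <stdint.h>

/* Hypothetical names; unsigned arithmetic is used so that computing
   the arm that is not selected can never overflow into undefined
   behavior. */

/* Branching form: only the selected arm is executed. */
uint32_t pick_branch(int cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a + 1u;
    return b * 2u;
}

/* If-converted form: both arms are always computed, then a mask
   keeps exactly one result. There is no branch in the source. */
uint32_t pick_if_converted(int cond, uint32_t a, uint32_t b)
{
    uint32_t then_val = a + 1u;                 /* always computed */
    uint32_t else_val = b * 2u;                 /* always computed */
    uint32_t mask = 0u - (uint32_t)(cond != 0); /* all ones if cond, else 0 */

    return (then_val & mask) | (else_val & ~mask);
}
```

Both functions return the same value for every input. Which one is faster depends on how predictable `cond` is and on how expensive the discarded arm is.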
>
>To see all of this explanation in action, just look at existing processors.
>First, the US3 is the only in-order "high performance" RISC chip and its
>performance is miserable compared to similarly clocked chips. That's why the
>next US will be OOO. Incidentally, the US3's performance is much worse if the
>compiler doesn't schedule the instructions for parallel execution, so that's not
>the problem. Second, the ARM is extremely fast compared to similar chips, and
>people credit that to predication, even though predication is used for only
>about 5% of branches on average. (To avoid wasting resources, as explained
>above.) McKinley, an in-order chip with predication and massive execution
>resources performs similarly to out-of-order chips with fewer resources.
>
>My point is that IA-64 does not offer better performance than regular OOO
>designs in theory or in practice, but it does increase code/compiler complexity,
>limits clock speed (with the huge register set), and prevents SMT.

I would not say that either is particularly "better". They are "different", with
different approaches to the same problem. The advantage of an IA-64-type approach
is that you can stretch the VLIW approach quite a ways, while it gets harder and
harder to do that in an OOO architecture. You end up with more hardware in the
reorder buffer logic than you have in the actual pipelines that do the real
computation.

>
>Ideally, we'd have a normal RISC instruction set with predication, speculative
>loads, and branch hinting. All of the benefits of IA-64 and none of the
>disadvantages. Alpha was going in that direction. x86-64 is halfway there (cmovs
>are predication of sorts) but the instruction set unfortunately isn't RISC.

I would not disagree at all. The best solution obviously lies somewhere between
the two extreme solutions.

>
>I predict that future IA-64 processors will break up instruction words and
>execute the 40-bit instructions OOO to increase performance.
>I found one
>research paper on the web that discusses that already.

Perhaps. However, the non-OOO Cray has always been right at the top of the
overall performance heap, so that approach can fly as well, and it has certainly
passed "the test of time". It isn't a VLIW machine in any form, and it isn't an
OOO machine either. Yet it is fast as hell, mainly because the compiler can do
the same sort of things that you might do with predication, as one example. When
you think about it, the compiler _must_ be able to do the same things that are
done with predication, along the same lines as the classic proof that the
four-tape Turing machine offers no additional computational capability that is
not available in a single-tape machine, if you are willing to do the "work"...

>
>As for McKinley performing well for Crafty, sure. Take a program that relies
>heavily on 64-bit operations and has lots of unpredictable branches and stresses
>way more than 8k of data and it's obviously going to run badly on a 32-bit chip
>with a long pipeline and an 8k L1 data cache. Let's see how well Crafty runs on
>Hammer; 64-bit, shorter pipeline, 64k L1...
>
>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower than not
>using them. That's why they're not generated.

Under what circumstance? It is possible to have totally unpredictable branches,
and for those cases, cmov will blow the doors off the non-cmov code. The PIV has
a very classy prediction scheme, better than anything yet done that I am aware
of. But it _still_ suffers from mis-prediction, and for things like

x=(wtm) ? 0 : 1;

to invert wtm, cmov is _very_ effective without resorting to a branch...

And I believe VC _will_ produce CMOV instructions, but you have to specifically
tell it to produce P6 code only, not something that will run on all
architectures. Intel's compiler produces them, or at least it did the last time
I spent any effort looking at .s output...
GCC also does this, and it has caused some problems for me on early AMDs that
didn't support CMOV...

BTW, if you didn't read about it, posts were somehow dropped into /dev/null by
some sort of glitch. Whether they have come back or not I have not yet checked...

>
>-Tom
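
For reference, the wtm inversion discussed in the reply can be written several ways in C. The sketch below is illustrative only: the function names are hypothetical and nothing here is taken from Crafty's source. Which machine instructions come out (a conditional jump, CMOV, or SETcc) is the compiler's choice and depends on the target it is told to generate code for, so none of these forms guarantees a CMOV.

```c
/* All function names here are hypothetical, chosen for illustration. */

/* Explicit branch: the most literal way to write the test. */
int invert_branch(int wtm)
{
    if (wtm)
        return 0;
    return 1;
}

/* The expression from the post. A compiler may lower this to CMOV,
   to SETcc, or to a branch; the source does not force any of them. */
int invert_ternary(int wtm)
{
    return wtm ? 0 : 1;
}

/* Logical NOT: same result for every input, and no branch is
   written in the source. */
int invert_not(int wtm)
{
    return !wtm;
}

/* General two-way select between arbitrary values, the case where
   a conditional move is the natural fit. */
int select_value(int cond, int if_true, int if_false)
{
    return cond ? if_true : if_false;
}
```

Compiling with something like `gcc -O2 -S` and reading the `.s` output, as the reply describes, shows which form a given compiler and target actually picked.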
Last modified: Thu, 15 Apr 21 08:11:13 -0700