Author: Matt Taylor
Date: 11:54:37 02/12/03
On February 12, 2003 at 11:35:46, Robert Hyatt wrote:

>On February 12, 2003 at 03:13:27, Matt Taylor wrote:
>
>>On February 12, 2003 at 00:23:53, Robert Hyatt wrote:
>>
>>>On February 11, 2003 at 23:27:04, Tom Kerrigan wrote:
>>>
>>>>On February 11, 2003 at 23:11:09, Charles Roberson wrote:
>>>>
>>>>>  Out-of-order execution is nothing more than the ability to execute
>>>>>instructions in an order different from the serial order in the code.
>>>>>It has nothing to do with branching, but it enables other branching
>>>>>techniques. OOOE is simply:
>>>>>  1) the code has instructions a, b, c, d, in that order
>>>>>  2) if there are no serial dependencies, then they can be executed in
>>>>>     the order b, d, c, a.
>>>>>
>>>>>  That is all OOOE is.
>>>>
>>>>I don't see how this is different from what I said. Branches are
>>>>instructions too.
>>>>
>>>>-Tom
>>>
>>>What he is saying is that whatever the hardware can do with OOO execution,
>>>the compiler can replicate by massaging the instruction stream with
>>>well-known optimization tricks, with the sole exception of register
>>>renaming.
>>>
>>>The reason OOO execution works so well on Intel is _solely_ based on the
>>>fact that the architecture has almost no registers. Renaming lets the
>>>hardware expand that number of registers _significantly_, so that the
>>>architecture can do things that other, less register-challenged
>>>architectures can do without OOO execution as a crutch...
>>>
>>>IE I can show you code for the Cray that executes an instruction every
>>>cycle that an instruction can execute, yet it is a serial-order execution
>>>processor from the ground up, but with help from a _really_ good
>>>instruction-scheduler pass that runs after the final object code has been
>>>generated... This scheduler can replicate/hoist instructions as needed,
>>>backing them up so that their result is ready the cycle it is needed...
>>
>>Some of my bitscan code for the Athlon executed a useful instruction in
>>every slot -- 3 IPC in 15-20 cycles of code. The sole enabling factor was
>>that I moved instructions everywhere. It was a nightmare to debug when I
>>accidentally moved instructions in front of their dependencies.
>>
>>One of the biggest gains I had was moving register loads a fair number of
>>cycles backward when I had free slots. This is difficult on IA-32 for
>>obvious reasons, but it works very well when you have a larger number of
>>registers.
>>
>>-Matt
>
>Right (to your last paragraph). And it is a major reason why OOOE is
>worthwhile on the x86: the hardware can rename the registers, lift the
>instructions, and execute them earlier, so that by the time the data is
>needed, it is available in a "hidden" register that has the needed data at
>the right time.
>
>No way to do that in a compiler -- unless the architecture has enough
>registers that you don't run out with all the pre-loads.

I can think of few functions that have 128 intermediate computations, even if
they do heavy computation, so the compiler usually won't need to throw away
any intermediates. Matrix manipulations and quaternion math come foremost to
mind as good examples. Spaghetti code could certainly still produce poor
performance, but that has an obvious rebuttal.

-Matt
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.