Author: Robert Hyatt
Date: 12:37:05 02/12/03
On February 12, 2003 at 14:54:37, Matt Taylor wrote:

>On February 12, 2003 at 11:35:46, Robert Hyatt wrote:
>
>>On February 12, 2003 at 03:13:27, Matt Taylor wrote:
>>
>>>On February 12, 2003 at 00:23:53, Robert Hyatt wrote:
>>>
>>>>On February 11, 2003 at 23:27:04, Tom Kerrigan wrote:
>>>>
>>>>>On February 11, 2003 at 23:11:09, Charles Roberson wrote:
>>>>>
>>>>>> Out-of-order execution is nothing more than the ability to execute
>>>>>>instructions in an order different from the serial order in the code.
>>>>>>It has nothing to do with branching, but it enables other branching techniques.
>>>>>>OOOE is simply:
>>>>>> 1) the code has instructions a, b, c, d, in that order
>>>>>> 2) if there are no serial dependencies, then they can be executed in the
>>>>>> b, d, c, a order.
>>>>>>
>>>>>> That is all OOOE is.
>>>>>
>>>>>I don't see how this is different from what I said. Branches are instructions
>>>>>too.
>>>>>
>>>>>-Tom
>>>>
>>>>What he is saying is that whatever the hardware can do with OOO execution,
>>>>the compiler can replicate by massaging the instruction stream with
>>>>well-known optimization tricks, with the sole exception of register renaming.
>>>>
>>>>The reason OOO execution works so well on Intel is _solely_ based on the
>>>>fact that the architecture has almost no registers. Renaming lets the
>>>>hardware expand that number of registers _significantly_, so that the
>>>>architecture can do things that other, less register-challenged architectures
>>>>can do without OOO execution as a crutch...
>>>>
>>>>IE I can show you code for the Cray that executes an instruction every cycle
>>>>that an instruction can execute, yet it is a serial-order execution processor
>>>>from the ground up, but with help from a _really_ good instruction-scheduler
>>>>pass after the final object code has been generated... This scheduler can
>>>>replicate/hoist instructions as needed to back them up to the point that their
>>>>result is ready the cycle it is needed...
>>>
>>>Some of my bitscan code for the Athlon executed a useful instruction in every
>>>slot -- 3 IPC in 15-20 cycles of code. The sole enabling factor was the fact
>>>that I moved instructions everywhere. It was a nightmare to debug when I
>>>accidentally moved instructions in front of their dependencies.
>>>
>>>One of the biggest gains I had was moving register loads a fair number of
>>>cycles backward when I had free slots. This is difficult on IA-32 for obvious
>>>reasons, but it works very well when you have a larger number of registers.
>>>
>>>-Matt
>>
>>Right (to your last paragraph). And it is a major reason why OOOE is
>>worthwhile on the X86: the hardware can rename the registers, lift the
>>instructions, and execute them earlier, so that by the time the data is
>>needed, it is available in a "hidden" register that suddenly has the needed
>>data at the right time.
>>
>>There is no way to do that in a compiler, unless the architecture has enough
>>registers that you don't run out with all the pre-loads.
>
>I can think of few functions that have 128 intermediate computations, even if
>they do heavy computation. The compiler often won't need to throw away any
>intermediates. Matrix manipulations and quaternion math come foremost to mind
>as good examples. Spaghetti code wouldn't have too much trouble creating poor
>performance, but that also has an obvious rebuttal.
>
>-Matt

Right, but on the Cray T90, for example, a memory read takes 50 cycles. So I
need to "hoist" the memory load back up the I-stream at least 50 instructions.
That means that for the next 50 instructions, that register is "off limits"
for use, since it already has a read scheduled and pending. When you have that
kind of delay, you can tie up a bunch of registers for a long period of time...
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.