Computer Chess Club Archives



Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Matt Taylor

Date: 14:02:02 02/12/03


On February 12, 2003 at 04:55:59, Tom Kerrigan wrote:

>On February 12, 2003 at 00:37:13, Robert Hyatt wrote:
>
>>No No No.  They do much of this with 100% accuracy.  Because they make sure
>>that the critical instructions are executed in _every_ path that reaches a
>>critical point in the data-flow analysis of the program (the dependency graph
>>for gcc users)...
>
>You're not making any sense. You have a branch. You have two possible control
>paths. The instructions in each path are different. Which ones do you advance?

The compiler can make better predictions because it can see the relative
probabilities. Static prediction can only guess, albeit with good accuracy
thanks to trends in control flow. Dynamic prediction is still limited in this
way. I can write code that deliberately breaks dynamic prediction so it
mispredicts every single time. In fact, it's relatively easy to do:

void (*funcs[])(void) = {...};

for (i = 0; i < (sizeof(funcs) / sizeof(*funcs)); i++)
    funcs[i]();

That will break branch prediction on any IA-32 chip. In fact, I don't know of
any branch prediction implementation that can compensate. (There may be one,
but I stick to the software side of things.)

>>BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
>>manuals here at home) micro-ops in the reorder buffer.  No way to do any
>>OOOE beyond that very narrow peephole, while the compiler can see _much_
>>more, as much as it wants (and has the compile time) to look at...
>
>Alright. So run compiled code on your OOO processor.
>
>>registers when the real instructions get turned into micro-ops...  but at
>>least the latter is more a result of a horrible architecture (8 registers)
>>as opposed to the fact the OOO execution is a huge boon for other architectures
>>that are not so register-challenged...
>
>Funny, my 30% number was for the Alpha and MIPS chips. I wouldn't consider them
>register challenged.

When code is properly scheduled, OOOE has minimal impact. Every time I count
clocks in Athlon assembly under the assumption that no OOOE is occurring, I get
theoretical performance figures equal to actual performance, or within a small
margin of error (+/- 5%) of it.

OOOE is very helpful on IA-32 in general because IA-32 optimization strategies
have changed dramatically since the days of the 386. Code on IA-32 is rarely
optimized for a specific instruction blend.

>>Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
>>int alus and an instruction package large enough to feed them all, I would
>>consider the latter seriously...
>
>We have chips today with 9 execution units that retire, on average, one
>instruction per cycle, and you think you can fill 16 in slots?

Partly true. Athlon and Pentium 4 can execute up to 9 ops/cycle. (On the
Athlon, one ALU instruction maps to one macro-op, whereas the
instruction-to-micro-op relationship on P6/P7 is not always 1:1.) The Athlon
has 9 execution units, only 3 of which are ALUs. The other 6 resources go to FP
units (2) and MMX/3DNow units (4). The Pentium 4 has 3 ALUs that can process up
to 5 ops/cycle; the remaining 4 units cover FP, MMX, and SSE. You neglected the
fact that no IA-32 chip can decode more than 3 instructions/cycle, so it is
limited to 3 IPC at best.

Also, if the Pentium 4 is retiring 1 IPC, the Athlon has to be retiring at least
1.5 IPC to keep pace at a lower clockrate. The actual performance relationship
is debatable, but I think 2:3 is pretty reasonable. The Athlon has IPC
comparable to the P6, though slightly higher. The Pentium 4 1.5 GHz was
reportedly as fast as a Pentium 3 1 GHz, and the latest Pentium 4 3.06 GHz still
runs close to the AthlonXP 3000 (2.13 GHz).

>>The Cray T932 was the last 64 bit machine they built that I used.  And it
>>can produce a FLOP count that no PC on the planet can come within a factor of
>>10 of and that is being very generous.  2ns clock, 32 cpus, each cpu can read
>>four words and write two words to memory per clock cycle, and with vector
>>chaining, it can do at _least_ eight floating point operations per cycle per
>>CPU.
>
>How many NPS does Crafty get on it?
>
>>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>>It used a cmov, and it wasn't slower than the one with a branch.  If the
>
>On a Pentium III?

The Pentium 3 decodes cmov into 2 u-ops, plus 1 load op when a memory operand
is used. This is 3 cycles at worst. The original Pentium had penalties higher
than this for a branch mispredict, and the P6 uses a much deeper pipeline.

I timed the reg-reg form on my P2 350 MHz. Results are given for 32
instructions:
throughput: 63
latency: 65

So 2 cycles for cmov reg-reg. Not terribly expensive. If I understand correctly,
I can meanwhile issue other unrelated ops (OOOE plays a role here too).

-Matt



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.