Computer Chess Club Archives


Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Matt Taylor

Date: 22:44:51 02/12/03



On February 12, 2003 at 16:03:21, Tom Kerrigan wrote:

>On February 12, 2003 at 11:33:24, Robert Hyatt wrote:
>
>>>Of course it's related. Compilers have to rely on static branch prediction (80%
>>>accuracy) if they're going to effectively advance instructions before
>>You keep saying this but it isn't true.  I can pull _any_ instruction and insert
>>it before
>>a branch, assuming it is architecturally feasible.  On a sparc, with 32 GP
>
>Okay, so let's say you advance instructions from both branches. Maybe you have
>enough execution resources to do that, i.e., you have enough slots to handle all
>of the ILP before and after the branch (both paths). What about the branches
>after that? It's not uncommon for Pentiums to pair instructions with 2 or 3
>branches between them. You have to run into issue constraints at some point.
>
>Occam's razor. Why is every high performance processor these days OOO, except
>for IA-64 and SPARC? MIPS, POWER, Alpha, PA-RISC. None of these chips are
>register starved and there are excellent compilers for all of them (esp.
>PA-RISC) but the chip designers still saw value in going way out of their way to
>make them OOO. That's not easy. They must have seen some value in it. Do you
>think they just did it on a whim? And is it just a coincidence that the US3 is
>so slow?
>
>>>Indeed. It's a shame only IA-64 chips run compiled code... oh, wait...
>>Notice his point, however.  The OOOE can only execute what it can "see".  Which
>>is
>>typically a pretty narrow "peephole" into the machine language instructions.
>>Compilers
>
>I understand his point but you don't understand mine. The compiler tricks you're
>discussing also increase performance of OOO processors. I mean, OOO processors
>don't have special logic to stall on instructions that are too far apart in the
>source code.

No, but doing OOOE optimizations at compile time makes OOOE hardware at runtime
less useful, because the compiler has already done what the OOOE silicon would
have done. If the compiler schedules well, there is little extra performance
left for the OOOE hardware to extract.

>>>>No. Predication is the IA-64's answer to branch prediction. Predication is
>>>>completely unrelated to OOOE.
>>>What, exactly, do you think the point of predication is, then? It's to allow
>>>instructions to execute before the condition is determined, in other words, out
>>>of order. (Or at least in order without being dependent.) If you think
>>But they _are_ different.  Predication just says "do all of this crap and we'll
>>sort out later
>>which was crap and which was important."  A compiler can do this on an old 286,
>
>The point of predication is to eliminate dependency on a branch. How can a
>compiler do this? In other words, how can a compiler say "we'll sort out later
>which was crap and which was important" without a branch on a non-predicated
>ISA?
<snip>

Some sequences are preferable to others, but all are legitimate examples
demonstrating a couple of x86 tricks. I don't think anyone here will argue that
IA-32 is a predicated ISA.

; abs(i), i = eax, result = eax
abs:
        cdq
        xor    eax, edx
        sub    eax, edx
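
The cdq/xor/sub sequence maps directly to portable C. A minimal sketch (the
function name is mine; it assumes the usual arithmetic right shift on signed
values, which every mainstream compiler provides):

```c
#include <stdint.h>

/* Branchless abs, same trick as the cdq/xor/sub sequence above.
   The shift replicates the sign bit, so mask is -1 when i is negative
   and 0 otherwise; (i ^ mask) - mask then flips the bits and adds 1,
   i.e. negates, exactly when i was negative. */
int32_t abs_branchless(int32_t i)
{
    int32_t mask = i >> 31;      /* cdq: -1 if i < 0, else 0 */
    return (i ^ mask) - mask;    /* xor eax, edx / sub eax, edx */
}
```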

; abs(i), i = eax, result = eax
abs:
        xor    edx, edx
        cmp    eax, 0
        setge  dl
        dec    edx
        xor    eax, edx
        sub    eax, edx

; abs(i), i = eax, result = eax
abs:
        mov    edx, eax
        neg    edx
        cmp    eax, 0
        cmovl  eax, edx

; min(x,y), x = eax, y = edx, result = eax
min:
        cmp    eax, edx
        cmovg  eax, edx

; min(x,y) -- unsigned compare (sbb consumes CF), x = eax, y = edx, result = eax
min:
        cmp    eax, edx
        sbb    ecx, ecx
        and    eax, ecx
        not    ecx
        and    edx, ecx
        or     eax, edx

; max(x,y) -- unsigned compare (sbb consumes CF), x = eax, y = edx, result = eax
max:
        cmp    edx, eax
        sbb    ecx, ecx
        and    eax, ecx
        not    ecx
        and    edx, ecx
        or     eax, edx
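
The cmp/sbb/and/or sequences are the mask idiom in asm form; in C it looks
like this (a sketch with names of my choosing; unsigned types to match the
carry-flag comparison the sbb versions perform):

```c
#include <stdint.h>

/* Branchless min/max via a comparison mask, mirroring the cmp/sbb
   sequences above. mask is all-ones when x < y, all-zeros otherwise,
   so exactly one operand survives the and/or merge. */
uint32_t min_branchless(uint32_t x, uint32_t y)
{
    uint32_t mask = -(uint32_t)(x < y);   /* cmp / sbb ecx, ecx */
    return (x & mask) | (y & ~mask);
}

uint32_t max_branchless(uint32_t x, uint32_t y)
{
    uint32_t mask = -(uint32_t)(y < x);
    return (x & mask) | (y & ~mask);
}
```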

I particularly like min & max as they use tricks available on most
architectures. The cdq can be emulated by copying eax to edx and shifting with
sar edx, 31. These tricks apply to any code that does a small amount of
computation inside branches. Here is an example similar to one Vincent gave me
a long time ago:

// some computation that doesn't need a lot of exec. resources
// (window is computed by mispredict probability)
if (a < b && a != c && (b & c) == a)
    eval += (a + b) * 3;

Written in x86 asm without using setcc or cmovcc, to demonstrate a universal
technique:

; Compute condition (setcc also works nice on x86)
; I have avoided hoisting & scheduling optimizations to make this code clear.
lea    val, [a+b]
lea    val, [val+val*2]

cmp    a, b
sbb    mask, mask

and    b, c
sub    b, a

; Equality comparisons are annoying.
sub    c, a
neg    c
sbb    c, c
and    mask, c

; Normally this uses andn, but x86 has no integer andn
neg    b
sbb    b, b
not    b
and    mask, b

and    val, mask
add    eval, val

I count 16 instructions, which leaves a lot of room for static OOOE
scheduling. This code can probably hit somewhere between 2 and 3 IPC,
potentially higher on other architectures. I've used only basic ALU ops --
not, and, negate, subtract, subtract with borrow, compare, and lea tricks
equivalent to a three-operand add.
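
For reference, the whole branch-free update can be sketched in C (function and
parameter names are mine; unsigned types match the carry-based a < b compare
in the asm):

```c
#include <stdint.h>

/* Branchless version of:
       if (a < b && a != c && (b & c) == a)
           eval += (a + b) * 3;
   Each condition becomes an all-ones/all-zeros mask; the masks are
   ANDed together to gate the addend, as in the asm sequence above. */
uint32_t eval_update(uint32_t eval, uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t val = (a + b) * 3;                /* lea / lea */
    uint32_t m1  = -(uint32_t)(a < b);         /* cmp / sbb */
    uint32_t m2  = -(uint32_t)(a != c);        /* sub / neg / sbb */
    uint32_t m3  = -(uint32_t)((b & c) == a);  /* neg / sbb / not */
    return eval + (val & m1 & m2 & m3);
}
```

When all three conditions hold, the masks are all -1 and the full addend
passes through; otherwise the and-chain zeroes it out.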

-Matt


