Author: Gerd Isenberg
Date: 11:04:01 10/22/03
Go up one level in this thread
<snip>
>NaT bit is there to support control speculation. It allows compiler to move load
>from memory across the branch.
>
>Look for the following function:
> void foo (int *a, int *b, int *p, int i, int j)
> {
> if (a[i+1] == b[j+1])
> (*p) ++;
> }
>Without speculation you have to first check the condition, and only then load
>*p. With speculation you can load *p in parallel with computations of a[i+1] and
>b[j+1].
Interesting.
>
>Here is asm file generated by Visual C:
>
>// Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.31020
>
>...[I removed some lines that are not interesting for this discussion]
>
>// Begin code for function: foo:
> .proc foo#
> .align 32
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> { .mii //R-Addr: 0X00
> ld4.s r29=[r34] // *p //4.R cc:0
> sxt4 r31=r36 // j //3. cc:0
> sxt4 r30=r35;; // i //3. cc:0
> }
Does the 4 mean bytesize == 32bit?
> { .mii //R-Addr: 0X010
> adds r27=1, r31 // j+1 //3. cc:1
> adds r26=1, r30 // i+1 //3. cc:1
> adds r28=1, r29;; // (*p)++ //4.R cc:1
> }
> { .mib //R-Addr: 0X020
> shladd r25=r27, 2, r33 //3. cc:2
> shladd r22=r26, 2, r32 //3. cc:2
// r22 = r32 + (r26 << 2)
// ==> (int*)a + i + 1
i guess nops for 16-byte alignment?
> nop.b 0;;
> }
> { .mmb //R-Addr: 0X030
> ld4 r21=[r25] //3. cc:3
> ld4 r20=[r22] //3. cc:3
> nop.b 0;;
> }
> { .mmi //R-Addr: 0X040
> cmp4.ne.unc p0,p6=r20, r21;; //3. cc:4
> (p6) chk.s.m r29, foo$2# //4.I cc:5
> nop.i 0
> }
>foo$1: // Recovery label
> { .mmb //R-Addr: 0X050
> (p6) st4 [r34]=r28 //4.I cc:5
> nop.m 0
> br.ret.sptk.many b0;; //5 cc:5
> }
>
>// Scenario dead code below
>foo$3:
>foo$2: // Recovery code
> { .mmi //R-Addr: 0X060
> ld4 r29=[r34];; //4 cc:0
> adds r28=1, r29 //4 cc:1
> nop.i 0
> }
> { .mmb //R-Addr: 0X070
> nop.m 0
> nop.m 0
> br.cond.sptk.few foo$1#;; //4 cc:1
> }
>// End code for function:
> .endp foo#
>
>As you can see, speculative load is first instruction in the function. if for
>whatever reason speculative load failed, NaT bit for r29 will be set.
I see, e.g. NaT got set if p == NULL.
>
>We increment loaded value in parallel with other computations (6th instruction).
>If NaT bit on r29 is set, NaT bit on r28 will be set, too, indicating that value
>in that register is undefined, too.
>
>Instruction at address 0x40 compares a[i+1] and b[j+1]. Here you can see another
>Itanium architectural feature -- predication. There is no branch (and no
>potential branch misprediction), instead two instructions are predicated by
>predicate register p6. That register is set if a[i+1]==b[j+1].
>
>So, if a[i+1]==b[j+1] we execute
> (p6) chk.s.m r29, foo$2# //4.I cc:5
> (p6) st4 [r34]=r28 //4.I cc:5
>
>First instruction checks NaT bit on r29, and if register contains undefined
>value, control goes to the label foo$2. As you can see, "recovery code" there
>unconditionally reloads value from memory, increments, it, and branches back
>into main function body.
>
>Here is what happens if I use the (debug) flag that forces compiler not to use
>control speculation:
>
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> { .mii //R-Addr: 0X00
> nop.m 0
> sxt4 r31=r36 //3. cc:0
> sxt4 r30=r35;; //3. cc:0
> }
> { .mii //R-Addr: 0X010
> adds r29=1, r31 //3. cc:1
> adds r28=1, r30;; //3. cc:1
> shladd r27=r29, 2, r33 //3. cc:2
> }
> { .mmi //R-Addr: 0X020
> shladd r26=r28, 2, r32;; //3. cc:2
> ld4 r25=[r27] //3. cc:3
> nop.i 0
> }
> { .mmi //R-Addr: 0X030
> ld4 r22=[r26];; //3. cc:3
> cmp4.ne.unc p0,p6=r22, r25 //3. cc:4
> nop.i 0;;
> }
> { .mmi //R-Addr: 0X040
> (p6) ld4.bias r21=[r34];; //4.I cc:5
> (p6) adds r20=1, r21 //4.I cc:6
> nop.i 0;;
> }
> { .mmb //R-Addr: 0X050
> (p6) st4 [r34]=r20 //4.I cc:7
> nop.m 0
> br.ret.sptk.many b0;; //5 cc:7
> }
>
>As you see, load and increment happens only after we are sure that
>a[i+1]==b[j+1]. As a result, in "normal" case function takes 2 extra cycles
>("cc" is compiler's estimation of cycle count).
>
>Thanks,
>Eugene
And that thing has 128 64-bit registers and MMX too.
I should have a closer look at it, seems to be a perfect
architecture for Fill based bitboard programs ;-)
Thanks for the lesson!
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.