Author: Eugene Nalimov
Date: 09:29:16 10/22/03
On October 22, 2003 at 11:28:09, Gerd Isenberg wrote:
>On October 22, 2003 at 03:33:05, Daniel Clausen wrote:
>
>>On October 21, 2003 at 15:29:19, Eugene Nalimov wrote:
>>
>>>On Itanium integer registers are actually 65 bits wide. 64 bits for data and one
>>>NAT (not a thing) bit.
>>>
>>>:-)
>>>
>>>Thanks,
>>>Eugene
>>
>>If there is a way to use this bit for yourself too, I'm sure Gerd will come up
>>with another cool new algorithm! :)
>>
>>Sargon
>
>;-)
>
>I'm really not familiar with this very interesting processor architecture. It has
>an integer register file of 128! * 64+NaT. It seems well designed to do a lot of
>parallel fill cycles.
>I guess a set NaT-bit may trigger some exceptions/interrupts if you do some
>operations with uninitialized registers, allowing some low-level try-catch like
>control structures (including stack rewind?).
>
>Gerd
The NaT bit is there to support control speculation. It allows the compiler to move
a load from memory across a branch.
Look at the following function:
void foo (int *a, int *b, int *p, int i, int j)
{
    if (a[i+1] == b[j+1])
        (*p) ++;
}
Without speculation you have to first check the condition, and only then load
*p. With speculation you can load *p in parallel with computations of a[i+1] and
b[j+1].
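To make the problem concrete, here is roughly what moving the load above the branch would look like if you tried it by hand in C. This is only a sketch, not compiler output, and foo_hoisted is a name I made up for illustration. Note the hazard: when the condition is false the original foo never touches *p, but the hoisted version reads it anyway and would fault if p were invalid. That is why the hardware needs a load that can defer its fault.

```c
/* Hypothetical hand-hoisted version of foo(); for illustration only. */
void foo_hoisted(int *a, int *b, int *p, int i, int j)
{
    /* Load and increment moved above the condition.
       Unsafe in plain C: this dereferences p even when the
       condition below turns out to be false. */
    int tmp = *p + 1;

    if (a[i + 1] == b[j + 1])
        *p = tmp;   /* only the store stays under the condition */
}
```

With a valid p this behaves exactly like foo; the difference shows up only when p cannot be read.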
Here is the asm file generated by Visual C for foo:
// Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.31020
...[I removed some lines that are not interesting for this discussion]
// Begin code for function: foo:
.proc foo#
.align 32
foo:
// a$ = r32
// b$ = r33
// p$ = r34
// i$ = r35
// j$ = r36
// Output regs: None
// File c:\repro\m.c
{ .mii //R-Addr: 0X00
ld4.s r29=[r34] //4.R cc:0
sxt4 r31=r36 //3. cc:0
sxt4 r30=r35;; //3. cc:0
}
{ .mii //R-Addr: 0X010
adds r27=1, r31 //3. cc:1
adds r26=1, r30 //3. cc:1
adds r28=1, r29;; //4.R cc:1
}
{ .mib //R-Addr: 0X020
shladd r25=r27, 2, r33 //3. cc:2
shladd r22=r26, 2, r32 //3. cc:2
nop.b 0;;
}
{ .mmb //R-Addr: 0X030
ld4 r21=[r25] //3. cc:3
ld4 r20=[r22] //3. cc:3
nop.b 0;;
}
{ .mmi //R-Addr: 0X040
cmp4.ne.unc p0,p6=r20, r21;; //3. cc:4
(p6) chk.s.m r29, foo$2# //4.I cc:5
nop.i 0
}
foo$1: // Recovery label
{ .mmb //R-Addr: 0X050
(p6) st4 [r34]=r28 //4.I cc:5
nop.m 0
br.ret.sptk.many b0;; //5 cc:5
}
// Scenario dead code below
foo$3:
foo$2: // Recovery code
{ .mmi //R-Addr: 0X060
ld4 r29=[r34];; //4 cc:0
adds r28=1, r29 //4 cc:1
nop.i 0
}
{ .mmb //R-Addr: 0X070
nop.m 0
nop.m 0
br.cond.sptk.few foo$1#;; //4 cc:1
}
// End code for function:
.endp foo#
As you can see, the speculative load is the first instruction in the function. If for
whatever reason the speculative load fails, the NaT bit for r29 will be set.
We increment the loaded value in parallel with the other computations (6th instruction).
If the NaT bit on r29 is set, the NaT bit on r28 will be set too, indicating that the value
in that register is also undefined.
The instruction at address 0x40 compares a[i+1] and b[j+1]. Here you can see another
Itanium architectural feature -- predication. There is no branch (and no
potential branch misprediction); instead, two instructions are predicated by
predicate register p6. That register is set if a[i+1]==b[j+1].
So, if a[i+1]==b[j+1] we execute
(p6) chk.s.m r29, foo$2# //4.I cc:5
(p6) st4 [r34]=r28 //4.I cc:5
The first instruction checks the NaT bit on r29, and if the register contains an undefined
value, control goes to the label foo$2. As you can see, the "recovery code" there
unconditionally reloads the value from memory, increments it, and branches back
into the main function body.
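If it helps, the whole sequence can be modelled in plain C. This is a rough software sketch under my own assumptions, not how the hardware or the compiler implements it: reg_t, ld4_s, adds, foo_model and the defer flag are all hypothetical names, and a NULL pointer (or defer set to true) simply stands in for "the speculative load failed".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an integer register: 32-bit payload plus NaT bit. */
typedef struct {
    int  value;
    bool nat;    /* true = "Not a Thing", value is undefined */
} reg_t;

/* Models ld4.s: a failing load sets NaT instead of trapping.
   `defer` is an illustrative knob that forces a deferred fault. */
static reg_t ld4_s(const int *addr, bool defer)
{
    reg_t r = { 0, false };
    if (addr == NULL || defer)
        r.nat = true;
    else
        r.value = *addr;
    return r;
}

/* Models adds: the NaT bit propagates from source to destination. */
static reg_t adds(int imm, reg_t src)
{
    reg_t r = { src.nat ? 0 : src.value + imm, src.nat };
    return r;
}

/* Models the speculative foo(); returns true if the recovery code ran. */
static bool foo_model(int *a, int *b, int *p, int i, int j, bool defer)
{
    bool recovered = false;
    reg_t r29 = ld4_s(p, defer);        /* ld4.s r29=[r34]        */
    reg_t r28 = adds(1, r29);           /* adds  r28=1, r29       */
    bool  p6  = (a[i + 1] == b[j + 1]); /* cmp4.ne.unc p0,p6=...  */

    if (p6 && r29.nat) {                /* (p6) chk.s.m r29, foo$2 */
        r29.value = *p;                 /* recovery: ordinary ld4  */
        r29.nat   = false;
        r28       = adds(1, r29);
        recovered = true;
    }
    if (p6)                             /* (p6) st4 [r34]=r28      */
        *p = r28.value;
    return recovered;
}
```

The point of the model: when the condition is false, a failed speculative load costs nothing and nobody ever looks at r29; when the condition is true and the load failed, the recovery path redoes the load non-speculatively, so any real fault is raised there.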
Here is what happens if I use the (debug) flag that forces the compiler not to use
control speculation:
foo:
// a$ = r32
// b$ = r33
// p$ = r34
// i$ = r35
// j$ = r36
// Output regs: None
// File c:\repro\m.c
{ .mii //R-Addr: 0X00
nop.m 0
sxt4 r31=r36 //3. cc:0
sxt4 r30=r35;; //3. cc:0
}
{ .mii //R-Addr: 0X010
adds r29=1, r31 //3. cc:1
adds r28=1, r30;; //3. cc:1
shladd r27=r29, 2, r33 //3. cc:2
}
{ .mmi //R-Addr: 0X020
shladd r26=r28, 2, r32;; //3. cc:2
ld4 r25=[r27] //3. cc:3
nop.i 0
}
{ .mmi //R-Addr: 0X030
ld4 r22=[r26];; //3. cc:3
cmp4.ne.unc p0,p6=r22, r25 //3. cc:4
nop.i 0;;
}
{ .mmi //R-Addr: 0X040
(p6) ld4.bias r21=[r34];; //4.I cc:5
(p6) adds r20=1, r21 //4.I cc:6
nop.i 0;;
}
{ .mmb //R-Addr: 0X050
(p6) st4 [r34]=r20 //4.I cc:7
nop.m 0
br.ret.sptk.many b0;; //5 cc:7
}
As you can see, the load and increment happen only after we are sure that
a[i+1]==b[j+1]. As a result, in the "normal" case the function takes 2 extra cycles
("cc" is the compiler's estimate of the cycle count).
Thanks,
Eugene