Computer Chess Club Archives

Subject: Re: 65 bits!

Author: Eugene Nalimov
Date: 09:29:16 10/22/03
On October 22, 2003 at 11:28:09, Gerd Isenberg wrote:

>On October 22, 2003 at 03:33:05, Daniel Clausen wrote:
>
>>On October 21, 2003 at 15:29:19, Eugene Nalimov wrote:
>>
>>>On Itanium integer registers are actually 65 bits wide. 64 bits for data and one
>>>NaT (Not a Thing) bit.
>>>
>>>:-)
>>>
>>>Thanks,
>>>Eugene
>>
>>If there is a way to use this bit for yourself too, I'm sure Gerd will come up
>>with another cool new algorithm! :)
>>
>>Sargon
>
>;-)
>
>I'm really not familiar with this very interesting processor architecture. It has
>an integer register file of 128! * 64+NaT. It seems well designed to do a lot of
>parallel fill cycles.
>I guess a set NaT-bit may trigger some exceptions/interrupts if you do some
>operations with uninitialized registers, allowing some low-level try-catch like
>control structures (including stack rewind?).
>
>Gerd

The NaT bit is there to support control speculation. It allows the compiler to
move a load from memory up across a branch.

Look at the following function:
    void foo (int *a, int *b, int *p, int i, int j)
    {
        if (a[i+1] == b[j+1])
            (*p) ++;
    }
Without speculation you have to check the condition first, and only then load
*p, because p may not be a valid pointer when the condition is false. With
speculation you can load *p in parallel with the computation of a[i+1] and
b[j+1].
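
At the C source level, the transformation looks roughly like the sketch below. foo_hoisted is a hypothetical name used only for illustration, not compiler output. It is only safe in plain C when p is known to be valid, which is the restriction that hardware control speculation lifts.

```c
/* Original: *p is touched only when the comparison succeeds. */
void foo(int *a, int *b, int *p, int i, int j)
{
    if (a[i+1] == b[j+1])
        (*p)++;
}

/* Hypothetical hand-hoisted version: the load of *p and the increment
   are started before the comparison. In plain C this is only safe if p
   is always a valid pointer; if p could be invalid when the condition
   is false, this early load would fault where the original would not. */
void foo_hoisted(int *a, int *b, int *p, int i, int j)
{
    int tmp = *p + 1;          /* early load and increment */
    if (a[i+1] == b[j+1])
        *p = tmp;              /* store only when the condition holds */
}
```

With a valid p both versions leave the same value in *p; they differ only in when the load is issued.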

Here is the asm file generated by Visual C:

// Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.31020

...[I removed some lines that are not interesting for this discussion]

// Begin code for function: foo:
	.proc	foo#
	.align 32
foo:
// a$ = r32
// b$ = r33
// p$ = r34
// i$ = r35
// j$ = r36
// Output regs: None
// File c:\repro\m.c
 {   .mii  //R-Addr: 0X00
	ld4.s	r29=[r34]				    //4.R	cc:0
	sxt4	r31=r36					    //3.	cc:0
	sxt4	r30=r35;;				    //3.	cc:0
 }
 {   .mii  //R-Addr: 0X010
	adds	r27=1, r31				    //3.	cc:1
	adds	r26=1, r30				    //3.	cc:1
	adds	r28=1, r29;;				    //4.R	cc:1
 }
 {   .mib  //R-Addr: 0X020
	shladd	r25=r27, 2, r33				    //3.	cc:2
	shladd	r22=r26, 2, r32				    //3.	cc:2
	nop.b	 0;;
 }
 {   .mmb  //R-Addr: 0X030
	ld4	r21=[r25]				    //3.	cc:3
	ld4	r20=[r22]				    //3.	cc:3
	nop.b	 0;;
 }
 {   .mmi  //R-Addr: 0X040
	cmp4.ne.unc p0,p6=r20, r21;;			    //3.	cc:4
  (p6)	chk.s.m	 r29, foo$2#				    //4.I	cc:5
	nop.i	 0
 }
foo$1:							    // Recovery label
 {   .mmb  //R-Addr: 0X050
  (p6)	st4	[r34]=r28				    //4.I	cc:5
	nop.m	 0
	br.ret.sptk.many b0;;				    //5 	cc:5
 }

// Scenario dead code below
foo$3:
foo$2:							    // Recovery code
 {   .mmi  //R-Addr: 0X060
	ld4	r29=[r34];;				    //4 	cc:0
	adds	r28=1, r29				    //4 	cc:1
	nop.i	 0
 }
 {   .mmb  //R-Addr: 0X070
	nop.m	 0
	nop.m	 0
	br.cond.sptk.few foo$1#;;			    //4 	cc:1
 }
// End code for function:
	.endp	foo#

As you can see, the speculative load (ld4.s) is the first instruction in the
function. If for whatever reason the speculative load fails, the NaT bit for r29
will be set.

We increment the loaded value in parallel with the other computations (6th
instruction). If the NaT bit on r29 is set, the NaT bit on r28 will be set too,
indicating that the value in that register is undefined as well.
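
The propagation rule can be modelled in a few lines of C. This is a toy sketch, not how the hardware is implemented: the Reg type and the ld4_s and adds helper names are made up here to mirror the instructions in the listing, and a NULL pointer stands in for any address the speculative load could not access.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an Itanium integer register: 64 data bits plus NaT. */
typedef struct {
    uint64_t value;
    bool     nat;   /* true: the value is "not a thing" */
} Reg;

/* Model of ld4.s: a failed speculative load defers the fault by
   setting NaT instead of raising an exception. NULL stands in for
   "an address the load could not access" (an assumption of this
   sketch). */
static Reg ld4_s(const uint32_t *addr)
{
    Reg r = { 0, false };
    if (addr == NULL)
        r.nat = true;
    else
        r.value = *addr;
    return r;
}

/* Model of adds: NaT on the source propagates to the destination. */
static Reg adds(uint64_t imm, Reg src)
{
    Reg r = { src.value + imm, src.nat };
    return r;
}
```

In this model, adds(1, ld4_s(NULL)) yields a register with nat set, just as r28 inherits the NaT bit from r29 in the listing.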

The instruction at address 0x40 compares a[i+1] and b[j+1]. Here you can see
another Itanium architectural feature -- predication. There is no branch (and no
potential branch misprediction); instead, two instructions are predicated by
predicate register p6. That register is set if a[i+1]==b[j+1].

So, if a[i+1]==b[j+1] we execute
  (p6)	chk.s.m	 r29, foo$2#				    //4.I	cc:5
  (p6)	st4	[r34]=r28				    //4.I	cc:5

The first instruction checks the NaT bit on r29, and if the register contains an
undefined value, control goes to the label foo$2. As you can see, the "recovery
code" there unconditionally reloads the value from memory, increments it, and
branches back into the main function body.
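
The control flow of the speculative listing can be emulated in C as a rough sketch. foo_emulated, spec_ok and recoveries are hypothetical names introduced here: spec_ok plays the role of "the ld4.s succeeded" (in hardware that information lives in the NaT bit of r29), and recoveries only counts how often the recovery path runs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical C emulation of the speculative listing for foo.
   Returns the value of predicate p6. */
int foo_emulated(const int *a, const int *b, int *p, int i, int j,
                 bool spec_ok, int *recoveries)
{
    int  r29 = 0;
    bool r29_nat = !spec_ok;        /* ld4.s r29=[r34] */
    if (spec_ok)
        r29 = *p;
    int  r28 = r29 + 1;             /* adds r28=1, r29 (NaT follows r29) */

    bool p6 = (a[i+1] == b[j+1]);   /* cmp4.ne.unc p0,p6=r20, r21 */

    if (p6 && r29_nat) {            /* (p6) chk.s.m r29, foo$2 */
        r29 = *p;                   /* recovery: ld4 r29=[r34]  */
        r28 = r29 + 1;              /*           adds r28=1, r29 */
        ++*recoveries;
    }
    if (p6)                         /* (p6) st4 [r34]=r28 */
        *p = r28;
    return p6;
}
```

Note that when the comparison fails, neither the check nor the store runs, so a failed speculative load costs nothing and p is never dereferenced non-speculatively.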

Here is what happens if I use the (debug) flag that forces the compiler not to
use control speculation:

foo:
// a$ = r32
// b$ = r33
// p$ = r34
// i$ = r35
// j$ = r36
// Output regs: None
// File c:\repro\m.c
 {   .mii  //R-Addr: 0X00
	nop.m	 0
	sxt4	r31=r36					    //3.	cc:0
	sxt4	r30=r35;;				    //3.	cc:0
 }
 {   .mii  //R-Addr: 0X010
	adds	r29=1, r31				    //3.	cc:1
	adds	r28=1, r30;;				    //3.	cc:1
	shladd	r27=r29, 2, r33				    //3.	cc:2
 }
 {   .mmi  //R-Addr: 0X020
	shladd	r26=r28, 2, r32;;			    //3.	cc:2
	ld4	r25=[r27]				    //3.	cc:3
	nop.i	 0
 }
 {   .mmi  //R-Addr: 0X030
	ld4	r22=[r26];;				    //3.	cc:3
	cmp4.ne.unc p0,p6=r22, r25			    //3.	cc:4
	nop.i	 0;;
 }
 {   .mmi  //R-Addr: 0X040
  (p6)	ld4.bias r21=[r34];;				    //4.I	cc:5
  (p6)	adds	r20=1, r21				    //4.I	cc:6
	nop.i	 0;;
 }
 {   .mmb  //R-Addr: 0X050
  (p6)	st4	[r34]=r20				    //4.I	cc:7
	nop.m	 0
	br.ret.sptk.many b0;;				    //5 	cc:7
 }

As you can see, the load and increment happen only after we are sure that
a[i+1]==b[j+1]. As a result, in the "normal" case the function takes 2 extra
cycles ("cc" is the compiler's estimate of the cycle count).

Thanks,
Eugene
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.