Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: 65 bits!

Author: Gerd Isenberg
Date: 11:04:01 10/22/03
<snip>

>NaT bit is there to support control speculation. It allows compiler to move load
>from memory across the branch.
>
>Look for the following function:
>    void foo (int *a, int *b, int *p, int i, int j)
>    {
>        if (a[i+1] == b[j+1])
>            (*p) ++;
>    }
>Without speculation you have to first check the condition, and only then load
>*p. With speculation you can load *p in parallel with computations of a[i+1] and
>b[j+1].


Interesting.

>
>Here is asm file generated by Visual C:
>
>// Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.31020
>
>...[I removed some lines that are not interesting for this discussion]
>
>// Begin code for function: foo:
>	.proc	foo#
>	.align 32
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> {   .mii  //R-Addr: 0X00
>	ld4.s	r29=[r34]       // *p			    //4.R	cc:0
>	sxt4	r31=r36		// j			    //3.	cc:0
>	sxt4	r30=r35;;	// i			    //3.	cc:0
> }

Does the 4 mean bytesize == 32bit?


> {   .mii  //R-Addr: 0X010
>	adds	r27=1, r31	// j+1			    //3.	cc:1
>	adds	r26=1, r30	// i+1			    //3.	cc:1
>	adds	r28=1, r29;;	// (*p)++		    //4.R	cc:1
> }
> {   .mib  //R-Addr: 0X020
>	shladd	r25=r27, 2, r33				    //3.	cc:2
>	shladd	r22=r26, 2, r32				    //3.	cc:2

// r22 = r32 + (r26 << 2)
// ==>  (int*)a + i + 1


i guess nops for 16-byte alignment?

>	nop.b	 0;;
> }
> {   .mmb  //R-Addr: 0X030
>	ld4	r21=[r25]				    //3.	cc:3
>	ld4	r20=[r22]				    //3.	cc:3
>	nop.b	 0;;
> }
> {   .mmi  //R-Addr: 0X040
>	cmp4.ne.unc p0,p6=r20, r21;;			    //3.	cc:4
>  (p6)	chk.s.m	 r29, foo$2#				    //4.I	cc:5
>	nop.i	 0
> }
>foo$1:							    // Recovery label
> {   .mmb  //R-Addr: 0X050
>  (p6)	st4	[r34]=r28				    //4.I	cc:5
>	nop.m	 0
>	br.ret.sptk.many b0;;				    //5 	cc:5
> }
>
>// Scenario dead code below
>foo$3:
>foo$2:							    // Recovery code
> {   .mmi  //R-Addr: 0X060
>	ld4	r29=[r34];;				    //4 	cc:0
>	adds	r28=1, r29				    //4 	cc:1
>	nop.i	 0
> }
> {   .mmb  //R-Addr: 0X070
>	nop.m	 0
>	nop.m	 0
>	br.cond.sptk.few foo$1#;;			    //4 	cc:1
> }
>// End code for function:
>	.endp	foo#
>
>As you can see, speculative load is first instruction in the function. if for
>whatever reason speculative load failed, NaT bit for r29 will be set.


I see, e.g. NaT got set if p == NULL.


>
>We increment loaded value in parallel with other computations (6th instruction).
>If NaT bit on r29 is set, NaT bit on r28 will be set, too, indicating that value
>in that register is undefined, too.
>
>Instruction at address 0x40 compares a[i+1] and b[j+1]. Here you can see another
>Itanium architectural feature -- predication. There is no branch (and no
>potential branch misprediction), instead two instructions are predicated by
>predicate register p6. That register is set if a[i+1]==b[j+1].
>
>So, if a[i+1]==b[j+1] we execute
>  (p6)	chk.s.m	 r29, foo$2#				    //4.I	cc:5
>  (p6)	st4	[r34]=r28				    //4.I	cc:5
>
>First instruction checks NaT bit on r29, and if register contains undefined
>value, control goes to the label foo$2. As you can see, "recovery code" there
>unconditionally reloads value from memory, increments, it, and branches back
>into main function body.
>
>Here is what happens if I use the (debug) flag that forces compiler not to use
>control speculation:
>
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> {   .mii  //R-Addr: 0X00
>	nop.m	 0
>	sxt4	r31=r36					    //3.	cc:0
>	sxt4	r30=r35;;				    //3.	cc:0
> }
> {   .mii  //R-Addr: 0X010
>	adds	r29=1, r31				    //3.	cc:1
>	adds	r28=1, r30;;				    //3.	cc:1
>	shladd	r27=r29, 2, r33				    //3.	cc:2
> }
> {   .mmi  //R-Addr: 0X020
>	shladd	r26=r28, 2, r32;;			    //3.	cc:2
>	ld4	r25=[r27]				    //3.	cc:3
>	nop.i	 0
> }
> {   .mmi  //R-Addr: 0X030
>	ld4	r22=[r26];;				    //3.	cc:3
>	cmp4.ne.unc p0,p6=r22, r25			    //3.	cc:4
>	nop.i	 0;;
> }
> {   .mmi  //R-Addr: 0X040
>  (p6)	ld4.bias r21=[r34];;				    //4.I	cc:5
>  (p6)	adds	r20=1, r21				    //4.I	cc:6
>	nop.i	 0;;
> }
> {   .mmb  //R-Addr: 0X050
>  (p6)	st4	[r34]=r20				    //4.I	cc:7
>	nop.m	 0
>	br.ret.sptk.many b0;;				    //5 	cc:7
> }
>
>As you see, load and increment happens only after we are sure that
>a[i+1]==b[j+1]. As a result, in "normal" case function takes 2 extra cycles
>("cc" is compiler's estimation of cycle count).
>
>Thanks,
>Eugene


And that thing has 128 64-bit registers and MMX too.
I should have a closer look at it, seems to be a perfect
architecture for Fill based bitboard programs ;-)

Thanks for the lesson!

Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.