Computer Chess Club Archives


Subject: Re: 65 bits!

Author: Vincent Diepeveen

Date: 19:34:53 10/23/03


On October 22, 2003 at 12:29:16, Eugene Nalimov wrote:

Talking about Microsoft Visual C++:

How do you do PGO with it?

I'm looking at the 7.1 release (.NET 2003) and can't find it.

Note that it kicks the hell out of Visual C++ 6.0 SP5 with the Processor Pack:
more than 5% faster for DIEP.

5% faster on a K7 without PGO is really *a lot*.

About the Visual C++ compiler for Itanium, however, we hear very little. I can't
find it at SPECint. How's it doing there?

>On October 22, 2003 at 11:28:09, Gerd Isenberg wrote:
>
>>On October 22, 2003 at 03:33:05, Daniel Clausen wrote:
>>
>>>On October 21, 2003 at 15:29:19, Eugene Nalimov wrote:
>>>
>>>>On Itanium, integer registers are actually 65 bits wide: 64 bits for data and
>>>>one NaT (Not a Thing) bit.
>>>>
>>>>:-)
>>>>
>>>>Thanks,
>>>>Eugene
>>>
>>>If there is a way to use this bit for yourself too, I'm sure Gerd will come up
>>>with another cool new algorithm! :)
>>>
>>>Sargon
>>
>>;-)
>>
>>I'm really not familiar with this very interesting processor architecture. It has
>>an integer register file of 128! * 64+NaT. It seems well designed to do a lot of
>>parallel fill cycles.
>>I guess a set NaT bit may trigger some exceptions/interrupts if you do some
>>operations with uninitialized registers, allowing some low-level try-catch-like
>>control structures (including stack rewind?).
>>
>>Gerd
>
>The NaT bit is there to support control speculation. It allows the compiler to
>move a load from memory across a branch.
>
>Look at the following function:
>    void foo (int *a, int *b, int *p, int i, int j)
>    {
>        if (a[i+1] == b[j+1])
>            (*p) ++;
>    }
>Without speculation you have to first check the condition, and only then load
>*p. With speculation you can load *p in parallel with computations of a[i+1] and
>b[j+1].
>
>Here is asm file generated by Visual C:
>
>// Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.31020
>
>...[I removed some lines that are not interesting for this discussion]
>
>// Begin code for function: foo:
>	.proc	foo#
>	.align 32
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> {   .mii  //R-Addr: 0X00
>	ld4.s	r29=[r34]				    //4.R	cc:0
>	sxt4	r31=r36					    //3.	cc:0
>	sxt4	r30=r35;;				    //3.	cc:0
> }
> {   .mii  //R-Addr: 0X010
>	adds	r27=1, r31				    //3.	cc:1
>	adds	r26=1, r30				    //3.	cc:1
>	adds	r28=1, r29;;				    //4.R	cc:1
> }
> {   .mib  //R-Addr: 0X020
>	shladd	r25=r27, 2, r33				    //3.	cc:2
>	shladd	r22=r26, 2, r32				    //3.	cc:2
>	nop.b	 0;;
> }
> {   .mmb  //R-Addr: 0X030
>	ld4	r21=[r25]				    //3.	cc:3
>	ld4	r20=[r22]				    //3.	cc:3
>	nop.b	 0;;
> }
> {   .mmi  //R-Addr: 0X040
>	cmp4.ne.unc p0,p6=r20, r21;;			    //3.	cc:4
>  (p6)	chk.s.m	 r29, foo$2#				    //4.I	cc:5
>	nop.i	 0
> }
>foo$1:							    // Recovery label
> {   .mmb  //R-Addr: 0X050
>  (p6)	st4	[r34]=r28				    //4.I	cc:5
>	nop.m	 0
>	br.ret.sptk.many b0;;				    //5 	cc:5
> }
>
>// Scenario dead code below
>foo$3:
>foo$2:							    // Recovery code
> {   .mmi  //R-Addr: 0X060
>	ld4	r29=[r34];;				    //4 	cc:0
>	adds	r28=1, r29				    //4 	cc:1
>	nop.i	 0
> }
> {   .mmb  //R-Addr: 0X070
>	nop.m	 0
>	nop.m	 0
>	br.cond.sptk.few foo$1#;;			    //4 	cc:1
> }
>// End code for function:
>	.endp	foo#
>
>As you can see, the speculative load is the first instruction in the function. If
>for whatever reason the speculative load fails, the NaT bit for r29 will be set.
>
>We increment the loaded value in parallel with other computations (6th
>instruction). If the NaT bit on r29 is set, the NaT bit on r28 will be set too,
>indicating that the value in that register is undefined as well.
>
>Instruction at address 0x40 compares a[i+1] and b[j+1]. Here you can see another
>Itanium architectural feature -- predication. There is no branch (and no
>potential branch misprediction), instead two instructions are predicated by
>predicate register p6. That register is set if a[i+1]==b[j+1].
>
>So, if a[i+1]==b[j+1] we execute
>  (p6)	chk.s.m	 r29, foo$2#				    //4.I	cc:5
>  (p6)	st4	[r34]=r28				    //4.I	cc:5
>
>The first instruction checks the NaT bit on r29, and if the register contains an
>undefined value, control goes to the label foo$2. As you can see, the "recovery
>code" there unconditionally reloads the value from memory, increments it, and
>branches back into the main function body.
>
>Here is what happens if I use the (debug) flag that forces the compiler not to
>use control speculation:
>
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> {   .mii  //R-Addr: 0X00
>	nop.m	 0
>	sxt4	r31=r36					    //3.	cc:0
>	sxt4	r30=r35;;				    //3.	cc:0
> }
> {   .mii  //R-Addr: 0X010
>	adds	r29=1, r31				    //3.	cc:1
>	adds	r28=1, r30;;				    //3.	cc:1
>	shladd	r27=r29, 2, r33				    //3.	cc:2
> }
> {   .mmi  //R-Addr: 0X020
>	shladd	r26=r28, 2, r32;;			    //3.	cc:2
>	ld4	r25=[r27]				    //3.	cc:3
>	nop.i	 0
> }
> {   .mmi  //R-Addr: 0X030
>	ld4	r22=[r26];;				    //3.	cc:3
>	cmp4.ne.unc p0,p6=r22, r25			    //3.	cc:4
>	nop.i	 0;;
> }
> {   .mmi  //R-Addr: 0X040
>  (p6)	ld4.bias r21=[r34];;				    //4.I	cc:5
>  (p6)	adds	r20=1, r21				    //4.I	cc:6
>	nop.i	 0;;
> }
> {   .mmb  //R-Addr: 0X050
>  (p6)	st4	[r34]=r20				    //4.I	cc:7
>	nop.m	 0
>	br.ret.sptk.many b0;;				    //5 	cc:7
> }
>
>As you see, the load and increment happen only after we are sure that
>a[i+1]==b[j+1]. As a result, in the "normal" case the function takes 2 extra
>cycles ("cc" is the compiler's estimate of the cycle count).
>
>Thanks,
>Eugene
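
For readers who want to trace the quoted listing by hand, here is a rough C model of the speculative version of foo. It is only a sketch under stated assumptions, not what the compiler or the hardware actually does: the reg65 type, the helper names ld4_s and adds, and the defer parameter (standing in for whatever condition makes the hardware defer the exception, such as a page fault) are all invented for illustration. The NaT propagation, the predicated check, and the recovery path follow the explanation quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one Itanium integer register: 64 data bits + NaT. */
typedef struct {
    long long value;
    bool      nat;
} reg65;

/* Models ld4.s. 'defer' is an assumption: it stands in for any condition
   that makes the hardware set NaT instead of raising the fault right away. */
static reg65 ld4_s(const int *addr, bool defer)
{
    reg65 r = { 0, true };
    if (!defer) {
        r.value = *addr;
        r.nat = false;
    }
    return r;
}

/* Models adds: the NaT bit propagates from source to destination. */
static reg65 adds(long long imm, reg65 src)
{
    reg65 r = { src.value + imm, src.nat };
    return r;
}

/* Model of the speculative foo. Returns true if the recovery path ran. */
static bool foo_model(const int *a, const int *b, int *p,
                      int i, int j, bool defer)
{
    bool recovered = false;
    assert(a != NULL && b != NULL && p != NULL);

    reg65 r29 = ld4_s(p, defer);        /* ld4.s r29=[r34], hoisted     */
    reg65 r28 = adds(1, r29);           /* adds  r28=1, r29             */
    bool  p6  = (a[i + 1] == b[j + 1]); /* cmp4.ne.unc p0,p6=r20, r21   */

    if (p6 && r29.nat) {                /* (p6) chk.s.m r29, foo$2      */
        r29.value = *p;                 /* recovery: ld4 r29=[r34]      */
        r29.nat = false;
        r28 = adds(1, r29);             /* recovery: adds r28=1, r29    */
        recovered = true;
    }
    if (p6)
        *p = (int)r28.value;            /* (p6) st4 [r34]=r28           */
    return recovered;
}
```

Calling foo_model with defer set and equal array elements takes the recovery path and still increments *p exactly once. With unequal elements the NaT value is never checked or stored, which is the point of deferring the fault.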



Last modified: Thu, 15 Apr 21 08:11:13 -0700
