Author: Vincent Diepeveen
Date: 19:34:53 10/23/03
On October 22, 2003 at 12:29:16, Eugene Nalimov wrote:
Talking about Microsoft Visual C++:
How do you do PGO with it?
I'm looking at the 7.1 (.NET 2003) release and can't find it.
Note that it beats Visual C++ 6.0 SP5 with the Processor Pack by more than 5% for DIEP.
5% faster on a K7 without PGO is really *a lot*.
About the Visual C++ compiler for Itanium, however, we hear very little. I can't find
it in the SPECint results. How is it doing there?
>On October 22, 2003 at 11:28:09, Gerd Isenberg wrote:
>
>>On October 22, 2003 at 03:33:05, Daniel Clausen wrote:
>>
>>>On October 21, 2003 at 15:29:19, Eugene Nalimov wrote:
>>>
>>>>On Itanium, integer registers are actually 65 bits wide: 64 bits for data and one
>>>>NaT (Not a Thing) bit.
>>>>
>>>>:-)
>>>>
>>>>Thanks,
>>>>Eugene
>>>
>>>If there is a way to use this bit for yourself too, I'm sure Gerd will come up
>>>with another cool new algorithm! :)
>>>
>>>Sargon
>>
>>;-)
>>
>>I'm really not familiar with this very interesting processor architecture. It has
>>an integer register file of 128 (!) registers, each 64 bits + NaT. It seems well
>>designed to do a lot of parallel fill cycles.
>>I guess a set NaT bit may trigger some exceptions/interrupts if you do some
>>operations with uninitialized registers, allowing some low-level try-catch-like
>>control structures (including stack rewind?).
>>
>>Gerd
>
>The NaT bit is there to support control speculation. It allows the compiler to move
>a load from memory across a branch.
>
>Look at the following function:
> void foo (int *a, int *b, int *p, int i, int j)
> {
>     if (a[i+1] == b[j+1])
>         (*p)++;
> }
>Without speculation you have to first check the condition, and only then load
>*p. With speculation you can load *p in parallel with computations of a[i+1] and
>b[j+1].
>
>Here is the asm file generated by Visual C:
>
>// Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.31020
>
>...[I removed some lines that are not interesting for this discussion]
>
>// Begin code for function: foo:
> .proc foo#
> .align 32
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> { .mii //R-Addr: 0X00
> ld4.s r29=[r34] //4.R cc:0
> sxt4 r31=r36 //3. cc:0
> sxt4 r30=r35;; //3. cc:0
> }
> { .mii //R-Addr: 0X010
> adds r27=1, r31 //3. cc:1
> adds r26=1, r30 //3. cc:1
> adds r28=1, r29;; //4.R cc:1
> }
> { .mib //R-Addr: 0X020
> shladd r25=r27, 2, r33 //3. cc:2
> shladd r22=r26, 2, r32 //3. cc:2
> nop.b 0;;
> }
> { .mmb //R-Addr: 0X030
> ld4 r21=[r25] //3. cc:3
> ld4 r20=[r22] //3. cc:3
> nop.b 0;;
> }
> { .mmi //R-Addr: 0X040
> cmp4.ne.unc p0,p6=r20, r21;; //3. cc:4
> (p6) chk.s.m r29, foo$2# //4.I cc:5
> nop.i 0
> }
>foo$1: // Recovery label
> { .mmb //R-Addr: 0X050
> (p6) st4 [r34]=r28 //4.I cc:5
> nop.m 0
> br.ret.sptk.many b0;; //5 cc:5
> }
>
>// Scenario dead code below
>foo$3:
>foo$2: // Recovery code
> { .mmi //R-Addr: 0X060
> ld4 r29=[r34];; //4 cc:0
> adds r28=1, r29 //4 cc:1
> nop.i 0
> }
> { .mmb //R-Addr: 0X070
> nop.m 0
> nop.m 0
> br.cond.sptk.few foo$1#;; //4 cc:1
> }
>// End code for function:
> .endp foo#
>
>As you can see, the speculative load is the first instruction in the function. If for
>whatever reason the speculative load fails, the NaT bit for r29 will be set.
>
>We increment the loaded value in parallel with the other computations (6th instruction).
>If the NaT bit on r29 is set, the NaT bit on r28 will be set too, indicating that the
>value in that register is undefined as well.
>
>The instruction at address 0x40 compares a[i+1] and b[j+1]. Here you can see another
>Itanium architectural feature -- predication. There is no branch (and no
>potential branch misprediction); instead, two instructions are predicated by
>predicate register p6. That register is set if a[i+1]==b[j+1].
>
>So, if a[i+1]==b[j+1] we execute
> (p6) chk.s.m r29, foo$2# //4.I cc:5
> (p6) st4 [r34]=r28 //4.I cc:5
>
>The first instruction checks the NaT bit on r29, and if the register contains an
>undefined value, control goes to the label foo$2. As you can see, the "recovery code"
>there unconditionally reloads the value from memory, increments it, and branches back
>into the main function body.
>
>Here is what happens if I use the (debug) flag that forces the compiler not to use
>control speculation:
>
>foo:
>// a$ = r32
>// b$ = r33
>// p$ = r34
>// i$ = r35
>// j$ = r36
>// Output regs: None
>// File c:\repro\m.c
> { .mii //R-Addr: 0X00
> nop.m 0
> sxt4 r31=r36 //3. cc:0
> sxt4 r30=r35;; //3. cc:0
> }
> { .mii //R-Addr: 0X010
> adds r29=1, r31 //3. cc:1
> adds r28=1, r30;; //3. cc:1
> shladd r27=r29, 2, r33 //3. cc:2
> }
> { .mmi //R-Addr: 0X020
> shladd r26=r28, 2, r32;; //3. cc:2
> ld4 r25=[r27] //3. cc:3
> nop.i 0
> }
> { .mmi //R-Addr: 0X030
> ld4 r22=[r26];; //3. cc:3
> cmp4.ne.unc p0,p6=r22, r25 //3. cc:4
> nop.i 0;;
> }
> { .mmi //R-Addr: 0X040
> (p6) ld4.bias r21=[r34];; //4.I cc:5
> (p6) adds r20=1, r21 //4.I cc:6
> nop.i 0;;
> }
> { .mmb //R-Addr: 0X050
> (p6) st4 [r34]=r20 //4.I cc:7
> nop.m 0
> br.ret.sptk.many b0;; //5 cc:7
> }
>
>As you can see, the load and increment happen only after we are sure that
>a[i+1]==b[j+1]. As a result, in the "normal" case the function takes 2 extra cycles
>("cc" is the compiler's estimate of the cycle count).
>
>Thanks,
>Eugene
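
For readers who don't read Itanium assembly, the speculation scheme described in the quoted post can be roughly modelled in plain C. This is only a sketch of the idea, not what the compiler or the hardware actually does: the Reg struct, spec_load, add_imm, foo_speculative and the defer parameter are all made-up names for illustration, and a NULL pointer or defer == true merely stands in for "the speculative load was deferred".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an Itanium general register: data plus a NaT flag. */
typedef struct {
    int  value;
    bool nat;   /* true: value is undefined because a fault was deferred */
} Reg;

/* Models ld4.s: a failed load sets NaT instead of faulting.
 * ASSUMPTION: a NULL address or the made-up 'defer' flag stands in for
 * "the hardware deferred this load". */
static Reg spec_load(const int *addr, bool defer)
{
    Reg r = { 0, true };
    if (addr != NULL && !defer) {
        r.value = *addr;
        r.nat = false;
    }
    return r;
}

/* Models adds: a NaT on the source propagates to the result. */
static Reg add_imm(int imm, Reg src)
{
    Reg r = { src.nat ? 0 : src.value + imm, src.nat };
    return r;
}

/* Same job as foo() in the quoted post, with the load of *p hoisted above
 * the comparison. Returns 1 if the recovery path ran, 0 otherwise. */
int foo_speculative(const int *a, const int *b, int *p, int i, int j, bool defer)
{
    assert(a != NULL && b != NULL);

    Reg r29 = spec_load(p, defer);   /* ld4.s r29=[r34]            */
    Reg r28 = add_imm(1, r29);       /* adds  r28=1, r29           */
    int took_recovery = 0;

    if (a[i + 1] == b[j + 1]) {      /* predicate p6 is set        */
        if (r29.nat) {               /* (p6) chk.s.m r29, foo$2    */
            r29.value = *p;          /* plain ld4: may really fault */
            r29.nat = false;
            r28 = add_imm(1, r29);
            took_recovery = 1;
        }
        *p = r28.value;              /* (p6) st4 [r34]=r28         */
    }
    return took_recovery;
}
```

The point of the model: when the comparison fails, the hoisted load is never checked, so a bad p costs nothing and causes no fault, which is the same behaviour as the original source. When the comparison succeeds and the NaT flag is set, the recovery path redoes the load non-speculatively, exactly like the code at foo$2.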
Last modified: Thu, 15 Apr 21 08:11:13 -0700