Computer Chess Club Archives

Subject: Re: copy cost

Author: Gerd Isenberg

Date: 12:42:12 08/23/03


On August 23, 2003 at 14:13:48, Christophe Theron wrote:

>On August 23, 2003 at 09:40:45, Gerd Isenberg wrote:
>
>>On August 23, 2003 at 04:21:28, Johan de Koning wrote:
>>
>>>On August 23, 2003 at 03:45:09, Johan de Koning wrote:
>>>
>>>> ... 1 extra line in main() can
>>>>easily change the runtime by 1 or 2% (for reasons I haven't fathomed yet).
>>>
>>>I mean: I do understand it depends on code alignment.
>>>I can imagine the instruction pipeline feeds at very high speed from an "open"
>>>cache line. I can also imagine it is rather complicated to have more than 1
>>>cache line "open". But I can't imagine why I get random results.
>>>
>>>/**/ for( i = 0; i < top; i++ ) sum += i;
>>>;;;; more: add, inc, cmp, jl more
>>>
>>>This loop usually executes in 2 cycles. But depending on the alignment I get
>>>sometimes 2.667 or 4 or 4.5 cycles. Isn't that weird?!
>>>
>>>... Johan
>>
>>Hi Johan,
>>
>>Ok, your loop body is about 10 bytes.
>>If I look at the AMD Athlon Processor
>>x86 Code Optimization Guide, it becomes clearer.
>>I guess P4 is similar.
>>
>>Regards,
>>Gerd
>>
>>
>>Page 49
>>
>>4 Instruction Decoding Optimizations
>>...
>>
>>Overview
>>--------------------------------------------------------------
>>The AMD Athlon processor instruction fetcher reads 16-byte
>>aligned code windows from the instruction cache. The
>>instruction bytes are then merged into a 24-byte instruction
>>queue. On each cycle, the in-order front-end engine selects for
>>decode up to three x86 instructions from the instruction-byte
>>queue.
>>
>>....
>>
>>and Page 54
>>
>>Align Branch Targets in Program Hot Spots
>>
>>In program hot spots (as determined by either profiling or loop
>>nesting analysis), place branch targets at or near the beginning
>>of 16-byte aligned code windows. This guideline improves
>>performance inside hotspots by maximizing the number of
>>instruction fills into the instruction-byte queue and preserves
>>I-cache space in branch intensive code outside such hotspots.
>
>
>
>Talk about useful advice!
>
>I should review all my code and place branch targets at 16-byte boundaries; it
>sounds so simple.
>
>And every time I make a slight change, do all the work again.
>
>And for each different processor family there must be different optimization
>recipes like this one.
>
>I prefer to NOT think about it. :)
>
>
>
>    Christophe



Hi Christophe,

It was only meant as an explanation of the "weird" behaviour Johan mentioned.

But don't worry: as long as you don't use assembler, the compiler should handle
that for you, e.g. with the right settings for the target CPU, optimization for
speed, or profile-guided optimization.

With (inline) assembler, MSVC6 gives you the "align xx" directive:

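/* Emit an alignment directive only when MSVC-style inline assembler is
   enabled; otherwise the macro expands to nothing. */
#ifdef X86_INLINE_ASM
#define  ALIGN_CODE(x) __asm align x
#else
#define  ALIGN_CODE(x)
#endif


/* Pad to a 16-byte boundary before the loop. */
ALIGN_CODE(16)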
while (x)
 ...
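
To see why the starting offset matters for a loop of about 10 bytes, here is a rough sketch of the window arithmetic only. It assumes the 16-byte fetch window from the AMD guide quoted above; the helper names windows_spanned and padding_to_boundary are made up for illustration, and nothing here models decode timing or predicts cycle counts.

```c
/* Fetch window size, taken from the AMD Athlon guide quoted above. */
#define FETCH_WINDOW 16u

/* Hypothetical helper: how many 16-byte aligned windows does a block of
   `len` code bytes touch when it starts at address `start`? */
static unsigned windows_spanned(unsigned start, unsigned len)
{
    unsigned first, last;

    if (len == 0)
        return 0;
    first = start / FETCH_WINDOW;             /* window holding the first byte */
    last  = (start + len - 1) / FETCH_WINDOW; /* window holding the last byte  */
    return last - first + 1;
}

/* Hypothetical helper: how many filler bytes an "align 16" style directive
   would need to insert so that the next instruction starts a new window. */
static unsigned padding_to_boundary(unsigned start)
{
    return (FETCH_WINDOW - start % FETCH_WINDOW) % FETCH_WINDOW;
}
```

With a 10-byte body, windows_spanned(off, 10) is 1 for offsets 0 to 6 inside a window and 2 from offset 7 onward, so one extra line somewhere earlier in the code can move the loop from one case to the other. padding_to_boundary(7) is 9, which is the filler needed to get back to the start of a window.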

Gerd


