Computer Chess Club Archives


Subject: Re: copy cost

Author: Christophe Theron

Date: 11:13:48 08/23/03


On August 23, 2003 at 09:40:45, Gerd Isenberg wrote:

>On August 23, 2003 at 04:21:28, Johan de Koning wrote:
>
>>On August 23, 2003 at 03:45:09, Johan de Koning wrote:
>>
>>> ... 1 extra line in main() can
>>>easily change the runtime by 1 or 2% (for reasons I haven't fathomed yet).
>>
>>I mean: I do understand it depends on code alignment.
>>I can imagine the instruction pipeline feeds at very high speed from an "open"
>>cache line. I can also imagine it is rather complicated to have more than 1
>>cache line "open". But I can't imagine why I get random results.
>>
>>/**/ for( i = 0; i < top; i++ ) sum += i;
>>;;;; more: add, inc, cmp, jl more
>>
>>This loop usually executes in 2 cycles. But depending on the alignment I get
>>sometimes 2.667 or 4 or 4.5 cycles. Isn't that weird?!
>>
>>... Johan
>
>Hi Johan,
>
>Ok, your loop body is about 10 bytes.
>If I look at the AMD Athlon Processor x86 Code Optimization Guide,
>it becomes clearer.
>I guess P4 is similar.
>
>Regards,
>Gerd
>
>
>Page 49
>
>4 Instruction Decoding Optimizations
>...
>
>Overview
>--------------------------------------------------------------
>The AMD Athlon processor instruction fetcher reads 16-byte
>aligned code windows from the instruction cache. The
>instruction bytes are then merged into a 24-byte instruction
>queue. On each cycle, the in-order front-end engine selects for
>decode up to three x86 instructions from the instruction-byte
>queue.
>
>....
>
>and Page 54
>
>Align Branch Targets in Program Hot Spots
>
>In program hot spots (as determined by either profiling or loop
>nesting analysis), place branch targets at or near the beginning
>of 16-byte aligned code windows. This guideline improves
>performance inside hotspots by maximizing the number of
>instruction fills into the instruction-byte queue and preserves
>I-cache space in branch intensive code outside such hotspots.



Talk about useful advice!

I should review all my code and place branch targets at 16-byte boundaries; it
sounds so simple.

And every time I make a slight change, do all the work again.

And for each different processor family there must be different optimization
recipes like this one.

I prefer to NOT think about it. :)



    Christophe



Last modified: Thu, 15 Apr 21 08:11:13 -0700
