Author: Matt Taylor
Date: 11:44:44 02/17/03
On February 17, 2003 at 13:34:11, Dezhi Zhao wrote:
>On February 17, 2003 at 03:58:13, Matt Taylor wrote:
>
>>On February 16, 2003 at 22:05:38, Dezhi Zhao wrote:
>>
>>>On February 16, 2003 at 20:08:00, Matt Taylor wrote:
>>>
>>>>On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>>>>
>>>>>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>>>>>
>>>>>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>>>>>
>>>>>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>>>>>
>>>>>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>>>>>
>>>>>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>>>>>LSB_32_table lookup one. It's a mess: both are without any branches, but
>>>>>>>>>IsiChess is still 2-4% slower with these. I guess there are some chaotic
>>>>>>>>>code-size-dependent side effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>>>>>
>>>>>>>>>Regards,
>>>>>>>>>Gerd
>>>>>>>
>>>>>>>What -did- you use before? After thinking about it, I don't see how these
>>>>>>>routines could make such an impact. It would be very difficult to shrink them
>>>>>>>further.
>>>>>>>
>>>>>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>>>>>
>>>>>>>-Matt
>>>>>>
>>>>>>Hi Matt,
>>>>>>
>>>>>>this one (posted already x times) is the routine i compared with:
>>>>>>
>>>>>>__forceinline
>>>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>>>{
>>>>>> __asm
>>>>>> {
>>>>>> mov ebx, [bb]       ; ebx = address of the 64-bit board
>>>>>> xor eax, eax
>>>>>> mov edx, 1
>>>>>> bsf ecx, [ebx]      ; scan the low dword; ZF=1 if it is zero
>>>>>> jnz found
>>>>>> bsf ecx, [ebx + 4]  ; low dword empty: scan the high dword
>>>>>> lea ebx, [ebx + 4]
>>>>>> xor eax, 32         ; bias the index by 32
>>>>>> found:
>>>>>> shl edx, cl
>>>>>> xor eax, ecx        ; eax = bit index, returned to the caller
>>>>>> xor [ebx], edx      ; clear the found bit
>>>>>> }
>>>>>>
>>>>>>}
>>>>>>
>>>>>>To my surprise, this conditional routine using four registers is faster than
>>>>>>the following short, unconditional one, which uses three VectorPath
>>>>>>instructions:
>>>>>>
>>>>>>__forceinline
>>>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>>>{
>>>>>> __asm
>>>>>> {
>>>>>> mov edx, [bb]
>>>>>> bsf eax, [edx+4]    ; scan the high dword first
>>>>>> xor eax, 32         ; 32..63 if the bit was up there
>>>>>> bsf eax, [edx]      ; if the low dword is 0, eax keeps the high result
>>>>>> btr [edx],eax       ; btr's bit offset may exceed 31 and reach [edx+4]
>>>>>> }
>>>>>>}
>>>>>>
>>>>>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>>>>>are all about the same on average, but the nps and solution-time performance
>>>>>>is significantly worse (4-10%).
>>>>>>
>>>>>>Regards,
>>>>>>Gerd
>>>>>
>>>>>This sounds quite normal to me. If you choose to inline them, your shorter
>>>>>version will be faster. Matt's functions are optimal in the context of
>>>>>function calls. If you choose to inline them, you have to factor in register
>>>>>usage and out-of-order execution. Your shorter version uses fewer registers,
>>>>>and the processor can often find OOE opportunities to hide the 2nd bsf
>>>>>instruction.
>>>>
>>>>First let me say that we are specifically talking about Athlon.
>>>>
>>>>On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
>>>>OOOE cannot hide the 8 cycles of latency this produces because the microcoded
>>>>bsf instruction consumes most (all?) of the execution resources. At times I have
>>>>noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
>>>>OOOE.
>>>>
>>>I do not have an Athlon. However, I assume it implements bsf, bsr, and btr the
>>>same way as the P4. Yes, it is difficult to hide an instruction with an 8-cycle
>>>latency. You can sometimes manage to hide part of it if you have another slow
>>>instruction in the stream, e.g. a shift operation.
>>
>>Athlon uses a barrel shifter. Latency of 1 clock. I may be mistaken, but I
>>believe it has 1 attached to each of its 3 IEUs. Throughput of 3 shifts per
>>clock.
>>
>>I don't know how the P4 implements bsf, bsr, and btr, but I know how the Athlon
>>implements the first two. For bsf/bsr, I am told it uses a binary search
>>algorithm. I do not know the exact implementation, but they may do something
>>like x & -x to trap the LSB (for bsf) and then use conditional masks. The bsr
>>instruction would be similar, but a different method for trapping the MSB.
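A guessed version of that binary-search scheme can be sketched in C. This is purely speculative, since AMD does not document the internals; the function name and the mask constants are mine. It isolates the lowest set bit with x & -x (so exactly one bit survives), then resolves each bit of the index with one conditional mask:

```c
#include <stdint.h>

/* Speculative sketch of a binary-search bsf (not AMD's documented design):
   isolate the LSB, then test it against five masks, one per result bit. */
static unsigned bsf_binary_search(uint32_t x)
{
    uint32_t lsb = x & (0u - x);       /* trap the lowest set bit */
    unsigned idx = 0;
    if (lsb & 0xFFFF0000u) idx |= 16;  /* bit in the upper half? */
    if (lsb & 0xFF00FF00u) idx |= 8;
    if (lsb & 0xF0F0F0F0u) idx |= 4;
    if (lsb & 0xCCCCCCCCu) idx |= 2;
    if (lsb & 0xAAAAAAAAu) idx |= 1;
    return idx;                        /* undefined result if x == 0, like bsf */
}
```

Because only one bit is set after the x & -x step, each mask test contributes an independent bit of the index, so the five tests could in principle run in parallel.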
>>
>>I actually timed the algorithm Gerd posted. I don't recall exact numbers, but
>>there was very little overlap between the two bsf instructions. The total
>>execution time was at least 20 cycles. Both bsf instructions are listed as 8
>>cycles, and the btr is listed as 9 cycles. Another sequence of code was 1 cycle
>>faster and used 2 bsf instructions with a shift to reset the bit. The DirectPath
>>instructions used to emulate btr took approximately 5 cycles to execute.
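The shift-based reset that replaces btr can be sketched in portable C. The function names are mine, not from the routines above; like the asm, it assumes the board is nonzero:

```c
#include <stdint.h>

/* Index of the lowest set bit; caller guarantees x != 0. */
static unsigned scan32(uint32_t x)
{
    unsigned i = 0;
    while (!(x & 1)) { x >>= 1; ++i; }
    return i;
}

/* Portable sketch of a bit-search-and-reset: find the lowest set bit of a
   64-bit board, clear it with shift+xor (the DirectPath replacement for
   btr), and return its index. */
static unsigned bit_search_and_reset(uint64_t *bb)
{
    uint32_t lo = (uint32_t)*bb;
    unsigned idx = lo ? scan32(lo)
                      : 32 + scan32((uint32_t)(*bb >> 32));
    *bb ^= 1ULL << idx;   /* mov edx,1 ; shl edx,cl ; xor [mem],edx */
    return idx;
}
```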
>>
>>>>Register usage is a non-issue because VC assumes that all general registers are
>>>>trashed over inline assembly. I can use 1 register or 5 registers; it makes no
>>>>difference. Besides -- he and I both use 4 registers.
>>>>
>>>
>>>If you watch the VC-generated code around inlined asm, you will find register
>>>spill-over. The more registers the inline asm uses, the more spill-over. VC has
>>>no way to bind a variable to a register.
>>
>>No it doesn't, and that's a shame. However, at 3 registers you already have
>>spill-over. It really doesn't make a big difference...
>>
>>I don't remember what I ended up putting in lsb.h, but I know at one point the
>>routine used 6 registers but preserved 3.
>>
>>>>>Of course, if you go hand-inlining Matt's code into your code (rewriting all
>>>>>the callers in asm), you will see the speedup. And you will find even more
>>>>>pairing opportunities :)
>>>>
>>>>Isn't that what the compiler is already doing?
>>>
>>>No. I mean the compiler does a bad job when it comes to inline asm. You can
>>>try it; I bet you won't like the code it generates. It generates some
>>>unnecessary memory operations to free registers.
>>>
>>>>
>>>>>Have fun
>>>>>
>>>>>dzhao
>>
>>Yes, but both cases were inline assembly...
>>
>
>Yes. You can only tell the difference by reading the asm output.
>
>>Throwing stuff out to memory isn't that bad. It can execute out-of-order, and it
>>should usually hit the L1 when referencing the stack. Reading it back in is the
>>slow part.
>>
>
>It is not very bad, but it is worse than you would expect. Besides saving and
>restoring registers around the inlined asm code, the compiler usually uses more
>registers in the caller function than necessary, and thus more push/pop
>operations. And more register usage reduces the chances of register renaming, as
>the P4 manual suggests. I have noticed several instances where VC code with
>inlined asm was much less optimized at the function level than a call to an asm
>function. If the inlined asm resides in a loop, things get even worse. Sometimes
>a function call can even beat the inlined version.
Push is cheap on Athlon. It decodes DirectPath (3 per cycle). The updated value
of esp is available after 1 cycle, and the instruction retires after 2 cycles.
Due to register renaming, it has an effective latency of 1 cycle despite having
an actual latency of 2 cycles.
Pop isn't cheap. I noticed that a mov/add sequence is better than pop (which is a
VectorPath decode), particularly when popping multiple items from the stack.
Making use of the AGU stalls the accesses, but it is still faster than multiple
pop instructions. Unfortunately, VC uses pop instructions rather than the
mov/add equivalent.
You are right about the inline assembly issues. However, with 40 internal rename
registers, I don't think register usage is a big issue. Again, the code he uses
and the code I posted both use 4 registers, so I don't think register usage is
causing the speed hit. Code size isn't either.
>>I personally like GCC's form which allows hints to be given to the compiler as
>>well as variable registers. I wish it were a little less idiosyncratic, though.
>>
>>-Matt
>
>I agree. I just wonder why the VC compiler does not incorporate such hints, since
>it already requires other hints from programmers, e.g. __assume, the branch-taken
>hint, and the spin-loop hint.
-Matt