Author: Dezhi Zhao
Date: 12:19:12 02/17/03
On February 17, 2003 at 14:44:44, Matt Taylor wrote:
>On February 17, 2003 at 13:34:11, Dezhi Zhao wrote:
>
>>On February 17, 2003 at 03:58:13, Matt Taylor wrote:
>>
>>>On February 16, 2003 at 22:05:38, Dezhi Zhao wrote:
>>>
>>>>On February 16, 2003 at 20:08:00, Matt Taylor wrote:
>>>>
>>>>>On February 16, 2003 at 15:55:37, Dezhi Zhao wrote:
>>>>>
>>>>>>On February 16, 2003 at 07:37:31, Gerd Isenberg wrote:
>>>>>>
>>>>>>>On February 16, 2003 at 05:14:41, Matt Taylor wrote:
>>>>>>>
>>>>>>>>On February 14, 2003 at 23:52:09, Matt Taylor wrote:
>>>>>>>>
>>>>>>>>>On February 14, 2003 at 14:31:42, Gerd Isenberg wrote:
>>>>>>>>>
>>>>>>>>>>I tried two of your routines so far, the cmov/single-bsf one and the
>>>>>>>>>>LSB_32_table lookup one. It's a mess: both are without any branches, but
>>>>>>>>>>IsiChess is still 2-4% slower with these. I guess there are some chaotic,
>>>>>>>>>>code-size-dependent side effects (Athlon XP 2000+, 512MB SDRAM).
>>>>>>>>>>
>>>>>>>>>>Regards,
>>>>>>>>>>Gerd
>>>>>>>>
>>>>>>>>What -did- you use before? After thinking about it, I don't see how these
>>>>>>>>routines could make such an impact. It would be very difficult to shrink them
>>>>>>>>further.
>>>>>>>>
>>>>>>>>Have you tried not inlining them? That would make the resulting code smaller.
>>>>>>>>
>>>>>>>>-Matt
>>>>>>>
>>>>>>>Hi Matt,
>>>>>>>
>>>>>>>this one (posted already x times) is the routine i compared with:
>>>>>>>
>>>>>>>__forceinline
>>>>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>>>>{
>>>>>>> __asm
>>>>>>> {
>>>>>>> mov ebx, [bb]
>>>>>>> xor eax, eax
>>>>>>> mov edx, 1
>>>>>>> bsf ecx, [ebx]
>>>>>>> jnz found
>>>>>>> bsf ecx, [ebx + 4]
>>>>>>> lea ebx, [ebx + 4]
>>>>>>> xor eax, 32
>>>>>>> found:
>>>>>>> shl edx, cl
>>>>>>> xor eax, ecx
>>>>>>> xor [ebx], edx
>>>>>>> }
>>>>>>>
>>>>>>>}
>>>>>>>
>>>>>>>To my surprise, this conditional routine using four registers is faster than
>>>>>>>the following short unconditional one, which uses three VectorPath instructions:
>>>>>>>
>>>>>>>__forceinline
>>>>>>>UINT BitSearchAndReset(BitBoard &bb)
>>>>>>>{
>>>>>>> __asm
>>>>>>> {
>>>>>>> mov edx, [bb]
>>>>>>> bsf eax, [edx+4]
>>>>>>> xor eax, 32
>>>>>>> bsf eax, [edx]
>>>>>>> btr [edx],eax
>>>>>>> }
>>>>>>>}
>>>>>>>
>>>>>>>Without inlining, your routines and even Walter's magic lsb64 lookup routine
>>>>>>>are all about the same on average, but the nps and solution-time performance is
>>>>>>>significantly worse (4-10%).
>>>>>>>
>>>>>>>Regards,
>>>>>>>Gerd
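For reference, what both routines above compute — the index of the lowest set bit of a 64-bit board, clearing that bit as a side effect — can be sketched in portable C. This is my sketch, not code from the thread; the names are mine, and the `lsb32` helper merely stands in for a 32-bit bsf:

```c
#include <stdint.h>

/* Stand-in for a 32-bit bsf: index of the lowest set bit. Assumes x != 0. */
static unsigned lsb32(uint32_t x)
{
    unsigned i = 0;
    while (!(x & 1)) { x >>= 1; ++i; }
    return i;
}

/* C sketch of BitSearchAndReset: scan the low 32-bit half first, fall back
   to the high half (adding 32), then clear the found bit, mirroring the
   shl/xor at the end of the first routine. Assumes *bb != 0, as the asm
   versions do. */
static unsigned bit_search_and_reset(uint64_t *bb)
{
    uint32_t lo = (uint32_t)*bb;
    unsigned idx = lo ? lsb32(lo)
                      : 32 + lsb32((uint32_t)(*bb >> 32));
    *bb ^= (uint64_t)1 << idx;   /* reset the bit we just found */
    return idx;
}
```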
>>>>>>
>>>>>>This sounds quite normal to me. If you choose to inline them, your shorter
>>>>>>version will be faster. Matt's functions are optimal in the context of a
>>>>>>function call. If you choose to inline them, you have to factor in register
>>>>>>usage and out-of-order execution. Your shorter version uses fewer registers,
>>>>>>and the processor can often find OOE opportunities to hide the 2nd bsf
>>>>>>instruction.
>>>>>
>>>>>First let me say that we are specifically talking about Athlon.
>>>>>
>>>>>On Athlon, VectorPath instructions (like bsf) decode into a series of macro-ops.
>>>>>OOOE cannot hide the 8 cycles of latency this produces because the microcoded
>>>>>bsf instruction consumes most (all?) of the execution resources. At times I have
>>>>>noted the bsf instruction taking only 7 cycles. Perhaps this is an effect of
>>>>>OOOE.
>>>>>
>>>>I do not have an Athlon. However, I assume it implements bsf, bsr, and btr the
>>>>same way the P4 does. Yes, it is difficult to hide an instruction with 8 cycles
>>>>of latency. You can manage to hide part of it sometimes, if you have another
>>>>slow instruction in the stream, e.g. a shift operation.
>>>
>>>Athlon uses a barrel shifter. Latency of 1 clock. I may be mistaken, but I
>>>believe it has 1 attached to each of its 3 IEUs. Throughput of 3 shifts per
>>>clock.
>>>
>>>I don't know how the P4 implements bsf, bsr, and btr, but I know how the Athlon
>>>implements the first two. For bsf/bsr, I am told it uses a binary search
>>>algorithm. I do not know the exact implementation, but they may do something
>>>like x & -x to trap the LSB (for bsf) and then use conditional masks. The bsr
>>>instruction would be similar, but a different method for trapping the MSB.
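The speculated binary-search approach — trap the LSB with x & -x, then build the index from conditional masks — could be sketched in C roughly like this. This is only a guess at the idea, not AMD's actual implementation:

```c
#include <stdint.h>

/* Binary-search emulation of a 32-bit bsf: x & -x leaves only the lowest
   set bit, then each mask test contributes one bit of the index.
   Assumes x != 0 (bsf's result is undefined for a zero source anyway). */
static unsigned bsf32_emul(uint32_t x)
{
    uint32_t lsb = x & (0 - x);       /* isolate the lowest set bit */
    unsigned idx = 0;
    if (lsb & 0xFFFF0000u) idx += 16; /* is it in the upper half? */
    if (lsb & 0xFF00FF00u) idx += 8;
    if (lsb & 0xF0F0F0F0u) idx += 4;
    if (lsb & 0xCCCCCCCCu) idx += 2;
    if (lsb & 0xAAAAAAAAu) idx += 1;
    return idx;
}
```

A bsr emulation would be the same shape, but with a different trick for trapping the MSB, since x & -x only works for the low end.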
>>>
>>>I actually timed the algorithm Gerd posted. I don't recall exact numbers, but
>>>there was very little overlap between the two bsf instructions. The total
>>>execution time was at least 20 cycles. Both bsf instructions are listed as 8
>>>cycles, and the btr is listed as 9 cycles. Another sequence of code was 1 cycle
>>>faster and used 2 bsf instructions with a shift to reset the bit. The DirectPath
>>>instructions used to emulate btr took approximately 5 cycles to execute.
>>>
>>>>>Register usage is a non-issue because VC assumes that all general registers are
>>>>>trashed over inline assembly. I can use 1 register or 5 registers; it makes no
>>>>>difference. Besides -- he and I both use 4 registers.
>>>>>
>>>>
>>>>If you look at the VC-generated code with inlined asm, you will find register
>>>>spill-over. The more registers the inline asm uses, the more spill-over. VC does
>>>>not have a way to assign a variable to a specific register.
>>>
>>>No it doesn't, and that's a shame. However, at 3 registers you already have
>>>spill-over. It really doesn't make a big difference...
>>>
>>>I don't remember what I ended up putting in lsb.h, but I know at one point the
>>>routine used 6 registers but preserved 3.
>>>
>>>>>>Of course, if you go hand-inlining Matt's code into your code (rewriting all
>>>>>>the callers in asm), you will see the speedup. And you'll find even more
>>>>>>pairing opportunities :)
>>>>>
>>>>>Isn't that what the compiler is already doing?
>>>>
>>>>No. I mean the compiler does a bad job when it comes across inline asm. You can
>>>>try it; I bet you won't like the code it generates. It generates some
>>>>unnecessary memory operations to free registers.
>>>>
>>>>>
>>>>>>Have fun
>>>>>>
>>>>>>dzhao
>>>
>>>Yes, but both cases were inline assembly...
>>>
>>
>>Yes. You can only tell the difference by reading the asm output.
>>
>>>Throwing stuff out to memory isn't that bad. It can execute out-of-order, and it
>>>should usually hit the L1 when referencing the stack. Reading it back in is the
>>>slow part.
>>>
>>
>>It is not very bad, but it is worse than you would expect. Besides saving and
>>loading registers from/to the stack around the inlined asm code, the compiler
>>usually uses more registers for the calling function than necessary, and thus
>>more push/pop operations. And more register usage reduces the chances of
>>register renaming, as the P4 manual suggests. I have noticed several instances
>>of much less optimized code with inlined asm from VC at the function level,
>>compared to a function call to an asm function. If the inlined asm resides in a
>>loop, things get even worse. Sometimes a function call can even beat the
>>inlined version.
>
>Push is cheap on Athlon. It decodes DirectPath (3 per cycle). The updated value
>of esp is available after 1 cycle, and the instruction retires after 2 cycles.
>Due to register renaming, it has an effective latency of 1 cycle despite having
>an actual latency of 2 cycles.
>
>Pop isn't cheap. I noticed that a mov/add sequence is better than pop (which is
>VectorPath decode), particularly when popping multiple items from the stack.
>Making use of the AGU stalls the accesses, but it is still faster than multiple
>pop instructions. VC does unfortunately use pop instructions rather than the
>mov/add equivalent.
>
I have not paid much attention to pop, because it has a latency of 1.5 and a
throughput of 1 on the P4, exactly the same as push. The only complaint I have
about the compiler-generated header and tail code is that it often generates
consecutive push/pop sequences without interleaving other write/read
instructions.
>You are right about the inline assembly issues. However, with 40 internal rename
>registers, I don't think register usage is a big issue. Again, the code he uses
>and the code I posted both use 4 registers, so I don't think register usage is
>causing the speed hit. Code size isn't either.
>
I just read the two versions again and found that the short one has a btr
instruction. I don't have exact timings for btr/bts. However, I think it is as
slow as a bsf. Here it uses a memory operand, so it may be even worse than a
bsf reg, reg. This is likely the cause. If the reset is done using xor, I bet
you will find some speedup.
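For what it's worth, the xor-based reset suggested here needs no btr at all. In C terms (my sketch, not code from the thread), either of these clears the lowest set bit using only plain ALU operations:

```c
#include <stdint.h>

/* Clear the lowest set bit by xor-ing out the isolated LSB (x & -x),
   mirroring the shl/xor reset in the first asm routine above. */
static uint64_t clear_lsb_xor(uint64_t bb)
{
    return bb ^ (bb & (0 - bb));
}

/* The classic x & (x - 1) form: decrementing borrows through the trailing
   zeros, so the AND wipes exactly the lowest set bit. */
static uint64_t clear_lsb_dec(uint64_t bb)
{
    return bb & (bb - 1);
}
```

Neither form needs the bit index first, so the reset can overlap with the bsf that computes the return value.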
>>>I personally like GCC's form which allows hints to be given to the compiler as
>>>well as variable registers. I wish it were a little less idiosyncratic, though.
>>>
>>>-Matt
>>
>>I agree. I just wonder why the VC compiler does not incorporate such hints,
>>since it already requires other hints from programmers, e.g. __assume, the
>>branch-taken hint, and the spin-loop hint.
>
>-Matt
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.