Author: Robert Hyatt
Date: 11:17:17 01/06/03
Go up one level in this thread
On January 06, 2003 at 13:12:24, Matt Taylor wrote:
>On January 05, 2003 at 22:52:06, Robert Hyatt wrote:
>
>>On January 05, 2003 at 17:06:56, Matt Taylor wrote:
>>
>>>>I was chugging away on the older binary search routine yesterday. I've modified
>>>>it to use a 5th register. This allows me to stagger the accumulation and make
>>>>use of extra cycles that were burning waiting on instructions to retire. It is
>>>>amazingly efficient -- considering the maximum of 3 instructions/cycle, there
>>>>are only 7 slots wasted across the entire routine (38 instructions). This means
>>>>optimum execution time for this sequence is 13 cycles, and I'm doing it in 15.
>>>
>>>Oops, did I say 15? I should correct myself. I am now doing it in 14. Only 4
>>>slots not utilized. I fixed a bug and gained performance. :-)
>>>
>>>-Matt
>>\
>>
>>If you like cute code, here is Eugene's FirstOne() function for
>>Crafty. It is cute, but has one drawback in that it can not be inlined
>>due to the addressing computations he does which means that the value being
>>checked for FirstOne() has to be in memory...
>>
><snip>
>>
>>I think you will be amused if you work thru it. I should add that Crafty's
>>"bit numbering" has the MSB (leftmost) as bit 0, LSB (rightmost) as bit 63,
>>which is how the Cray does this due to the fact that they have a "leadz"
>>instruction that counts leading (leftmost) zeroes rather than finding the
>>first one bit. The Cray also returns 64 if there are no bits set. Eugene's
>>code emulates this perfectly.
>>
><snip>
>
>I am actually interested more in your LastOne() routine. I looked at your code
>for each previously and had noted the reverse bit ordering. The ordering is
>arbitrary; they still could have used normal bit ordering (e.g. leadz used in
>place of bsr, not bsf), but it makes no difference anyway.
>
>Eugene's trick is very interesting, taking advantage of the fact that cmp is
>really sub without writing to the destination. I used a similar trick, though I
>use setz instead of sbb. The sbb likely works better on Intel processors. I
>think I will implement that.
>
>Unfortunately the branch, as you point out, can hurt. On a P6-core chip, the
>bsf/bsr instructions work great. On Athlon, Eugene's routine branches and yours
>has the disadvantage of using bsr twice (10 clocks at minimum for each). On the
>other hand, both are relatively small. The routine I'm working on is some ~40
>instructions and more than 100 code bytes.
>
>This is obviously written in Intel assembly. The routine it is based off of
>takes 14 clocks to execute on Athlon; I believe this takes an additional clock
>because of Crafty requirements. It has the same problem Eugene's routine does
>(with address computation), but considering x86 GPR restrictions, it is unlikely
>a routine like this -could- work with the bitboard in registers. The memory
>access comes free, and I believe one can coerce gcc to pass the bitboard in
>through registers if gcc desires to.
>
>Also FYI this modifies the bitboard to reset the bit it finds. If that were
>removed, this could be inlined with no memory access.
>
> ; Note, instructions dispatching in the same cycle are grouped.
> ; Note preservation of ebx/esi.
> ; Note: this routine tailored to Crafty's bit ordering!
>
> push esi
> mov esi, DWORD PTR [bb]
> xor eax, eax
>
> mov ecx, DWORD PTR [bb+4]
> xor ebx, ebx
> test esi, esi
>
> ; 2 cycles
> cmovz esi, ecx
> push ebx
> setz al
> mov ecx, esi
> neg esi
>
> setnc bl
> adc al, -1
> and esi, ecx
>
> xor ecx, ecx
> add bl, bl
> xor DWORD PTR [bb+eax*4], esi
>
> or al, bl
> test esi, 0x0000FFFF
>
> setnz bl
> xor edx, edx
> test esi, 0x00FF00FF
>
> lea ebx, [eax*2+ebx]
> setnz cl
> test esi, 0x0F0F0F0F
>
> setnz dl
> shl cl, 3
> test esi, 0x33333333
>
> lea edx, [edx*4+ecx]
> setnz al
> test esi, 0x55555555
>
> setnz cl
> shl bl, 4
> add al, al
>
> or al, cl
> or dl, bl
> mov ebx, [esp]
>
> mov esi, [esp+4]
> or al, dl
> add esp, 8
>
> ; Worst-case one additional clock stall from AGI. Should usually be hidden.
>
>-Matt
I will try to study your code after class today. Meanwhile, here is my
first-cut at
LastOne() for Crafty using the same approach I used in the FirstOne() code, with
zero
branches, but doing two bit scan instructions which sounds not-so-good for
Athlons:
_LastOne:
movl $-33, %ecx
bsf 8(%esp),%eax
cmovz %ecx, %eax
addl $32,%eax
bsf 4(%esp), %ecx
cmovz %eax, %ecx
movl $63, %eax
subl %ecx, %eax
ret
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.