Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Saved Another Cycle -- Woohoo!

Author: Matt Taylor

Date: 10:12:24 01/06/03

Go up one level in this thread


On January 05, 2003 at 22:52:06, Robert Hyatt wrote:

>On January 05, 2003 at 17:06:56, Matt Taylor wrote:
>
>>>I was chugging away on the older binary search routine yesterday. I've modified
>>>it to use a 5th register. This allows me to stagger the accumulation and make
>>>use of extra cycles that were burning waiting on instructions to retire. It is
>>>amazingly efficient -- considering the maximum of 3 instructions/cycle, there
>>>are only 7 slots wasted across the entire routine (38 instructions). This means
>>>optimum execution time for this sequence is 13 cycles, and I'm doing it in 15.
>>
>>Oops, did I say 15? I should correct myself. I am now doing it in 14. Only 4
>>slots not utilized. I fixed a bug and gained performance. :-)
>>
>>-Matt
>\
>
>If you like cute code, here is Eugene's FirstOne() function for
>Crafty.  It is cute, but has one drawback in that it can not be inlined
>due to the addressing computations he does which means that the value being
>checked for FirstOne() has to be in memory...
>
<snip>
>
>I think you will be amused if you work thru it.  I should add that Crafty's
>"bit numbering" has the MSB (leftmost) as bit 0, LSB (rightmost) as bit 63,
>which is how the Cray does this due to the fact that they have a "leadz"
>instruction that counts leading (leftmost) zeroes rather than finding the
>first one bit.  The Cray also returns 64 if there are no bits set.  Eugene's
>code emulates this perfectly.
>
<snip>

I am actually interested more in your LastOne() routine. I looked at your code
for each previously and had noted the reverse bit ordering. The ordering is
arbitrary; they still could have used normal bit ordering (e.g. leadz used in
place of bsr, not bsf), but it makes no difference anyway.

Eugene's trick is very interesting, taking advantage of the fact that cmp is
really sub without writing to the destination. I used a similar trick, though I
use setz instead of sbb. The sbb likely works better on Intel processors. I
think I will implement that.

Unfortunately the branch, as you point out, can hurt. On a P6-core chip, the
bsf/bsr instructions work great. On Athlon, Eugene's routine branches and yours
has the disadvantage of using bsr twice (10 clocks at minimum for each). On the
other hand, both are relatively small. The routine I'm working on is some ~40
instructions and more than 100 code bytes.

This is obviously written in Intel assembly. The routine it is based off of
takes 14 clocks to execute on Athlon; I believe this takes an additional clock
because of Crafty requirements. It has the same problem Eugene's routine does
(with address computation), but considering x86 GPR restrictions, it is unlikely
a routine like this -could- work with the bitboard in registers. The memory
access comes free, and I believe one can coerce gcc to pass the bitboard in
through registers if gcc desires to.

Also FYI this modifies the bitboard to reset the bit it finds. If that were
removed, this could be inlined with no memory access.

			; Note, instructions dispatching in the same cycle are grouped.
			; Note preservation of ebx/esi.
			; Note: this routine tailored to Crafty's bit ordering!

			push	esi
			mov	esi, DWORD PTR [bb]
			xor	eax, eax

			mov	ecx, DWORD PTR [bb+4]
			xor	ebx, ebx
			test	esi, esi

			; 2 cycles
			cmovz	esi, ecx
			push	ebx
			setz	al
			mov	ecx, esi
			neg	esi

			setnc	bl
			adc	al, -1
			and	esi, ecx

			xor	ecx, ecx
			add	bl, bl
			xor	DWORD PTR [bb+eax*4], esi

			or	al, bl
			test	esi, 0x0000FFFF

			setnz	bl
			xor	edx, edx
			test	esi, 0x00FF00FF

			lea	ebx, [eax*2+ebx]
			setnz	cl
			test	esi, 0x0F0F0F0F

			setnz	dl
			shl	cl, 3
			test	esi, 0x33333333

			lea	edx, [edx*4+ecx]
			setnz	al
			test	esi, 0x55555555

			setnz	cl
			shl	bl, 4
			add	al, al

			or	al, cl
			or	dl, bl
			mov	ebx, [esp]

			mov	esi, [esp+4]
			or	al, dl
			add	esp, 8

			; Worst-case one additional clock stall from AGI. Should usually be hidden.

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.