Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Saved Another Cycle -- Woohoo!

Author: Vincent Diepeveen

Date: 04:06:58 01/08/03

Go up one level in this thread


On January 06, 2003 at 13:12:24, Matt Taylor wrote:

Matt, i see you use 8 bits code mixed with 32 bits.

Isn't 8 bits code slower on P4?

>On January 05, 2003 at 22:52:06, Robert Hyatt wrote:
>
>>On January 05, 2003 at 17:06:56, Matt Taylor wrote:
>>
>>>>I was chugging away on the older binary search routine yesterday. I've modified
>>>>it to use a 5th register. This allows me to stagger the accumulation and make
>>>>use of extra cycles that were burning waiting on instructions to retire. It is
>>>>amazingly efficient -- considering the maximum of 3 instructions/cycle, there
>>>>are only 7 slots wasted across the entire routine (38 instructions). This means
>>>>optimum execution time for this sequence is 13 cycles, and I'm doing it in 15.
>>>
>>>Oops, did I say 15? I should correct myself. I am now doing it in 14. Only 4
>>>slots not utilized. I fixed a bug and gained performance. :-)
>>>
>>>-Matt
>>\
>>
>>If you like cute code, here is Eugene's FirstOne() function for
>>Crafty.  It is cute, but has one drawback in that it can not be inlined
>>due to the addressing computations he does which means that the value being
>>checked for FirstOne() has to be in memory...
>>
><snip>
>>
>>I think you will be amused if you work thru it.  I should add that Crafty's
>>"bit numbering" has the MSB (leftmost) as bit 0, LSB (rightmost) as bit 63,
>>which is how the Cray does this due to the fact that they have a "leadz"
>>instruction that counts leading (leftmost) zeroes rather than finding the
>>first one bit.  The Cray also returns 64 if there are no bits set.  Eugene's
>>code emulates this perfectly.
>>
><snip>
>
>I am actually interested more in your LastOne() routine. I looked at your code
>for each previously and had noted the reverse bit ordering. The ordering is
>arbitrary; they still could have used normal bit ordering (e.g. leadz used in
>place of bsr, not bsf), but it makes no difference anyway.
>
>Eugene's trick is very interesting, taking advantage of the fact that cmp is
>really sub without writing to the destination. I used a similar trick, though I
>use setz instead of sbb. The sbb likely works better on Intel processors. I
>think I will implement that.
>
>Unfortunately the branch, as you point out, can hurt. On a P6-core chip, the
>bsf/bsr instructions work great. On Athlon, Eugene's routine branches and yours
>has the disadvantage of using bsr twice (10 clocks at minimum for each). On the
>other hand, both are relatively small. The routine I'm working on is some ~40
>instructions and more than 100 code bytes.
>
>This is obviously written in Intel assembly. The routine it is based off of
>takes 14 clocks to execute on Athlon; I believe this takes an additional clock
>because of Crafty requirements. It has the same problem Eugene's routine does
>(with address computation), but considering x86 GPR restrictions, it is unlikely
>a routine like this -could- work with the bitboard in registers. The memory
>access comes free, and I believe one can coerce gcc to pass the bitboard in
>through registers if gcc desires to.
>
>Also FYI this modifies the bitboard to reset the bit it finds. If that were
>removed, this could be inlined with no memory access.
>
>			; Note, instructions dispatching in the same cycle are grouped.
>			; Note preservation of ebx/esi.
>			; Note: this routine tailored to Crafty's bit ordering!
>
>			push	esi
>			mov	esi, DWORD PTR [bb]
>			xor	eax, eax
>
>			mov	ecx, DWORD PTR [bb+4]
>			xor	ebx, ebx
>			test	esi, esi
>
>			; 2 cycles
>			cmovz	esi, ecx
>			push	ebx
>			setz	al
>			mov	ecx, esi
>			neg	esi
>
>			setnc	bl
>			adc	al, -1
>			and	esi, ecx
>
>			xor	ecx, ecx
>			add	bl, bl
>			xor	DWORD PTR [bb+eax*4], esi
>
>			or	al, bl
>			test	esi, 0x0000FFFF
>
>			setnz	bl
>			xor	edx, edx
>			test	esi, 0x00FF00FF
>
>			lea	ebx, [eax*2+ebx]
>			setnz	cl
>			test	esi, 0x0F0F0F0F
>
>			setnz	dl
>			shl	cl, 3
>			test	esi, 0x33333333
>
>			lea	edx, [edx*4+ecx]
>			setnz	al
>			test	esi, 0x55555555
>
>			setnz	cl
>			shl	bl, 4
>			add	al, al
>
>			or	al, cl
>			or	dl, bl
>			mov	ebx, [esp]
>
>			mov	esi, [esp+4]
>			or	al, dl
>			add	esp, 8
>
>			; Worst-case one additional clock stall from AGI. Should usually be hidden.
>
>-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.