Author: Matt Taylor
Date: 10:12:24 01/06/03
Go up one level in this thread
On January 05, 2003 at 22:52:06, Robert Hyatt wrote: >On January 05, 2003 at 17:06:56, Matt Taylor wrote: > >>>I was chugging away on the older binary search routine yesterday. I've modified >>>it to use a 5th register. This allows me to stagger the accumulation and make >>>use of extra cycles that were burning waiting on instructions to retire. It is >>>amazingly efficient -- considering the maximum of 3 instructions/cycle, there >>>are only 7 slots wasted across the entire routine (38 instructions). This means >>>optimum execution time for this sequence is 13 cycles, and I'm doing it in 15. >> >>Oops, did I say 15? I should correct myself. I am now doing it in 14. Only 4 >>slots not utilized. I fixed a bug and gained performance. :-) >> >>-Matt >\ > >If you like cute code, here is Eugene's FirstOne() function for >Crafty. It is cute, but has one drawback in that it can not be inlined >due to the addressing computations he does which means that the value being >checked for FirstOne() has to be in memory... > <snip> > >I think you will be amused if you work thru it. I should add that Crafty's >"bit numbering" has the MSB (leftmost) as bit 0, LSB (rightmost) as bit 63, >which is how the Cray does this due to the fact that they have a "leadz" >instruction that counts leading (leftmost) zeroes rather than finding the >first one bit. The Cray also returns 64 if there are no bits set. Eugene's >code emulates this perfectly. > <snip> I am actually interested more in your LastOne() routine. I looked at your code for each previously and had noted the reverse bit ordering. The ordering is arbitrary; they still could have used normal bit ordering (e.g. leadz used in place of bsr, not bsf), but it makes no difference anyway. Eugene's trick is very interesting, taking advantage of the fact that cmp is really sub without writing to the destination. I used a similar trick, though I use setz instead of sbb. The sbb likely works better on Intel processors. I think I will implement that. Unfortunately the branch, as you point out, can hurt. On a P6-core chip, the bsf/bsr instructions work great. On Athlon, Eugene's routine branches and yours has the disadvantage of using bsr twice (10 clocks at minimum for each). On the other hand, both are relatively small. The routine I'm working on is some ~40 instructions and more than 100 code bytes. This is obviously written in Intel assembly. The routine it is based off of takes 14 clocks to execute on Athlon; I believe this takes an additional clock because of Crafty requirements. It has the same problem Eugene's routine does (with address computation), but considering x86 GPR restrictions, it is unlikely a routine like this -could- work with the bitboard in registers. The memory access comes free, and I believe one can coerce gcc to pass the bitboard in through registers if gcc desires to. Also FYI this modifies the bitboard to reset the bit it finds. If that were removed, this could be inlined with no memory access. ; Note, instructions dispatching in the same cycle are grouped. ; Note preservation of ebx/esi. ; Note: this routine tailored to Crafty's bit ordering! push esi mov esi, DWORD PTR [bb] xor eax, eax mov ecx, DWORD PTR [bb+4] xor ebx, ebx test esi, esi ; 2 cycles cmovz esi, ecx push ebx setz al mov ecx, esi neg esi setnc bl adc al, -1 and esi, ecx xor ecx, ecx add bl, bl xor DWORD PTR [bb+eax*4], esi or al, bl test esi, 0x0000FFFF setnz bl xor edx, edx test esi, 0x00FF00FF lea ebx, [eax*2+ebx] setnz cl test esi, 0x0F0F0F0F setnz dl shl cl, 3 test esi, 0x33333333 lea edx, [edx*4+ecx] setnz al test esi, 0x55555555 setnz cl shl bl, 4 add al, al or al, cl or dl, bl mov ebx, [esp] mov esi, [esp+4] or al, dl add esp, 8 ; Worst-case one additional clock stall from AGI. Should usually be hidden. -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.