Author: Vincent Diepeveen
Date: 04:06:58 01/08/03
Go up one level in this thread
On January 06, 2003 at 13:12:24, Matt Taylor wrote: Matt, i see you use 8 bits code mixed with 32 bits. Isn't 8 bits code slower on P4? >On January 05, 2003 at 22:52:06, Robert Hyatt wrote: > >>On January 05, 2003 at 17:06:56, Matt Taylor wrote: >> >>>>I was chugging away on the older binary search routine yesterday. I've modified >>>>it to use a 5th register. This allows me to stagger the accumulation and make >>>>use of extra cycles that were burning waiting on instructions to retire. It is >>>>amazingly efficient -- considering the maximum of 3 instructions/cycle, there >>>>are only 7 slots wasted across the entire routine (38 instructions). This means >>>>optimum execution time for this sequence is 13 cycles, and I'm doing it in 15. >>> >>>Oops, did I say 15? I should correct myself. I am now doing it in 14. Only 4 >>>slots not utilized. I fixed a bug and gained performance. :-) >>> >>>-Matt >>\ >> >>If you like cute code, here is Eugene's FirstOne() function for >>Crafty. It is cute, but has one drawback in that it can not be inlined >>due to the addressing computations he does which means that the value being >>checked for FirstOne() has to be in memory... >> ><snip> >> >>I think you will be amused if you work thru it. I should add that Crafty's >>"bit numbering" has the MSB (leftmost) as bit 0, LSB (rightmost) as bit 63, >>which is how the Cray does this due to the fact that they have a "leadz" >>instruction that counts leading (leftmost) zeroes rather than finding the >>first one bit. The Cray also returns 64 if there are no bits set. Eugene's >>code emulates this perfectly. >> ><snip> > >I am actually interested more in your LastOne() routine. I looked at your code >for each previously and had noted the reverse bit ordering. The ordering is >arbitrary; they still could have used normal bit ordering (e.g. leadz used in >place of bsr, not bsf), but it makes no difference anyway. > >Eugene's trick is very interesting, taking advantage of the fact that cmp is >really sub without writing to the destination. I used a similar trick, though I >use setz instead of sbb. The sbb likely works better on Intel processors. I >think I will implement that. > >Unfortunately the branch, as you point out, can hurt. On a P6-core chip, the >bsf/bsr instructions work great. On Athlon, Eugene's routine branches and yours >has the disadvantage of using bsr twice (10 clocks at minimum for each). On the >other hand, both are relatively small. The routine I'm working on is some ~40 >instructions and more than 100 code bytes. > >This is obviously written in Intel assembly. The routine it is based off of >takes 14 clocks to execute on Athlon; I believe this takes an additional clock >because of Crafty requirements. It has the same problem Eugene's routine does >(with address computation), but considering x86 GPR restrictions, it is unlikely >a routine like this -could- work with the bitboard in registers. The memory >access comes free, and I believe one can coerce gcc to pass the bitboard in >through registers if gcc desires to. > >Also FYI this modifies the bitboard to reset the bit it finds. If that were >removed, this could be inlined with no memory access. > > ; Note, instructions dispatching in the same cycle are grouped. > ; Note preservation of ebx/esi. > ; Note: this routine tailored to Crafty's bit ordering! > > push esi > mov esi, DWORD PTR [bb] > xor eax, eax > > mov ecx, DWORD PTR [bb+4] > xor ebx, ebx > test esi, esi > > ; 2 cycles > cmovz esi, ecx > push ebx > setz al > mov ecx, esi > neg esi > > setnc bl > adc al, -1 > and esi, ecx > > xor ecx, ecx > add bl, bl > xor DWORD PTR [bb+eax*4], esi > > or al, bl > test esi, 0x0000FFFF > > setnz bl > xor edx, edx > test esi, 0x00FF00FF > > lea ebx, [eax*2+ebx] > setnz cl > test esi, 0x0F0F0F0F > > setnz dl > shl cl, 3 > test esi, 0x33333333 > > lea edx, [edx*4+ecx] > setnz al > test esi, 0x55555555 > > setnz cl > shl bl, 4 > add al, al > > or al, cl > or dl, bl > mov ebx, [esp] > > mov esi, [esp+4] > or al, dl > add esp, 8 > > ; Worst-case one additional clock stall from AGI. Should usually be hidden. > >-Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.