Author: Matt Taylor
Date: 13:22:16 01/06/03
Go up one level in this thread
On January 06, 2003 at 14:17:17, Robert Hyatt wrote: >On January 06, 2003 at 13:12:24, Matt Taylor wrote: > >>On January 05, 2003 at 22:52:06, Robert Hyatt wrote: >> >>>On January 05, 2003 at 17:06:56, Matt Taylor wrote: >>> >>>>>I was chugging away on the older binary search routine yesterday. I've modified >>>>>it to use a 5th register. This allows me to stagger the accumulation and make >>>>>use of extra cycles that were burning waiting on instructions to retire. It is >>>>>amazingly efficient -- considering the maximum of 3 instructions/cycle, there >>>>>are only 7 slots wasted across the entire routine (38 instructions). This means >>>>>optimum execution time for this sequence is 13 cycles, and I'm doing it in 15. >>>> >>>>Oops, did I say 15? I should correct myself. I am now doing it in 14. Only 4 >>>>slots not utilized. I fixed a bug and gained performance. :-) >>>> >>>>-Matt >>>\ >>> >>>If you like cute code, here is Eugene's FirstOne() function for >>>Crafty. It is cute, but has one drawback in that it can not be inlined >>>due to the addressing computations he does which means that the value being >>>checked for FirstOne() has to be in memory... >>> >><snip> >>> >>>I think you will be amused if you work thru it. I should add that Crafty's >>>"bit numbering" has the MSB (leftmost) as bit 0, LSB (rightmost) as bit 63, >>>which is how the Cray does this due to the fact that they have a "leadz" >>>instruction that counts leading (leftmost) zeroes rather than finding the >>>first one bit. The Cray also returns 64 if there are no bits set. Eugene's >>>code emulates this perfectly. >>> >><snip> >> >>I am actually interested more in your LastOne() routine. I looked at your code >>for each previously and had noted the reverse bit ordering. The ordering is >>arbitrary; they still could have used normal bit ordering (e.g. leadz used in >>place of bsr, not bsf), but it makes no difference anyway. >> >>Eugene's trick is very interesting, taking advantage of the fact that cmp is >>really sub without writing to the destination. I used a similar trick, though I >>use setz instead of sbb. The sbb likely works better on Intel processors. I >>think I will implement that. >> >>Unfortunately the branch, as you point out, can hurt. On a P6-core chip, the >>bsf/bsr instructions work great. On Athlon, Eugene's routine branches and yours >>has the disadvantage of using bsr twice (10 clocks at minimum for each). On the >>other hand, both are relatively small. The routine I'm working on is some ~40 >>instructions and more than 100 code bytes. >> >>This is obviously written in Intel assembly. The routine it is based off of >>takes 14 clocks to execute on Athlon; I believe this takes an additional clock >>because of Crafty requirements. It has the same problem Eugene's routine does >>(with address computation), but considering x86 GPR restrictions, it is unlikely >>a routine like this -could- work with the bitboard in registers. The memory >>access comes free, and I believe one can coerce gcc to pass the bitboard in >>through registers if gcc desires to. >> >>Also FYI this modifies the bitboard to reset the bit it finds. If that were >>removed, this could be inlined with no memory access. <snip code> >> >>-Matt > > >I will try to study your code after class today. Meanwhile, here is my >first-cut at >LastOne() for Crafty using the same approach I used in the FirstOne() code, with >zero >branches, but doing two bit scan instructions which sounds not-so-good for >Athlons: > <snip code> The reason why this routine is FASTER than the microcode is because Athlon is going to have to do 2 32-bit bitscans. This routine folds 64-bits into 32-bits and then does 1 32-bit scan. The same principle should work for bsf, but I'm still working with some other variants of this routine with a target of 11-12 cycles for the full 64-bit search. Most likely Pentium 4 uses a similar approach. The P6 is the only processor I know of with a fast bsf/bsr implementation. Intel used a priority encoder. On K6/P5 and previous processors, bsf/bsr are implemented as best as I can tell with a loop. I don't know that studying this routine will accomplish much. The original routine suggested by Walter Faxon is a binary search like Athlon's microcode implementation. However, it's so convoluded at this point that it would be difficult to read. I'll assist by offering a brief description of some of the things I do. This routine does the equivalent of ((bb & mask) != 0) in C. It accumulates those bits to form the index at the end. I use the lea instruction for the obvious reason -- shift, add, and assign to another register. The unfortunate drawback is that lea requires 2 cycles to execute. At the beginning of this function, I mask out all but the lowest bit with the bb & -bb trick. I put bit 5 of the index into al and fold to do a 32-bit search. Bit 4 goes into bl, bit 3 into cl, and bit 2 into dl. I accumulate into ebx and edx, and then I store bits 1 and 0 in al and cl respectively. The accumulation operations are a little confusing. Each boolean comparison gives me 1 bit of the index, but the result is in the LSB of the operand. It needs to be shifted into place before being or'd with the other bits to form the index. The first accumulate puts index bits 4 & 5 into bl without shifting them to form a mask. The second accumulate moves cl into place and forms a mask in dl. One other comment is noteworthy for this function. The push instruction is DirectPath (up to 3/cycle) while the pop instruction is VectorPath (1/cycle, blocks resources). For this reason, I load ebx/esi from the stack manually instead of using pop. The address [esp+4] is considered complex (requires AGU), so it takes 1 cycle beyond the end of this code. This represents a possible AGI if code directly afterward utilizes memory. -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.