Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Bitscan

Author: Matt Taylor
Date: 22:56:13 02/11/03
On February 11, 2003 at 14:49:37, Walter Faxon wrote:

>Yay, Matt!
>
>I'm probably not going to get a chance to closely examine your work again soon,
>but here's a little doc. problem:  "...converts it to an index counting up from
>0 to 31..." should of course be 0 to 63.

Good point. Fixed. I'll update the copy on the FTP. Once again, address is:
ftp://24.73.126.6/lsb.c
ftp://24.73.126.6/lsb.h

>Also, for the C routine after:
>
>    #ifdef __LSB_USE_C_INLINE
>    #define __LSB_USE_TABLE
>    static _INLINE _I32_TYPE BitSearchResetInternal(_LSB_I64 *bb)
>
>I don't see where you're resetting the bit.

Oops! I tested the 3 inline assembly versions, but I forgot that I'd even
included an inline C version. At your suggestion I've already reimplemented it
using the table-based version. (It was supposed to be that to begin with as you
will notice it defines __LSB_USE_TABLE.) The reimplementation correctly resets
the bit.

>Once you're satisfied with everything you should encourage chess programmers to
>experiment with your routines and give you some feedback.  For example, IIRC,
>Uri said that the table version was definitely faster than the bsf version when
>run in test mode, but slower when run inside his program.  He attributed this to
>his using a lot of loops with the former's larger code size greatly increasing
>his program's cache footprint.  (Maybe he somehow duplicated the table for each
>loop or routine??  That would lead to cache misses.)

One advantage to bsf is that it can be more readily inlined. I quickly poured
through the 3 GCC versions of the routine and came up with the following sizes
(approximations):

fast bsf - 27 bytes
slow bsf - 31 bytes
table    - 50 bytes

The table-based routine is a good 4-5 cycles faster than the "slow bsf" on
Athlon. However, if the compiler decides to generate a function call, that speed
will be completely negated with something like 8-10 cycles function call
overhead. There is a general rule for deciding when to inline (something like
less than x instructions or x bytes), but I can't remember what it is now. I
believe I read it in the K7 optimization manual.

Alternatively, a trade-off can be made. If you find a few particular hotspots,
you can use an inline table version for the hotspots and inline slow bsf (or
fast bsf on P6) in other places. Maybe in places that are infrequently hit (if
you are that lucky), don't use inline assembly at all. Maybe. Tweaking the
program to that degree is annoying, and one can only hope that profiling tools
will make these trade-offs for you.

>Also, you might consider promulgating your results to other sites on
>bit-fiddling tricks, e.g., Magic algorithms from The Aggregate
>(http://aggregate.org/MAGIC/).  They probably want pure C versions.

Probably, and I'm not being particularly clever at the C-level; I only used
suggestions you gave. Their tzc (Trailing Zero Count) is not particularly
efficient, but it isn't too bad either. The main problem is that they subtract
one and popcount. I did examine this technique when working on optimizing the
bitscan routines, but it was rather inefficient because the popcount the
popcount is inefficient (for this problem). The popcount can't make assumptions
about relative bit locations -- assumptions we -can- make since we know there
are i contiguous, right-aligned 1's in (x & -x) - 1, where i is the index of the
least-signficiant bit in x.

Your table solution is good, and it also seems to be compiler-friendly. Until I
made a few reductions, the code that VC 7 generated was faster than my own. The
binary search is also very good.

>Anyway, thanks again!  I'm sure your work will help a lot of people.  Well, a
>few dozen, anyway!
>
>-- Walter

I hope it was worth more than an exercise. :-)

I'm thinking of adding popcount straight from AMD's manual. I may in the future
add optimized versions of bitscan for other chips, and I may add other versions
(i.e. with no side-effects, zero bitboard allowed, etc.) If I manage to complete
anything else, I'll post about it here.

-Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.