Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Bitboard Breakthroughs!

Author: Gerd Isenberg

Date: 05:07:31 01/24/03

Go up one level in this thread


On January 24, 2003 at 06:02:13, Matt Taylor wrote:

>Ah, I was -just- about to give up and post my results when I made another
>breakthrough. The routine I was optimizing went from 14->12 cycles! Very
>exciting.
>
>As it turns out, I decided -not- to fold. I manage to scrape together some
>cycles and populate spare registers with three of the constants for the search
>(0xFF00FF00, 0xF0F0F0F0, and 0xCCCCCCCC). Bit 5 is stashed away in the
>accumulator, and bit 4 I get through a comparison to see if the bit is in the
>upper half of the folded 32-bit (from 64-bit) value. Bits 1, 2, and 3 come from
>the technique I will demonstrate below. The least significant bit is similar,
>but I don't need an extra register for it.
>
>The first version of this function unrolled the following:
>test esi, mask  ; esi = folded value
>setnz reg8      ; reg8 = (esi & mask != 0 ? 1 : 0)
>shl accum, 1    ; accum <<= 1
>add accum, reg8 ; accum += reg8
>
>New version does this:
>and reg32, esi   ; Assume above + reg32 populated with mask
>add reg32, -1    ; cf = (reg32 != 0 ? 1 : 0)
>adc accum, accum ; accum = (accum << 1) + cf
>
>3 instructions per cycle is the maximum number of instructions per cycle -any-
>x86 processor can dispatch. (You can technically achieve more on P6/K7 when you
>encounter dispatch stalls without decode stalls.) Perfect!
>
>Theoretical (ignoring setup costs) runtime is 10 clocks, 9% faster than Walter's
>table lookup routine in optimized assembly. Current execution time on Athlon is
>12 clocks. That's pretty good.
>
>After Gerd's comment about the routine being inlined just about everywhere, I
>figure I will put up the optimized bsf routine (no data, small length, slow),
>optimized binary search (no data, long length, faster), and optimized table
>lookup (small data, medium length, fastest).
>
>I intend also to post routines in both gcc inline assembly and Intel/VC inline
>assembly form for easiest incorporation.
>
>I have tenative plans to work on a 64-bit popcount, too. AMD has published a
>32-bit popcount and a 64-bit popcount, but I think I can double up their 32-bit
>popcount and achieve better performance in 64-bit.
>
>-Matt

Hi Matt,

Have some problems to follow your thoughts on the fly, takes some time.
As far as i see, you do a "special" popcount for trailing "ones" after
bit = (bb & -bb) - 1, to get the index of the LSB. Looking foreward to your
routines ;-)

Gerd




This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.