Computer Chess Club Archives


Search

Terms

Messages

Subject: Bitboard Breakthroughs!

Author: Matt Taylor

Date: 03:02:13 01/24/03


Ah, I was -just- about to give up and post my results when I made another
breakthrough. The routine I was optimizing went from 14->12 cycles! Very
exciting.

As it turns out, I decided -not- to fold. I manage to scrape together some
cycles and populate spare registers with three of the constants for the search
(0xFF00FF00, 0xF0F0F0F0, and 0xCCCCCCCC). Bit 5 is stashed away in the
accumulator, and bit 4 I get through a comparison to see if the bit is in the
upper half of the folded 32-bit (from 64-bit) value. Bits 1, 2, and 3 come from
the technique I will demonstrate below. The least significant bit is similar,
but I don't need an extra register for it.

The first version of this function unrolled the following:
test esi, mask  ; esi = folded value
setnz reg8      ; reg8 = (esi & mask != 0 ? 1 : 0)
shl accum, 1    ; accum <<= 1
add accum, reg8 ; accum += reg8

New version does this:
and reg32, esi   ; Assume above + reg32 populated with mask
add reg32, -1    ; cf = (reg32 != 0 ? 1 : 0)
adc accum, accum ; accum = (accum << 1) + cf

3 instructions per cycle is the maximum number of instructions per cycle -any-
x86 processor can dispatch. (You can technically achieve more on P6/K7 when you
encounter dispatch stalls without decode stalls.) Perfect!

Theoretical (ignoring setup costs) runtime is 10 clocks, 9% faster than Walter's
table lookup routine in optimized assembly. Current execution time on Athlon is
12 clocks. That's pretty good.

After Gerd's comment about the routine being inlined just about everywhere, I
figure I will put up the optimized bsf routine (no data, small length, slow),
optimized binary search (no data, long length, faster), and optimized table
lookup (small data, medium length, fastest).

I intend also to post routines in both gcc inline assembly and Intel/VC inline
assembly form for easiest incorporation.

I have tenative plans to work on a 64-bit popcount, too. AMD has published a
32-bit popcount and a 64-bit popcount, but I think I can double up their 32-bit
popcount and achieve better performance in 64-bit.

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.