Author: Gerd Isenberg
Date: 05:07:31 01/24/03
On January 24, 2003 at 06:02:13, Matt Taylor wrote:

>Ah, I was -just- about to give up and post my results when I made another
>breakthrough. The routine I was optimizing went from 14->12 cycles! Very
>exciting.
>
>As it turns out, I decided -not- to fold. I managed to scrape together some
>cycles and populate spare registers with three of the constants for the search
>(0xFF00FF00, 0xF0F0F0F0, and 0xCCCCCCCC). Bit 5 is stashed away in the
>accumulator, and bit 4 I get through a comparison to see if the bit is in the
>upper half of the folded 32-bit (from 64-bit) value. Bits 1, 2, and 3 come from
>the technique I will demonstrate below. The least significant bit is similar,
>but I don't need an extra register for it.
>
>The first version of this function unrolled the following:
>test esi, mask    ; esi = folded value
>setnz reg8        ; reg8 = (esi & mask != 0 ? 1 : 0)
>shl accum, 1      ; accum <<= 1
>add accum, reg8   ; accum += reg8
>
>The new version does this:
>and reg32, esi    ; assume above + reg32 populated with mask
>add reg32, -1     ; cf = (reg32 != 0 ? 1 : 0)
>adc accum, accum  ; accum = (accum << 1) + cf
>
>Three instructions per cycle is the maximum -any- x86 processor can dispatch.
>(You can technically achieve more on P6/K7 when you encounter dispatch stalls
>without decode stalls.) Perfect!
>
>Theoretical (ignoring setup costs) runtime is 10 clocks, 9% faster than Walter's
>table lookup routine in optimized assembly. Current execution time on Athlon is
>12 clocks. That's pretty good.
>
>After Gerd's comment about the routine being inlined just about everywhere, I
>figure I will put up the optimized bsf routine (no data, small length, slow),
>the optimized binary search (no data, long length, faster), and the optimized
>table lookup (small data, medium length, fastest).
>
>I also intend to post the routines in both gcc inline assembly and Intel/VC
>inline assembly form for easiest incorporation.
>
>I have tentative plans to work on a 64-bit popcount, too. AMD has published a
>32-bit popcount and a 64-bit popcount, but I think I can double up their 32-bit
>popcount and achieve better performance in 64-bit.
>
>-Matt

Hi Matt,

I have some problems following your thoughts on the fly; it takes some time.
As far as I see, you do a "special" popcount of the trailing "ones" after
bit = (bb & -bb) - 1, to get the index of the LSB.

Looking forward to your routines ;-)

Gerd
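
For reference, a minimal C sketch of the idea as summarized above: isolate the
least significant set bit, subtract one to turn it into a mask of trailing ones,
and count those ones -- the count is the index of the LSB. This is only an
illustration under that reading; the helper name lsb_index and the generic SWAR
popcount are illustrative and are not Matt's specialized register-level routine.

    #include <stdint.h>

    /* Sketch only: LSB index as popcount of the trailing-ones mask.
     * Precondition: bb != 0. */
    static int lsb_index(uint64_t bb)
    {
        uint64_t x = (bb & (0 - bb)) - 1;  /* all bits below the least significant set bit */

        /* generic 64-bit SWAR popcount */
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
        return (int)((x * 0x0101010101010101ULL) >> 56);
    }

The quoted assembly accumulates the answer differently: each
and / add -1 / adc step appends one ((value & mask) != 0) bit to the
accumulator via the carry flag, with the folded 32-bit value tested against
constants such as 0xFF00FF00, 0xF0F0F0F0, and 0xCCCCCCCC, and the high index
bit derived separately from the 64-to-32 fold.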