Author: Matt Taylor
Date: 03:02:13 01/24/03
Ah, I was -just- about to give up and post my results when I made another breakthrough. The routine I was optimizing went from 14->12 cycles! Very exciting. As it turns out, I decided -not- to fold. I manage to scrape together some cycles and populate spare registers with three of the constants for the search (0xFF00FF00, 0xF0F0F0F0, and 0xCCCCCCCC). Bit 5 is stashed away in the accumulator, and bit 4 I get through a comparison to see if the bit is in the upper half of the folded 32-bit (from 64-bit) value. Bits 1, 2, and 3 come from the technique I will demonstrate below. The least significant bit is similar, but I don't need an extra register for it. The first version of this function unrolled the following: test esi, mask ; esi = folded value setnz reg8 ; reg8 = (esi & mask != 0 ? 1 : 0) shl accum, 1 ; accum <<= 1 add accum, reg8 ; accum += reg8 New version does this: and reg32, esi ; Assume above + reg32 populated with mask add reg32, -1 ; cf = (reg32 != 0 ? 1 : 0) adc accum, accum ; accum = (accum << 1) + cf 3 instructions per cycle is the maximum number of instructions per cycle -any- x86 processor can dispatch. (You can technically achieve more on P6/K7 when you encounter dispatch stalls without decode stalls.) Perfect! Theoretical (ignoring setup costs) runtime is 10 clocks, 9% faster than Walter's table lookup routine in optimized assembly. Current execution time on Athlon is 12 clocks. That's pretty good. After Gerd's comment about the routine being inlined just about everywhere, I figure I will put up the optimized bsf routine (no data, small length, slow), optimized binary search (no data, long length, faster), and optimized table lookup (small data, medium length, fastest). I intend also to post routines in both gcc inline assembly and Intel/VC inline assembly form for easiest incorporation. I have tenative plans to work on a 64-bit popcount, too. AMD has published a 32-bit popcount and a 64-bit popcount, but I think I can double up their 32-bit popcount and achieve better performance in 64-bit. -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.