Author: Matt Taylor
Date: 06:05:30 12/03/02
On December 03, 2002 at 06:40:36, Gian-Carlo Pascutto wrote:

>On December 02, 2002 at 19:43:15, Gerd Isenberg wrote:
>
>>Congrats, Walter!!
>>
>>10-bit pattern       bsf   PI2FD  btr   c     LSB_64
>>0x0000000011111133   15.3  18.0   19.1  22.8  17.8
>>0x1010111010101110   19.7  18.5   19.6  23.4  17.8
>>0x1111113300000000   20.6  18.0   19.1  22.8  17.8
>
>Does this mean that this is the fastest currently known
>FindFirstAndRemove *and* that it doesn't need assembly?
>
>Impressive for sure!
>
>Any similar tricks for lightning-speed popcounting?
>
>What's the fastest sequence you've got for popcounting so
>far? (preferably one that doesn't depend too much on new
>instructions)
>
>--
>GCP

Actually, his algorithm incorporates part of one population count algorithm. The best average-case algorithm is the divide-and-conquer algorithm:

int pop_count(u32 x)
{
    // x becomes a vector of 2-bit sums (each 0, 1, or 2)
    x = ((x & 0xAAAAAAAA) >> 1) + (x & 0x55555555);

    // Now sum adjacent fields, doubling the field width at each step.
    // Both halves must be masked before adding so that the partial
    // sums do not bleed into each other.
    x = ((x & 0xCCCCCCCC) >> 2) + (x & 0x33333333);
    x = ((x & 0xF0F0F0F0) >> 4) + (x & 0x0F0F0F0F);
    x = ((x & 0xFF00FF00) >> 8) + (x & 0x00FF00FF);
    x = ((x & 0xFFFF0000) >> 16) + (x & 0x0000FFFF);

    // Done.
    return x;
}

The 64-bit variant is the same; you just use 64-bit integers, widen every mask to 64 bits, and add one extra assignment statement at the end:

    x = ((x & 0xFFFFFFFF00000000) >> 32) + (x & 0x00000000FFFFFFFF);

I couldn't find the older AMD manuals, but a quick Google search yields a page that reiterates the algorithm and gives AMD's optimal assembly code for it. (Their assembly uses only instructions that have been available since the 386.) I know imul on the Pentium was 10 clocks. I believe I have seen it stated in the literature that imul now costs only 2 or 3 cycles, much to my disbelief.

http://www.df.lth.se/~john_e/gems/gem002d.html
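
For anyone who wants something to drop into a test harness, here is a rough sketch in C, assuming C99 <stdint.h> types are available. The function names (pop_count_naive, pop_count32_mul, pop_count64, check_pop_counts) are made up for illustration. The multiply-based routine is the widely circulated variant that folds the four byte sums with a single multiply, which is where the cost of imul comes in; it is not claimed to match the AMD manual or the linked page instruction for instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version: clear the lowest set bit until none remain. */
int pop_count_naive(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        n++;
    }
    return n;
}

/* 32-bit count; the last two folding steps are replaced by one multiply. */
int pop_count32_mul(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (int)((x * 0x01010101u) >> 24);             /* add the 4 bytes */
}

/* 64-bit divide-and-conquer: six folding steps instead of five. */
int pop_count64(uint64_t x)
{
    x = ((x >> 1)  & 0x5555555555555555ull) + (x & 0x5555555555555555ull);
    x = ((x >> 2)  & 0x3333333333333333ull) + (x & 0x3333333333333333ull);
    x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0Full) + (x & 0x0F0F0F0F0F0F0F0Full);
    x = ((x >> 8)  & 0x00FF00FF00FF00FFull) + (x & 0x00FF00FF00FF00FFull);
    x = ((x >> 16) & 0x0000FFFF0000FFFFull) + (x & 0x0000FFFF0000FFFFull);
    x = (x >> 32) + (x & 0x00000000FFFFFFFFull);
    return (int)x;
}

/* Cross-check both routines against the naive loop on a few inputs,
   including the three 10-bit patterns from the quoted table. */
void check_pop_counts(void)
{
    static const uint64_t samples[] = {
        0, 1,
        0x0000000011111133ull,
        0x1010111010101110ull,
        0x1111113300000000ull,
        0xFFFFFFFFFFFFFFFFull
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint64_t v = samples[i];
        assert(pop_count64(v) == pop_count_naive(v));
        assert(pop_count32_mul((uint32_t)v) == pop_count_naive((uint32_t)v));
    }
}
```

Calling check_pop_counts() once at startup is a cheap sanity test. Whether the multiply version actually beats the pure shift-and-add version depends on the imul latency of the target CPU, so it is worth timing both on your own hardware.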
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.