Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: There is huge potential to improve further

Author: Walter Faxon

Date: 06:03:06 07/11/03

Go up one level in this thread


On July 11, 2003 at 03:30:33, Gerd Isenberg wrote:

>On July 10, 2003 at 21:05:09, Russell Reagan wrote:
>
>>Thanks Gerd. This method seems approximately equal to the bsf assembler method
>>in speed on my Athlon (maybe a hair slower). Maybe this will be the fastest on a
>>64-bit cpu?
>
>On x86-32 Walter's routine should be faster - due the mul64 call with three 32
>bit muls (vector path on athlon). On x86-64 there is only one mul rax, and that
>is double direct path.
>
>Gerd


Matt does it better by doing one fold prior just a 32-bit multiply.  This email
is from him to me, I think quoting from an email he wrote to
http://www.hackersdelight.org/  , based on the book and a good source for bit
hackers.

Start quote.
--------------------
For your bit twiddling delight.  The last paragraph is the best part, I think:
"...so my algorithm actually beats the microcode bsf instruction when you count
OOOE scheduling."

--------------------
Thought this might be a useful addition to your webpage.

I was looking at ways to improve lzc & tzc.  They fall back to the ones
function, but ones is a generic popcount.  We can assume that lzc & tzc will
only popcount members of the set {0, 1, 3, ..., 2^(bits - 1) - 1}.  For this
case, I have a faster popcount.

My assumption is that for any particular number of bits there is a magic number
that makes ((x * magic) >> log2_floor(bits)) & (1 << log2_ceil(bits)) return
unique values for all x in {0, 1, 3, 7, ..., 2^(bits - 1) - 1}.  I have
experimentally found that between 2 & 32 bits there is always a magic number
that produces a corresponding table with no more than log2_ceil(bits) entries.
It is probably safe to say that this holds true above 32 bits as well, but I
don't really understand why it works.

Here is my result for 16-bits:

const int magic16 = 0xE136;
int lzc16_tbl[] =
{ 0, 15,  3,  4,  5,  8,  6, 12,
  9, 14,  2,  7, 11, 13,  1, 10};
int lzc16(int x)
{
   x ^= (x - 1);
   x *= magic16;
   return lzc16_tbl[(x >> 4) & 15];
}

Here's a generic version:

const int magic;  // you compute this
int lzc_tbl[] = {...};  // you also compute this
int lzc(int x)
{
   x ^= (x - 1);
   x *= magic;
   return lzc_tbl[(x >> log2_floor(bits)) & (1 << log2_ceil(bits))];
}

For lzc64, I xor the high half with the low half right after the x ^= (x - 1)
step.  This changes the assumption a little, but I am still able to build a
64-entry table, and working in 32-bits on my Athlon allows me to do an lzc64 in
no more than 10 cycles (assuming a cache hit).  For comparison, bsf r32, r32
finishes in 8 cycles.  Reliable sources inform me that Opteron/Athlon64 will bsf
r64, r64 in 9 cycles.  However, the multiplier is also faster, so my algorithm
actually beats the microcode bsf instruction when you count OOOE scheduling.

int lsz64_tbl[64] =
{ 0, 31,  4, 33, 60, 15, 12, 34, 61, 25, 51, 10, 56, 20, 22, 35,
 62, 30,  3, 54, 52, 24, 42, 19, 57, 29,  2, 44, 47, 28,  1, 36,
 63, 32, 59,  5,  6, 50, 55,  7, 16, 53, 13, 41,  8, 43, 46, 17,
 26, 58, 49, 14, 11, 40,  9, 45, 21, 48, 39, 23, 18, 38, 37, 27};

int lzc64(long long x)
{
   int x32;
   x ^= (x - 1);
   x32 = ((int) x) ^ ((int)(x >> 32));
   x32 *= magic;
   return lzc_tbl[(x32 >> 26) & 63];
}

-Matt Taylor
--------------------
End quote.

You can get the last magic value by searching the 32-bit space for matches to
his table; it might take a second or so.  I'm doing it now... :)

-- Walter



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.