Author: Walter Faxon
Date: 06:03:06 07/11/03
Go up one level in this thread
On July 11, 2003 at 03:30:33, Gerd Isenberg wrote:
>On July 10, 2003 at 21:05:09, Russell Reagan wrote:
>
>>Thanks Gerd. This method seems approximately equal to the bsf assembler method
>>in speed on my Athlon (maybe a hair slower). Maybe this will be the fastest on a
>>64-bit cpu?
>
>On x86-32 Walter's routine should be faster - due the mul64 call with three 32
>bit muls (vector path on athlon). On x86-64 there is only one mul rax, and that
>is double direct path.
>
>Gerd
Matt does it better by doing one fold prior just a 32-bit multiply. This email
is from him to me, I think quoting from an email he wrote to
http://www.hackersdelight.org/ , based on the book and a good source for bit
hackers.
Start quote.
--------------------
For your bit twiddling delight. The last paragraph is the best part, I think:
"...so my algorithm actually beats the microcode bsf instruction when you count
OOOE scheduling."
--------------------
Thought this might be a useful addition to your webpage.
I was looking at ways to improve lzc & tzc. They fall back to the ones
function, but ones is a generic popcount. We can assume that lzc & tzc will
only popcount members of the set {0, 1, 3, ..., 2^(bits - 1) - 1}. For this
case, I have a faster popcount.
My assumption is that for any particular number of bits there is a magic number
that makes ((x * magic) >> log2_floor(bits)) & (1 << log2_ceil(bits)) return
unique values for all x in {0, 1, 3, 7, ..., 2^(bits - 1) - 1}. I have
experimentally found that between 2 & 32 bits there is always a magic number
that produces a corresponding table with no more than log2_ceil(bits) entries.
It is probably safe to say that this holds true above 32 bits as well, but I
don't really understand why it works.
Here is my result for 16-bits:
const int magic16 = 0xE136;
int lzc16_tbl[] =
{ 0, 15, 3, 4, 5, 8, 6, 12,
9, 14, 2, 7, 11, 13, 1, 10};
int lzc16(int x)
{
x ^= (x - 1);
x *= magic16;
return lzc16_tbl[(x >> 4) & 15];
}
Here's a generic version:
const int magic; // you compute this
int lzc_tbl[] = {...}; // you also compute this
int lzc(int x)
{
x ^= (x - 1);
x *= magic;
return lzc_tbl[(x >> log2_floor(bits)) & (1 << log2_ceil(bits))];
}
For lzc64, I xor the high half with the low half right after the x ^= (x - 1)
step. This changes the assumption a little, but I am still able to build a
64-entry table, and working in 32-bits on my Athlon allows me to do an lzc64 in
no more than 10 cycles (assuming a cache hit). For comparison, bsf r32, r32
finishes in 8 cycles. Reliable sources inform me that Opteron/Athlon64 will bsf
r64, r64 in 9 cycles. However, the multiplier is also faster, so my algorithm
actually beats the microcode bsf instruction when you count OOOE scheduling.
int lsz64_tbl[64] =
{ 0, 31, 4, 33, 60, 15, 12, 34, 61, 25, 51, 10, 56, 20, 22, 35,
62, 30, 3, 54, 52, 24, 42, 19, 57, 29, 2, 44, 47, 28, 1, 36,
63, 32, 59, 5, 6, 50, 55, 7, 16, 53, 13, 41, 8, 43, 46, 17,
26, 58, 49, 14, 11, 40, 9, 45, 21, 48, 39, 23, 18, 38, 37, 27};
int lzc64(long long x)
{
int x32;
x ^= (x - 1);
x32 = ((int) x) ^ ((int)(x >> 32));
x32 *= magic;
return lzc_tbl[(x32 >> 26) & 63];
}
-Matt Taylor
--------------------
End quote.
You can get the last magic value by searching the 32-bit space for matches to
his table; it might take a second or so. I'm doing it now... :)
-- Walter
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.