Author: Bas Hamstra
Date: 03:57:48 07/13/03
Go up one level in this thread
Coicidentally I am working a bit on LastOne too right now. I also have a Athlon
XP (like you I believe). So BSF isn't too fast. The question is: can you
outperform it? At first it seems so, in a zillion times loop the following code
appeared to be nearly a factor 3 faster than the BSF routine below:
const char Tabel[64] =
{ 0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0,
0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0,
14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13,
26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
};
__forceinline int LastOne2(unsigned __int64 Bits)
{ unsigned int *p = (unsigned int*) &Bits;
unsigned int Short;
if(p[0])
{ Short = p[0];
Short = Short ^ (Short-1);
Short *= 7*255*255*25;
Short >>= 26;
return Tabel[Short&63];
}
Short = p[1];
Short = Short ^ (Short-1);
Short *= 7*255*255*255;
Short >>= 26;
return Tabel[Short&63]+32;
}
VC produces very efficient asm code for it, as far as I can judge. However how
about the overall performance in the program? It seems to be noticably slower
than plain BSF, like below:
__forceinline int LastOne(BB M)
{ __asm
{ mov EAX, dword ptr [M]
XOR EDX, EDX
cmp EAX, 0
jnz L2
mov EAX, dword ptr [M+4]
add EDX, 32
L2: bsf EAX, EAX
ADD EAX, EDX
}
}
Both routines have an extra branch to test the first 32 bits, so that is not the
problem!?
Regards,
Bas.
On July 13, 2003 at 06:03:00, Gerd Isenberg wrote:
>Hi all,
>
>Matt Taylor's invention of the super magic de Bruijn Constant 0x78291ACF is of
>course usefull for yet another, but fast and portable _firstOne_ Bitscan routine
>with only one 32bit multiplication!
>
>
>int firstOneKey(BitBoard bb)
>{
> BitBoard lsb = bb ^ (bb - 1);
> unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
> return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63
> // to get the bit index, a table lookup is required
>}
>
>
>Because i'am still looking for a fast key-fuction for two single populated move
>bitboards, i tried the obvious approach to use Matt's routine twice. It works
>like firstOneKey(from)*64 + firtsOneKey(to) and therefore maps to a 64*64-range.
>
>int getMoveKey(BitBoard frbit, BitBoard tobit)
>{
> frbit -= 1; tobit -= 1;
> unsigned int foldedfr = ((int) frbit) ^ ((int)(frbit>>32));
> unsigned int foldedto = ((int) tobit) ^ ((int)(tobit>>32));
> return ((foldedfr*0x78291ACF) ^ ((foldedto*0x78291ACF) >> 6)) >> 20;
>}
>
>x86 assembler output of this routine:
>
>00401435 83 C6 FF add esi,0FFFFFFFFh
>00401438 83 D1 FF adc ecx,0FFFFFFFFh
>0040143B 83 C2 FF add edx,0FFFFFFFFh
>0040143E 83 D0 FF adc eax,0FFFFFFFFh
>00401441 33 CE xor ecx,esi
>00401443 33 C2 xor eax,edx
>00401445 69 C9 CF 1A 29 78 imul ecx,ecx,78291ACFh
>0040144B 69 C0 CF 1A 29 78 imul eax,eax,78291ACFh
>00401451 C1 E8 06 shr eax,6
>00401454 33 C1 xor eax,ecx
>00401456 C1 E8 14 shr eax,14h
>
>only 36 Bytes - but two 32-bit multiplications.
>
>Two questions:
>Any ideas to it better, eg. only by one multiplication?
>Why are the unsigned multiplications translated to "imul" by the compiler?
>
>Thanks in advance,
>Gerd
This page took 0.05 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.