# Computer Chess Club Archives

## Messages

### Subject: Re: Matt Taylor's magic de Bruijn Constant

Author: Andrew Williams

Date: 04:43:26 07/13/03

Go up one level in this thread

```Probably a stupid question, but if you have a loop which does just LastOne with
this scheme, wouldn't the table get cached and therefore make it go much faster
than in a test where it's embedded in your code?

AW

On July 13, 2003 at 06:57:48, Bas Hamstra wrote:

>Coicidentally I am working a bit on LastOne too right now. I also have a Athlon
>XP (like you I believe). So BSF isn't too fast. The question is: can you
>outperform it? At first it seems so, in a zillion times loop the following code
>appeared to be nearly a factor 3 faster than the BSF routine below:
>
>const char Tabel[64] =
>		{	0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0,
>			0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0,
>			14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13,
>			26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
>		};
>
>
>__forceinline int LastOne2(unsigned __int64 Bits)
>
>{	unsigned int *p = (unsigned int*) &Bits;
>	unsigned int Short;
>
>	if(p[0])
>	{	Short = p[0];
>		Short = Short ^ (Short-1);
>		Short *= 7*255*255*25;
>		Short >>= 26;
>		return Tabel[Short&63];
>	}
>	Short = p[1];
>	Short = Short ^ (Short-1);
>	Short *= 7*255*255*255;
>	Short >>= 26;
>	return Tabel[Short&63]+32;
>}
>
>VC produces very efficient asm code for it, as far as I can judge. However how
>about the overall performance in the program? It seems to be noticably slower
>than plain BSF, like below:
>
>__forceinline int LastOne(BB M)
>{   __asm
>    {       mov EAX, dword ptr [M]
>            XOR EDX, EDX
>            cmp EAX, 0
>            jnz L2
>            mov EAX, dword ptr [M+4]
>        L2: bsf EAX, EAX
>    }
>}
>
>Both routines have an extra branch to test the first 32 bits, so that is not the
>problem!?
>
>
>Regards,
>Bas.
>
>
>
>
>
>
>
>
>On July 13, 2003 at 06:03:00, Gerd Isenberg wrote:
>
>>Hi all,
>>
>>Matt Taylor's invention of the super magic de Bruijn Constant 0x78291ACF is of
>>course usefull for yet another, but fast and portable _firstOne_ Bitscan routine
>>with only one 32bit multiplication!
>>
>>
>>int firstOneKey(BitBoard bb)
>>{
>>	BitBoard lsb = bb ^ (bb - 1);
>>	unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
>>	return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63
>>       // to get the bit index, a table lookup is required
>>}
>>
>>
>>Because i'am still looking for a fast key-fuction for two single populated move
>>bitboards, i tried the obvious approach to use Matt's routine twice. It works
>>like firstOneKey(from)*64 + firtsOneKey(to) and therefore maps to a 64*64-range.
>>
>>int getMoveKey(BitBoard frbit, BitBoard tobit)
>>{
>>	frbit -= 1; tobit -= 1;
>>	unsigned int foldedfr = ((int) frbit)  ^ ((int)(frbit>>32));
>>	unsigned int foldedto = ((int) tobit)  ^ ((int)(tobit>>32));
>>	return ((foldedfr*0x78291ACF) ^ ((foldedto*0x78291ACF) >> 6)) >> 20;
>>}
>>
>>x86 assembler output of this routine:
>>
>>00401435 83 C6 FF             add         esi,0FFFFFFFFh
>>00401438 83 D1 FF             adc         ecx,0FFFFFFFFh
>>0040143B 83 C2 FF             add         edx,0FFFFFFFFh
>>0040143E 83 D0 FF             adc         eax,0FFFFFFFFh
>>00401441 33 CE                xor         ecx,esi
>>00401443 33 C2                xor         eax,edx
>>00401445 69 C9 CF 1A 29 78    imul        ecx,ecx,78291ACFh
>>0040144B 69 C0 CF 1A 29 78    imul        eax,eax,78291ACFh
>>00401451 C1 E8 06             shr         eax,6
>>00401454 33 C1                xor         eax,ecx
>>00401456 C1 E8 14             shr         eax,14h
>>
>>only 36 Bytes - but two 32-bit multiplications.
>>
>>Two questions:
>>Any ideas to it better, eg. only by one multiplication?
>>Why are the unsigned multiplications translated to "imul" by the compiler?
>>