Author: Andrew Williams
Date: 04:43:26 07/13/03
Go up one level in this thread
Probably a stupid question, but if you have a loop which does just LastOne with
this scheme, wouldn't the table get cached and therefore make it go much faster
than in a test where it's embedded in your code?
AW
On July 13, 2003 at 06:57:48, Bas Hamstra wrote:
>Coicidentally I am working a bit on LastOne too right now. I also have a Athlon
>XP (like you I believe). So BSF isn't too fast. The question is: can you
>outperform it? At first it seems so, in a zillion times loop the following code
>appeared to be nearly a factor 3 faster than the BSF routine below:
>
>const char Tabel[64] =
> { 0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0,
> 0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0,
> 14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13,
> 26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
> };
>
>
>__forceinline int LastOne2(unsigned __int64 Bits)
>
>{ unsigned int *p = (unsigned int*) &Bits;
> unsigned int Short;
>
> if(p[0])
> { Short = p[0];
> Short = Short ^ (Short-1);
> Short *= 7*255*255*25;
> Short >>= 26;
> return Tabel[Short&63];
> }
> Short = p[1];
> Short = Short ^ (Short-1);
> Short *= 7*255*255*255;
> Short >>= 26;
> return Tabel[Short&63]+32;
>}
>
>VC produces very efficient asm code for it, as far as I can judge. However how
>about the overall performance in the program? It seems to be noticably slower
>than plain BSF, like below:
>
>__forceinline int LastOne(BB M)
>{ __asm
> { mov EAX, dword ptr [M]
> XOR EDX, EDX
> cmp EAX, 0
> jnz L2
> mov EAX, dword ptr [M+4]
> add EDX, 32
> L2: bsf EAX, EAX
> ADD EAX, EDX
> }
>}
>
>Both routines have an extra branch to test the first 32 bits, so that is not the
>problem!?
>
>
>Regards,
>Bas.
>
>
>
>
>
>
>
>
>On July 13, 2003 at 06:03:00, Gerd Isenberg wrote:
>
>>Hi all,
>>
>>Matt Taylor's invention of the super magic de Bruijn Constant 0x78291ACF is of
>>course usefull for yet another, but fast and portable _firstOne_ Bitscan routine
>>with only one 32bit multiplication!
>>
>>
>>int firstOneKey(BitBoard bb)
>>{
>> BitBoard lsb = bb ^ (bb - 1);
>> unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
>> return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63
>> // to get the bit index, a table lookup is required
>>}
>>
>>
>>Because i'am still looking for a fast key-fuction for two single populated move
>>bitboards, i tried the obvious approach to use Matt's routine twice. It works
>>like firstOneKey(from)*64 + firtsOneKey(to) and therefore maps to a 64*64-range.
>>
>>int getMoveKey(BitBoard frbit, BitBoard tobit)
>>{
>> frbit -= 1; tobit -= 1;
>> unsigned int foldedfr = ((int) frbit) ^ ((int)(frbit>>32));
>> unsigned int foldedto = ((int) tobit) ^ ((int)(tobit>>32));
>> return ((foldedfr*0x78291ACF) ^ ((foldedto*0x78291ACF) >> 6)) >> 20;
>>}
>>
>>x86 assembler output of this routine:
>>
>>00401435 83 C6 FF add esi,0FFFFFFFFh
>>00401438 83 D1 FF adc ecx,0FFFFFFFFh
>>0040143B 83 C2 FF add edx,0FFFFFFFFh
>>0040143E 83 D0 FF adc eax,0FFFFFFFFh
>>00401441 33 CE xor ecx,esi
>>00401443 33 C2 xor eax,edx
>>00401445 69 C9 CF 1A 29 78 imul ecx,ecx,78291ACFh
>>0040144B 69 C0 CF 1A 29 78 imul eax,eax,78291ACFh
>>00401451 C1 E8 06 shr eax,6
>>00401454 33 C1 xor eax,ecx
>>00401456 C1 E8 14 shr eax,14h
>>
>>only 36 Bytes - but two 32-bit multiplications.
>>
>>Two questions:
>>Any ideas to it better, eg. only by one multiplication?
>>Why are the unsigned multiplications translated to "imul" by the compiler?
>>
>>Thanks in advance,
>>Gerd
This page took 0.03 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.