Author: Andrew Williams
Date: 04:43:26 07/13/03
Go up one level in this thread
Probably a stupid question, but if you have a loop which does just LastOne with this scheme, wouldn't the table get cached and therefore make it go much faster than in a test where it's embedded in your code? AW On July 13, 2003 at 06:57:48, Bas Hamstra wrote: >Coicidentally I am working a bit on LastOne too right now. I also have a Athlon >XP (like you I believe). So BSF isn't too fast. The question is: can you >outperform it? At first it seems so, in a zillion times loop the following code >appeared to be nearly a factor 3 faster than the BSF routine below: > >const char Tabel[64] = > { 0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0, > 0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0, > 14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13, > 26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0 > }; > > >__forceinline int LastOne2(unsigned __int64 Bits) > >{ unsigned int *p = (unsigned int*) &Bits; > unsigned int Short; > > if(p[0]) > { Short = p[0]; > Short = Short ^ (Short-1); > Short *= 7*255*255*25; > Short >>= 26; > return Tabel[Short&63]; > } > Short = p[1]; > Short = Short ^ (Short-1); > Short *= 7*255*255*255; > Short >>= 26; > return Tabel[Short&63]+32; >} > >VC produces very efficient asm code for it, as far as I can judge. However how >about the overall performance in the program? It seems to be noticably slower >than plain BSF, like below: > >__forceinline int LastOne(BB M) >{ __asm > { mov EAX, dword ptr [M] > XOR EDX, EDX > cmp EAX, 0 > jnz L2 > mov EAX, dword ptr [M+4] > add EDX, 32 > L2: bsf EAX, EAX > ADD EAX, EDX > } >} > >Both routines have an extra branch to test the first 32 bits, so that is not the >problem!? > > >Regards, >Bas. > > > > > > > > >On July 13, 2003 at 06:03:00, Gerd Isenberg wrote: > >>Hi all, >> >>Matt Taylor's invention of the super magic de Bruijn Constant 0x78291ACF is of >>course usefull for yet another, but fast and portable _firstOne_ Bitscan routine >>with only one 32bit multiplication! >> >> >>int firstOneKey(BitBoard bb) >>{ >> BitBoard lsb = bb ^ (bb - 1); >> unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32)); >> return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63 >> // to get the bit index, a table lookup is required >>} >> >> >>Because i'am still looking for a fast key-fuction for two single populated move >>bitboards, i tried the obvious approach to use Matt's routine twice. It works >>like firstOneKey(from)*64 + firtsOneKey(to) and therefore maps to a 64*64-range. >> >>int getMoveKey(BitBoard frbit, BitBoard tobit) >>{ >> frbit -= 1; tobit -= 1; >> unsigned int foldedfr = ((int) frbit) ^ ((int)(frbit>>32)); >> unsigned int foldedto = ((int) tobit) ^ ((int)(tobit>>32)); >> return ((foldedfr*0x78291ACF) ^ ((foldedto*0x78291ACF) >> 6)) >> 20; >>} >> >>x86 assembler output of this routine: >> >>00401435 83 C6 FF add esi,0FFFFFFFFh >>00401438 83 D1 FF adc ecx,0FFFFFFFFh >>0040143B 83 C2 FF add edx,0FFFFFFFFh >>0040143E 83 D0 FF adc eax,0FFFFFFFFh >>00401441 33 CE xor ecx,esi >>00401443 33 C2 xor eax,edx >>00401445 69 C9 CF 1A 29 78 imul ecx,ecx,78291ACFh >>0040144B 69 C0 CF 1A 29 78 imul eax,eax,78291ACFh >>00401451 C1 E8 06 shr eax,6 >>00401454 33 C1 xor eax,ecx >>00401456 C1 E8 14 shr eax,14h >> >>only 36 Bytes - but two 32-bit multiplications. >> >>Two questions: >>Any ideas to it better, eg. only by one multiplication? >>Why are the unsigned multiplications translated to "imul" by the compiler? >> >>Thanks in advance, >>Gerd
This page took 0.04 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.