Author: Matt Taylor
Date: 06:46:27 12/03/02
Go up one level in this thread
On December 02, 2002 at 19:43:15, Gerd Isenberg wrote: >On December 02, 2002 at 18:49:10, Walter Faxon wrote: > >>On December 02, 2002 at 16:48:52, Gerd Isenberg wrote: >> >>>May be i found not the best way to to b&(-b) with mmx, but building the 64-bit >>>two's complement with mmx-dword is not so nice. First you have to do the one's >>>complement by pxor -1, then comparing low dword with -1 and building an >>>conditional overflow, adding 00:01 or 01:01... >>>So using 32-bit registers was the fastest so far, but that may require some >>>additional push/pop. >>> >>>Some times mesured in seconds with this dumb loop (nothing inlined): >>>K7XP2.1+ ~1.8GHz >>> > >Congrats, Walter!! > >10-bit pattern bsf PI2FD btr c LSB_64 >0x0000000011111133 15.3 18.0 19.1 22.8 17.8 >0x1010111010101110 19.7 18.5 19.6 23.4 17.8 >0x1111113300000000 20.6 18.0 19.1 22.8 17.8 > >>> >>>inlined are ~5 seconds faster >>Hi, Gerd. >> >>As long as you're test-comparing bit-search-and-reset codes, I wonder if you >>could please consider also comparing my C code version, posted on CCC with the >>subject "Another hacky method for bitboard bit extraction" on November 17. I >>repeat it below. You can of course make those changes required for proper >>compilation and comparison in your setup. Thanks! >> >>-- Walter >> >>Code follows: >>// --------------------------------------------------------------------------- >> >>typedef unsigned long long u64; // nonstandard >>typedef unsigned long u32; >>typedef unsigned char u8; >> >>extern const u8 LSB_64_table[154]; // bit number table >>#define LSB_64_adj -51 // offset to table base >>#define LSB_64_magic ( (u32)0x01C5FC81 ) // magic constant >> >>// --------------------------------------------------------------------------- >>// LSB_64() -- find, remove, report least-significant bit of 64. >>// Argument 'bb' must be non-null. Method: fold then table lookup. >>// Written by Walter Faxon, June 2002. No copyright. No warranty. >>// >>inline // inline declaration may differ by compiler >>u8 LSB_64( u64* bb ) >> { >> u64 t64; >> u32 t32; >> t64 = *bb - 1; >> *bb &= t64; // omit this line to retain current LSB >> t64 ^= *bb; >> t32 = (u32)t64 ^ (u32)(t64 >> 32); >> t32 ^= LSB_64_magic; >> t32 += t32 >> 16; >> t32 -= t32 >> 8; >> return LSB_64_table [LSB_64_adj + (u8)t32]; >> } >> >>// --------------------------------------------------------------------------- >>// Table reports number of low-order bit as 0, high-order as 63. >>// (Numbering can be reversed by changing this table.) >>// Important: arrange storage so that this table is kept in the cache. >>const u8 LSB_64_table[154] = >> { >>#define __ 0 >> 23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11, >> 1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__, >> __,__,__,13,__,__,__,__,__,__, 5,__,__,61,__,__,__,__,__,__, >> __,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__, >> __,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__, >> 29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36, >> __,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15, >> __,50,48,49,__,51, 7,__,__,63, __,__,__,55 >>#undef __ >> }; >> >>//eof >> >>P.S. You can even avoid the table lookup if you're willing to deal with >>scrambled square indecies in the range 0 to 153. >> >>P.P.S. I feel a little like the unlucky scientist whose results were lost >>because they were only published _four_ times... :) > >Geniously, sorry Walter, that i overlooked your approach. It is clearly the >fastest of these routines. De Bruijn Sequence? May have a close look to the >constant. > >I'm really impressed. > >Regards, >Gerd I measured the time (in clocks) to find all 10 bits on my AthlonMP 1600: 10-bit pattern bsf PI2FD btr LSB/VC7 LSB/Me 0x0000000011111133 659 919 779 699 699 0x1010111010101110 722-761 919 779 699 699 0x1111113300000000 729 919 779 699 699 The first three routines were taken instruction-for-instruction from this thread. The second-to-last is the C version optimized by the VC 7 compiler. The last is a hand-tweaked version I wrote, very basically optimized (lots of register contention and partial stalls). I can post specific assembly listings if someone is -that- interested. I didn't include the other C routine given by Gerd. The bsf instruction varied quite a bit in execution time. (The second pattern varied significantly in execution time over multiple trials, strangely enough...) All the other times are +/- 1-3 clocks, but I would only take the numbers accurate to about 10 clocks. I think it's fairly clear who the winner is, though. Very slick, Walter.
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.