Author: martin fierz
Date: 03:56:50 12/05/03
Go up one level in this thread
hi russell,
did you also compare this to an __asm implementation of firstone? you can also
find that in the crafty source.
i used to do a table-based bitscan with 16-bit tables, but the __asm is clearly
faster.
cheers
martin
On December 05, 2003 at 00:22:11, Russell Reagan wrote:
>I found an old piece of code I had written to test the speed of a few
>bitscanning routines (ex. FirstOne() in Crafty). I was curious how different
>they would run on a 64-bit machine, so I sent them to Dr. Hyatt and he was kind
>enough to run them for me on the Opteron. Thanks Bob ;-)
>
>The source code of my program is below. The output of the program is in the
>form: <name of routine> <checksum> <seconds>. Lower times are better.
>
>Here are the results on my 32-bit 2GHz Athlon:
>
> magic bitscan 2980130816 11.987
> magic bitscan 2980130816 12.207
> magic bitscan 2980130816 12.338
> magic bitscan 2980130816 12.448
> ------------------------------
> table 16 2980130816 7.250
> table 16 2980130816 7.251
> table 16 2980130816 7.140
> table 16 2980130816 7.040
> ------------------------------
> gerd 2980130816 9.243
> gerd 2980130816 9.264
> gerd 2980130816 9.253
> gerd 2980130816 9.273
> ------------------------------
> eugene 2980130816 7.201
> eugene 2980130816 7.170
> eugene 2980130816 7.190
> eugene 2980130816 7.191
> ------------------------------
> eugene2 2980130816 7.230
> eugene2 2980130816 7.261
> eugene2 2980130816 7.250
> eugene2 2980130816 6.780
>
>Here are the results from the 1.8ghz Opteron:
>
> magic bitscan 2980130816 4.040
> magic bitscan 2980130816 4.350
> magic bitscan 2980130816 4.350
> magic bitscan 2980130816 4.350
> ------------------------------
> table 16 2980130816 4.600
> table 16 2980130816 4.820
> table 16 2980130816 4.240
> table 16 2980130816 4.340
> ------------------------------
> gerd 2980130816 5.790
> gerd 2980130816 5.780
> gerd 2980130816 5.440
> gerd 2980130816 5.440
> ------------------------------
> eugene 2980130816 3.990
> eugene 2980130816 3.960
> eugene 2980130816 3.770
> eugene 2980130816 3.470
> ------------------------------
> eugene2 2980130816 3.690
> eugene2 2980130816 3.690
> eugene2 2980130816 3.680
> eugene2 2980130816 3.690
>
>Here is the output of Bob's FirstOne() written in assembly, run on the Opteron:
>
> ------------------------------
> hyatt 2980130816 5.100
> hyatt 2980130816 5.090
> hyatt 2980130816 5.100
> hyatt 2980130816 5.100
>
>Bob said, "I suspect my approach is actually faster when you use it in some
>real code, as it gets inlined very compactly..." and he is probably right.
>Cramming functions into loops like this isn't always the best way to benchmark
>something.
>
>I used Cygwin gcc to run my test, and Bob used cc on the Opteron. We both used
>-O3.
>
>Here are some short descriptions of the different bitscans:
>
>magic bitscan
>Created by (I think) Matt Taylor, or maybe it was Gerd Isenburg. It uses de
>Bruijn sequences (in the form of a "magic" number). Really cool.
>
>table 16
>I got this one from Dann Corbit, who (I think) got it from Beowulf. It uses a
>16-bit lookup table, which kind of makes me cringe on a machine with a small
>amount of cache, but the Opteron has 1 MB of L2 so maybe we can live with it.
>
>gerd
>I got this one from (I think) Gerd Isenburg, or maybe it was Matt Taylor. It
>uses the same magic number concept, but in a way that is intended to be more
>32-bit friendly, so it is not suprising that it is slower than the magic bitscan
>on 64-bit hardware.
>
>eugene
>I found this in the CCC archives, written by Eugene Nalimov. He said it would
>run in like 9-12 cycles (can't remember exactly) on the Itanium. It uses an
>8-bit lookup table, which is pretty cache friendly.
>
>eugene2
>This is the same as "eugene", but it uses the 16-bit lookup table instead of the
>8-bit table.
>
>hyatt
>This is Bob's FirstOne() written in assembly for the Opteron. I think it uses
>bsrq.
>
>Here is the source code of the program we used:
>
>Or you can download it here: http://home.comcast.net/~r.reagan/bitscan.c
>
>///////////////////////////////////////////////////////////////////////////////
>
>#include <stdio.h>
>#include <time.h>
>
>///////////////////////////////////////////////////////////////////////////////
>
>typedef unsigned long long Bitboard;
>
>///////////////////////////////////////////////////////////////////////////////
>
>// Each bitscan will run max*64 times
>
>const int max = 10000000;
>
>////////////////////////////////////////////////////////////////////////////////
>
>const Bitboard mask[64] = {
>0x0000000000000001ULL, 0x0000000000000002ULL,
>0x0000000000000004ULL, 0x0000000000000008ULL,
>0x0000000000000010ULL, 0x0000000000000020ULL,
>0x0000000000000040ULL, 0x0000000000000080ULL,
>0x0000000000000100ULL, 0x0000000000000200ULL,
>0x0000000000000400ULL, 0x0000000000000800ULL,
>0x0000000000001000ULL, 0x0000000000002000ULL,
>0x0000000000004000ULL, 0x0000000000008000ULL,
>0x0000000000010000ULL, 0x0000000000020000ULL,
>0x0000000000040000ULL, 0x0000000000080000ULL,
>0x0000000000100000ULL, 0x0000000000200000ULL,
>0x0000000000400000ULL, 0x0000000000800000ULL,
>0x0000000001000000ULL, 0x0000000002000000ULL,
>0x0000000004000000ULL, 0x0000000008000000ULL,
>0x0000000010000000ULL, 0x0000000020000000ULL,
>0x0000000040000000ULL, 0x0000000080000000ULL,
>0x0000000100000000ULL, 0x0000000200000000ULL,
>0x0000000400000000ULL, 0x0000000800000000ULL,
>0x0000001000000000ULL, 0x0000002000000000ULL,
>0x0000004000000000ULL, 0x0000008000000000ULL,
>0x0000010000000000ULL, 0x0000020000000000ULL,
>0x0000040000000000ULL, 0x0000080000000000ULL,
>0x0000100000000000ULL, 0x0000200000000000ULL,
>0x0000400000000000ULL, 0x0000800000000000ULL,
>0x0001000000000000ULL, 0x0002000000000000ULL,
>0x0004000000000000ULL, 0x0008000000000000ULL,
>0x0010000000000000ULL, 0x0020000000000000ULL,
>0x0040000000000000ULL, 0x0080000000000000ULL,
>0x0100000000000000ULL, 0x0200000000000000ULL,
>0x0400000000000000ULL, 0x0800000000000000ULL,
>0x1000000000000000ULL, 0x2000000000000000ULL,
>0x4000000000000000ULL, 0x8000000000000000ULL,
>};
>
>
>///////////////////////////////////////////////////////////////////////////////
>
>int Index (int x, int y) {
> return (y << 3) + x;
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>const Bitboard magic = 0x03f08c5392f756cdULL;
>
>const int table[64] = {
> 0, 1, 12, 2, 13, 22, 17, 3,
> 14, 33, 23, 36, 18, 58, 28, 4,
> 62, 15, 34, 26, 24, 48, 50, 37,
> 19, 55, 59, 52, 29, 44, 39, 5,
> 63, 11, 21, 16, 32, 35, 57, 27,
> 61, 25, 47, 49, 54, 51, 43, 38,
> 10, 20, 31, 56, 60, 46, 53, 42,
> 9, 30, 45, 41, 8, 40, 7, 6,
>};
>
>int MagicBitscan (const Bitboard b) {
> const Bitboard lsb = b & -b;
> //bb ^= lsb;
> return table[(lsb * magic) >> 58];
>}
>
>void TestMagicBitscan () {
> clock_t start, stop;
> int count, i;
> unsigned long total = 0;
>
> start = clock();
> for (count = 0; count < max; count++) {
> for (i = 0; i < 64; i++) {
> total += MagicBitscan(mask[i]);
> }
> }
> stop = clock();
> printf("%15s %10u %.3f\n", "magic bitscan", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>int table16[65536];
>
>void InitTable16 () {
> Bitboard i;
> for (i = 0; i < 65536; i++)
> table16[i] = MagicBitscan(i);
>}
>
>int table8[256];
>
>void InitTable8 () {
> Bitboard i;
> for (i = 0; i < 256; i++)
> table8[i] = MagicBitscan(i);
>}
>
>////////////////////////////////////////////////////////////////////////////////
>
>int TableBitscan16 (Bitboard b) {
> static const Bitboard TwoFullRanks = 65535;
> static const Bitboard FPMask1 = (Bitboard) TwoFullRanks << 16;
> static const Bitboard FPMask2 = (Bitboard) TwoFullRanks << 32;
>
> if (b & TwoFullRanks)
> return table16[b & TwoFullRanks];
> if (b & FPMask1)
> return table16[(b >> 16) & TwoFullRanks] + 16;
> if (b & FPMask2)
> return table16[(b >> 32) & TwoFullRanks] + 32;
> return table16[b >> 48] + 48;
>}
>
>void TestTableBitscan16 () {
> clock_t start, stop;
> int count, i;
> unsigned long total = 0;
>
> start = clock();
> for (count = 0; count < max; count++) {
> for (i = 0; i < 64; i++) {
> total += TableBitscan16(mask[i]);
> }
> }
> stop = clock();
> printf("%15s %10u %.3f\n", "table 16", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>///////////////////////////////////////////////////////////////////////////////
>
>///////////////////////////////////////////////////////////////////////////////
>
>const int lsz64_tbl[64] =
>{
> 0,31, 4,33,60,15,12,34,
> 61,25,51,10,56,20,22,35,
> 62,30, 3,54,52,24,42,19,
> 57,29, 2,44,47,28, 1,36,
> 63,32,59, 5, 6,50,55, 7,
> 16,53,13,41, 8,43,46,17,
> 26,58,49,14,11,40, 9,45,
> 21,48,39,23,18,38,37,27,
>};
>
>int GerdBitScan (Bitboard bb)
>{
> const Bitboard lsb = (bb & -(long long)bb) - 1;
> //bb ^= lsb--;
> //lsb--;
> const unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
> return lsz64_tbl[foldedLSB * 0x78291ACF >> 26];
>}
>
>void TestGerdBitscan () {
> clock_t start, stop;
> int count, i;
> unsigned long total = 0;
>
> start = clock();
> for (count = 0; count < max; count++) {
> for (i = 0; i < 64; i++) {
> total += GerdBitScan(mask[i]);
> }
> }
> stop = clock();
> printf("%15s %10u %.3f\n", "gerd", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>int EugeneBitscan (Bitboard arg) {
> int result = 0;
>
> if (arg > 0xFFFFFFFF) {
> arg >>= 32;
> result = 32;
> }
>
> if (arg > 0xFFFF) {
> arg >>= 16;
> result += 16;
> }
>
> if (arg > 0xFF) {
> arg >>= 8;
> result += 8;
> }
>
> return result + table8[arg];
>}
>
>void TestEugeneBitscan () {
> clock_t start, stop;
> int count, i;
> unsigned long total = 0;
>
> start = clock();
> for (count = 0; count < max; count++) {
> for (i = 0; i < 64; i++) {
> total += EugeneBitscan(mask[i]);
> }
> }
> stop = clock();
> printf("%15s %10u %.3f\n", "eugene", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>int EugeneBitscan2 (Bitboard arg) {
> int result = 0;
>
> if (arg > 0xFFFFFFFF) {
> arg >>= 32;
> result = 32;
> }
>
> if (arg > 0xFFFF) {
> arg >>= 16;
> result += 16;
> }
>
> return result + table16[arg];
>}
>
>void TestEugeneBitscan2 () {
> clock_t start, stop;
> int count, i;
> unsigned long total = 0;
>
> start = clock();
> for (count = 0; count < max; count++) {
> for (i = 0; i < 64; i++) {
> total += EugeneBitscan2(mask[i]);
> }
> }
> stop = clock();
> printf("%15s %10u %.3f\n", "eugene2", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>////////////////////////////////////////////////////////////////////////////////
>
>int main () {
>
> InitTable16();
> InitTable8();
>
> TestMagicBitscan();
> TestMagicBitscan();
> TestMagicBitscan();
> TestMagicBitscan();
> printf(" ------------------------------\n");
>
> TestTableBitscan16();
> TestTableBitscan16();
> TestTableBitscan16();
> TestTableBitscan16();
> printf(" ------------------------------\n");
>
> TestGerdBitscan();
> TestGerdBitscan();
> TestGerdBitscan();
> TestGerdBitscan();
> printf(" ------------------------------\n");
>
> TestEugeneBitscan();
> TestEugeneBitscan();
> TestEugeneBitscan();
> TestEugeneBitscan();
> printf(" ------------------------------\n");
>
> TestEugeneBitscan2();
> TestEugeneBitscan2();
> TestEugeneBitscan2();
> TestEugeneBitscan2();
>
> return 0;
>}
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.