Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Bitscanning on the Opteron

Author: martin fierz

Date: 03:56:50 12/05/03

Go up one level in this thread


hi russell,

did you also compare this to an __asm implementation of firstone? you can also
find that in the crafty source.

i used to do a table-based bitscan with 16-bit tables, but the __asm is clearly
faster.

cheers
  martin



On December 05, 2003 at 00:22:11, Russell Reagan wrote:

>I found an old piece of code I had written to test the speed of a few
>bitscanning routines (ex. FirstOne() in Crafty). I was curious how different
>they would run on a 64-bit machine, so I sent them to Dr. Hyatt and he was kind
>enough to run them for me on the Opteron. Thanks Bob ;-)
>
>The source code of my program is below. The output of the program is in the
>form: <name of routine> <checksum> <seconds>. Lower times are better.
>
>Here are the results on my 32-bit 2GHz Athlon:
>
>  magic bitscan 2980130816 11.987
>  magic bitscan 2980130816 12.207
>  magic bitscan 2980130816 12.338
>  magic bitscan 2980130816 12.448
>  ------------------------------
>       table 16 2980130816 7.250
>       table 16 2980130816 7.251
>       table 16 2980130816 7.140
>       table 16 2980130816 7.040
>  ------------------------------
>           gerd 2980130816 9.243
>           gerd 2980130816 9.264
>           gerd 2980130816 9.253
>           gerd 2980130816 9.273
>  ------------------------------
>         eugene 2980130816 7.201
>         eugene 2980130816 7.170
>         eugene 2980130816 7.190
>         eugene 2980130816 7.191
>  ------------------------------
>        eugene2 2980130816 7.230
>        eugene2 2980130816 7.261
>        eugene2 2980130816 7.250
>        eugene2 2980130816 6.780
>
>Here are the results from the 1.8ghz Opteron:
>
>  magic bitscan 2980130816 4.040
>  magic bitscan 2980130816 4.350
>  magic bitscan 2980130816 4.350
>  magic bitscan 2980130816 4.350
>  ------------------------------
>       table 16 2980130816 4.600
>       table 16 2980130816 4.820
>       table 16 2980130816 4.240
>       table 16 2980130816 4.340
>  ------------------------------
>           gerd 2980130816 5.790
>           gerd 2980130816 5.780
>           gerd 2980130816 5.440
>           gerd 2980130816 5.440
>  ------------------------------
>         eugene 2980130816 3.990
>         eugene 2980130816 3.960
>         eugene 2980130816 3.770
>         eugene 2980130816 3.470
>  ------------------------------
>        eugene2 2980130816 3.690
>        eugene2 2980130816 3.690
>        eugene2 2980130816 3.680
>        eugene2 2980130816 3.690
>
>Here is the output of Bob's FirstOne() written in assembly, run on the Opteron:
>
>  ------------------------------
>          hyatt 2980130816 5.100
>          hyatt 2980130816 5.090
>          hyatt 2980130816 5.100
>          hyatt 2980130816 5.100
>
>Bob said, "I suspect my approach is actually faster when you use it in some
>real code, as it gets inlined very compactly..." and he is probably right.
>Cramming functions into loops like this isn't always the best way to benchmark
>something.
>
>I used Cygwin gcc to run my test, and Bob used cc on the Opteron. We both used
>-O3.
>
>Here are some short descriptions of the different bitscans:
>
>magic bitscan
>Created by (I think) Matt Taylor, or maybe it was Gerd Isenburg. It uses de
>Bruijn sequences (in the form of a "magic" number). Really cool.
>
>table 16
>I got this one from Dann Corbit, who (I think) got it from Beowulf. It uses a
>16-bit lookup table, which kind of makes me cringe on a machine with a small
>amount of cache, but the Opteron has 1 MB of L2 so maybe we can live with it.
>
>gerd
>I got this one from (I think) Gerd Isenburg, or maybe it was Matt Taylor. It
>uses the same magic number concept, but in a way that is intended to be more
>32-bit friendly, so it is not suprising that it is slower than the magic bitscan
>on 64-bit hardware.
>
>eugene
>I found this in the CCC archives, written by Eugene Nalimov. He said it would
>run in like 9-12 cycles (can't remember exactly) on the Itanium. It uses an
>8-bit lookup table, which is pretty cache friendly.
>
>eugene2
>This is the same as "eugene", but it uses the 16-bit lookup table instead of the
>8-bit table.
>
>hyatt
>This is Bob's FirstOne() written in assembly for the Opteron. I think it uses
>bsrq.
>
>Here is the source code of the program we used:
>
>Or you can download it here: http://home.comcast.net/~r.reagan/bitscan.c
>
>///////////////////////////////////////////////////////////////////////////////
>
>#include <stdio.h>
>#include <time.h>
>
>///////////////////////////////////////////////////////////////////////////////
>
>typedef unsigned long long Bitboard;
>
>///////////////////////////////////////////////////////////////////////////////
>
>// Each bitscan will run max*64 times
>
>const int max = 10000000;
>
>////////////////////////////////////////////////////////////////////////////////
>
>const Bitboard mask[64] = {
>0x0000000000000001ULL, 0x0000000000000002ULL,
>0x0000000000000004ULL, 0x0000000000000008ULL,
>0x0000000000000010ULL, 0x0000000000000020ULL,
>0x0000000000000040ULL, 0x0000000000000080ULL,
>0x0000000000000100ULL, 0x0000000000000200ULL,
>0x0000000000000400ULL, 0x0000000000000800ULL,
>0x0000000000001000ULL, 0x0000000000002000ULL,
>0x0000000000004000ULL, 0x0000000000008000ULL,
>0x0000000000010000ULL, 0x0000000000020000ULL,
>0x0000000000040000ULL, 0x0000000000080000ULL,
>0x0000000000100000ULL, 0x0000000000200000ULL,
>0x0000000000400000ULL, 0x0000000000800000ULL,
>0x0000000001000000ULL, 0x0000000002000000ULL,
>0x0000000004000000ULL, 0x0000000008000000ULL,
>0x0000000010000000ULL, 0x0000000020000000ULL,
>0x0000000040000000ULL, 0x0000000080000000ULL,
>0x0000000100000000ULL, 0x0000000200000000ULL,
>0x0000000400000000ULL, 0x0000000800000000ULL,
>0x0000001000000000ULL, 0x0000002000000000ULL,
>0x0000004000000000ULL, 0x0000008000000000ULL,
>0x0000010000000000ULL, 0x0000020000000000ULL,
>0x0000040000000000ULL, 0x0000080000000000ULL,
>0x0000100000000000ULL, 0x0000200000000000ULL,
>0x0000400000000000ULL, 0x0000800000000000ULL,
>0x0001000000000000ULL, 0x0002000000000000ULL,
>0x0004000000000000ULL, 0x0008000000000000ULL,
>0x0010000000000000ULL, 0x0020000000000000ULL,
>0x0040000000000000ULL, 0x0080000000000000ULL,
>0x0100000000000000ULL, 0x0200000000000000ULL,
>0x0400000000000000ULL, 0x0800000000000000ULL,
>0x1000000000000000ULL, 0x2000000000000000ULL,
>0x4000000000000000ULL, 0x8000000000000000ULL,
>};
>
>
>///////////////////////////////////////////////////////////////////////////////
>
>int Index (int x, int y) {
>    return (y << 3) + x;
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>const Bitboard magic = 0x03f08c5392f756cdULL;
>
>const int table[64] = {
>     0,  1, 12,  2, 13, 22, 17,  3,
>    14, 33, 23, 36, 18, 58, 28,  4,
>    62, 15, 34, 26, 24, 48, 50, 37,
>    19, 55, 59, 52, 29, 44, 39,  5,
>    63, 11, 21, 16, 32, 35, 57, 27,
>    61, 25, 47, 49, 54, 51, 43, 38,
>    10, 20, 31, 56, 60, 46, 53, 42,
>     9, 30, 45, 41,  8, 40,  7,  6,
>};
>
>int MagicBitscan (const Bitboard b) {
>    const Bitboard lsb = b & -b;
>    //bb ^= lsb;
>    return table[(lsb * magic) >> 58];
>}
>
>void TestMagicBitscan () {
>    clock_t start, stop;
>    int count, i;
>    unsigned long total = 0;
>
>    start = clock();
>    for (count = 0; count < max; count++) {
>        for (i = 0; i < 64; i++) {
>            total += MagicBitscan(mask[i]);
>        }
>    }
>    stop = clock();
>    printf("%15s %10u %.3f\n", "magic bitscan", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>int table16[65536];
>
>void InitTable16 () {
>    Bitboard i;
>    for (i = 0; i < 65536; i++)
>        table16[i] = MagicBitscan(i);
>}
>
>int table8[256];
>
>void InitTable8 () {
>    Bitboard i;
>    for (i = 0; i < 256; i++)
>        table8[i] = MagicBitscan(i);
>}
>
>////////////////////////////////////////////////////////////////////////////////
>
>int TableBitscan16 (Bitboard b) {
>    static const Bitboard TwoFullRanks = 65535;
>    static const Bitboard FPMask1 = (Bitboard) TwoFullRanks << 16;
>    static const Bitboard FPMask2 = (Bitboard) TwoFullRanks << 32;
>
>    if (b & TwoFullRanks)
>        return table16[b & TwoFullRanks];
>    if (b & FPMask1)
>        return table16[(b >> 16) & TwoFullRanks] + 16;
>    if (b & FPMask2)
>        return table16[(b >> 32) & TwoFullRanks] + 32;
>    return table16[b >> 48] + 48;
>}
>
>void TestTableBitscan16 () {
>    clock_t start, stop;
>    int count, i;
>    unsigned long total = 0;
>
>    start = clock();
>    for (count = 0; count < max; count++) {
>        for (i = 0; i < 64; i++) {
>            total += TableBitscan16(mask[i]);
>        }
>    }
>    stop = clock();
>    printf("%15s %10u %.3f\n", "table 16", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>///////////////////////////////////////////////////////////////////////////////
>
>///////////////////////////////////////////////////////////////////////////////
>
>const int lsz64_tbl[64] =
>{
>  0,31, 4,33,60,15,12,34,
> 61,25,51,10,56,20,22,35,
> 62,30, 3,54,52,24,42,19,
> 57,29, 2,44,47,28, 1,36,
> 63,32,59, 5, 6,50,55, 7,
> 16,53,13,41, 8,43,46,17,
> 26,58,49,14,11,40, 9,45,
> 21,48,39,23,18,38,37,27,
>};
>
>int GerdBitScan (Bitboard bb)
>{
>    const Bitboard lsb = (bb & -(long long)bb) - 1;
>    //bb ^= lsb--;
>    //lsb--;
>    const unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
>    return lsz64_tbl[foldedLSB * 0x78291ACF >> 26];
>}
>
>void TestGerdBitscan () {
>    clock_t start, stop;
>    int count, i;
>    unsigned long total = 0;
>
>    start = clock();
>    for (count = 0; count < max; count++) {
>        for (i = 0; i < 64; i++) {
>            total += GerdBitScan(mask[i]);
>        }
>    }
>    stop = clock();
>    printf("%15s %10u %.3f\n", "gerd", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>int EugeneBitscan (Bitboard arg) {
>    int result = 0;
>
>    if (arg > 0xFFFFFFFF) {
>    arg >>= 32;
>    result = 32;
>    }
>
>    if (arg > 0xFFFF) {
>    arg >>= 16;
>    result += 16;
>    }
>
>    if (arg > 0xFF) {
>    arg >>= 8;
>    result += 8;
>    }
>
>    return result + table8[arg];
>}
>
>void TestEugeneBitscan () {
>    clock_t start, stop;
>    int count, i;
>    unsigned long total = 0;
>
>    start = clock();
>    for (count = 0; count < max; count++) {
>        for (i = 0; i < 64; i++) {
>            total += EugeneBitscan(mask[i]);
>        }
>    }
>    stop = clock();
>    printf("%15s %10u %.3f\n", "eugene", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>///////////////////////////////////////////////////////////////////////////////
>
>int EugeneBitscan2 (Bitboard arg) {
>    int result = 0;
>
>    if (arg > 0xFFFFFFFF) {
>    arg >>= 32;
>    result = 32;
>    }
>
>    if (arg > 0xFFFF) {
>    arg >>= 16;
>    result += 16;
>    }
>
>    return result + table16[arg];
>}
>
>void TestEugeneBitscan2 () {
>    clock_t start, stop;
>    int count, i;
>    unsigned long total = 0;
>
>    start = clock();
>    for (count = 0; count < max; count++) {
>        for (i = 0; i < 64; i++) {
>            total += EugeneBitscan2(mask[i]);
>        }
>    }
>    stop = clock();
>    printf("%15s %10u %.3f\n", "eugene2", total, (float)(stop - start) /
>CLOCKS_PER_SEC);
>}
>
>////////////////////////////////////////////////////////////////////////////////
>
>int main () {
>
>    InitTable16();
>    InitTable8();
>
>    TestMagicBitscan();
>    TestMagicBitscan();
>    TestMagicBitscan();
>    TestMagicBitscan();
>    printf("  ------------------------------\n");
>
>    TestTableBitscan16();
>    TestTableBitscan16();
>    TestTableBitscan16();
>    TestTableBitscan16();
>    printf("  ------------------------------\n");
>
>    TestGerdBitscan();
>    TestGerdBitscan();
>    TestGerdBitscan();
>    TestGerdBitscan();
>    printf("  ------------------------------\n");
>
>    TestEugeneBitscan();
>    TestEugeneBitscan();
>    TestEugeneBitscan();
>    TestEugeneBitscan();
>    printf("  ------------------------------\n");
>
>    TestEugeneBitscan2();
>    TestEugeneBitscan2();
>    TestEugeneBitscan2();
>    TestEugeneBitscan2();
>
>    return 0;
>}



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.