Author: Filip Tvrzsky
Date: 16:31:15 04/21/03
Go up one level in this thread
On April 19, 2003 at 12:34:22, Dieter Buerssner wrote: >On April 19, 2003 at 05:29:13, Gerd Isenberg wrote: > >>Yes Dieter, i tried a lot. > >Gerd, I have no doubt about this. I only questioned the 2 specific routines (the >loops on the 2 32 bit parts, once in assembler, once in C). I questioned it >only, because I looked at the generated assembler of the C-Code, which was >practically identical, to the assembly you showed. > >Anyway, I made a small test program. dieter() is the C-Code I had shown, gerd() >your assembly (I wrote the same in GCC-style, too), and crafty() the routine Uri >showed in this thread. In my test, gerd() was not faster than dieter() (dieter() >was actually slightly faster than gerd() using MSVC, and the same using gcc). As >should be expected, with this specific test, crafty() was a bit slower (much >more with gcc than with MSVC). I used -O2 for both compilers. No doubt, one can >argue about my testing procedure. I actually made it out of fun only, not to do >some scientific experiments. I also used a union for bitboard, and changed for >this the functions in the thread slightly. I just did not want to try which >tricks would work best to convince the compiler to extract the 2 32-bit parts >efficiently. > >Regards, >Dieter > >#ifdef __GNUC__ >#define MYINLINE static __inline__ >typedef unsigned long long BITBOARD; >#else >#define MYINLINE static __forceinline >typedef unsigned __int64 BITBOARD; >#endif > >/* Note, this is not strictly portable, even not the pure C code*/ >typedef union >{ > BITBOARD b; > unsigned long w[2]; >} BU; > > >MYINLINE int crafty(register BU a) { > register int c=0; > > while(a.b) { > c++; > a.b &= a.b - 1; > } > return(c); >} > >MYINLINE int dieter(BU a) >{ > unsigned long w; > int n = 0; > w = a.w[0]; > if (w) > do > { > n++; > } > while ((w &= w-1) != 0); > w = a.w[1]; > if (w) > do > { > n++; > } > while ((w &= w-1) != 0); > return n; >} > >#ifdef __GNUC__ > >MYINLINE int gerd(BU a) >{ > int tmp, tmp2, n; > __asm__ __volatile__( > " movl %3, %1 > xorl %0, %0 > testl %1, %1 > je 1f > 0: incl %0 > leal -1(%1), %2 > andl %2, %1 > jne 0b > 1: movl %4, %1 > testl %1, %1 > je 3f > 2: incl %0 > leal -1(%1), %2 > andl %2, %1 > jne 2b > 3:" > : "=r&" (n), "=r&" (tmp), "=r&" (tmp2) > : "g" (a.w[0]), "g" (a.w[1]) > : "cc" /* Flags "condition code" changed */); > return n; >} > >#else > >MYINLINE int gerd (BU bb) >{ > __asm > { > mov ecx, dword ptr bb > xor eax, eax > test ecx, ecx > jz l1 > l0: lea edx, [ecx-1] > inc eax > and ecx, edx > jnz l0 > l1: mov ecx, dword ptr bb+4 > test ecx, ecx > jz l3 > l2: lea edx, [ecx-1] > inc eax > and ecx, edx > jnz l2 > l3: > } >} >#endif > >#include <stdlib.h> >#include <stdio.h> >#include <time.h> > >#define NLOOPS 10000000L >#define NBB 100 /* Number of different bitboards */ >#define NBITS 4 /* That many bits set randomly, can be less */ > >#define PRNG rand >#define MY_RAND_MAX RAND_MAX >unsigned long rand_range(unsigned long range) >{ > unsigned long rmax, r, d; > /* find the largest number rmax <= MY_RAND_MAX, for which > (rmax+1) % range == 0. > All returns from rand() > rmax will be skipped, to guarantee > equal probability for all return values. */ > d = (MY_RAND_MAX+1U-range) / range + 1; > rmax = d * range - 1; /* -1 to avoid "overflow to zero" */ > do > r = PRNG(); > while (r > rmax); > return r/d; >} > >BU bbs[NBB]; > >/* Define functions in "wrong" order, to not let the compiler > do stupid inlining, that can make the code slower */ >unsigned long foo(void); >unsigned long bar(void); > >int main(void) >{ > int i, j; > clock_t c; > /* Set some random bitboards */ > for (i=0; i<NBB; i++) > for (j=0; j<NBITS; j++) > bbs[i].b |= (BITBOARD)1 << rand_range(64); > c = clock(); > /* Print number of counted bits, so we get an indicication, > when something goes wrong. My easily overflow. */ > printf("res %lu\n", bar()); > c = clock()-c; > printf("used %.3f s\n", (double)c/CLOCKS_PER_SEC); > return 0; >} > >unsigned long bar(void) >{ > long i; > unsigned long n=0; > for (i=0; i<NLOOPS; i++) > n += foo(); > return n; >} > >unsigned long foo(void) >{ > int i; > unsigned long n=0; > for (i=0; i<NBB; i++) > n += crafty(bbs[i]); > return n; >} It is very strange, but my test results are rather different! Compiler is gcc 3.2, CPU is Duron 600@900. 1) With only -O3 optimization: res 3880000000 gerd() 32,520 s crafty() 47,950 s dieter() 47,010 s 2) With following optimization settings set (-O3 and -fsave-memoized -fno-exceptions -fmerge-all-constants -save-temps -march=athlon -mcpu=athlon -mmmx -funroll-loops -fomit-frame-pointer) which is my favorite now: res 3880000000 gerd() 18,350 s crafty() 37,290 s dieter() 41,750 s Maybe, I should little bit investigate compiler assembler output ... Filip
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.