Author: Dieter Buerssner
Date: 12:46:11 02/18/04
Go up one level in this thread
On February 17, 2004 at 20:47:37, Robert Hyatt wrote: >On February 17, 2004 at 19:18:46, Dieter Buerssner wrote: > >>On February 17, 2004 at 16:10:10, Dann Corbit wrote: >> >>How did you create the source? Just the ouutput (more or less) of the pre >>processor run? It looks like it, but you still have the #include(s), which would >>not be in that output. >> >>Anyway, your assembly seems to strengthen my point. It does not look worse, than >>the "hand tuned" assembly code from (former versions of) Crafty. [snipped function, that is more suited to more populated BITBOARDS] >I can at least answer that question. _Every_ time I have compared the asm to >any C implementation of those functions, the asm has won _every_ time. Did you try tghe C-code I posted? It is clear, that your generic PopCnt() is slower on 32-bit architectures.I just tried, with Gcc under Linux for Crafty 19.10 and the C-code I posted was faster - 839 kn/s vs. 830 kn/s for the Crafty bench command. Numbers seem to be very reproducable (I did 3 runs for either). For example with the code as downloaded just 10 minutes ago from your ftp: Total nodes: 74698851 Raw nodes per second: 829987 And with the following change (commented out the inline assembly for PopCnt and used my C-code): #if 0 int static __inline__ PopCnt(BITBOARD word) { /* r0=result, %1=tmp, %2=first input, %3=second input */ long dummy, dummy2; asm(" xorl %0, %0" "\n\t" " testl %2, %2" "\n\t" " jz 2f" "\n\t" "1: leal -1(%2), %1" "\n\t" " incl %0" "\n\t" " andl %1, %2" "\n\t" " jnz 1b" "\n\t" "2: testl %3, %3" "\n\t" " jz 4f" "\n\t" "3: leal -1(%3), %1" "\n\t" " incl %0" "\n\t" " andl %1, %3" "\n\t" " jnz 3b" "\n\t" "4:" "\n\t" : "=&q" (dummy), "=&q" (dummy2) : "q" ((int) (word>>32)), "q" ((int) word) : "cc"); return (dummy); } #else int static __inline__ PopCnt(BITBOARD a) { unsigned long w; int n = 0; w = *(unsigned long *)&a; if (w) do n++; while ((w &= w-1) != 0); w = *(((unsigned long *)&a)+1); if (w) do n++; while ((w &= w-1) != 0); return n; } #endif I got: Total nodes: 74698851 Raw nodes per second: 839312 >In >Crafty. I'll mention that for PopCnt() my bitboards are sparsely populated when >I do a PopCnt(). I have _never_ found any C code that will out-perform the old >X86.s code. But it is rather obvious, that your generic PopCnt will be slower on 32-bit architectures - not? In the assembly, you work on 2 32-bit words, while in C you work always on 64-bit words. The C-code I suggested will also work on 2 32 bit words and will produce essentially your inline assembly. But has the advantage, that it will run unchanged very efficiently on many platforms Crafty supports (independent of the specific architecture, compiler, ...). On the 64-bit platforms, of course the loop on 64-bit words will be faster. BTW. I think my C code will often (or even typically) use one register less. Regards, Dieter
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.