Author: Robert Hyatt
Date: 17:45:32 02/18/04
On February 18, 2004 at 15:46:11, Dieter Buerssner wrote:

>On February 17, 2004 at 20:47:37, Robert Hyatt wrote:
>
>>On February 17, 2004 at 19:18:46, Dieter Buerssner wrote:
>>
>>>On February 17, 2004 at 16:10:10, Dann Corbit wrote:
>>>
>>>How did you create the source? Just the output (more or less) of the
>>>preprocessor run? It looks like it, but you still have the #include(s), which
>>>would not be in that output.
>>>
>>>Anyway, your assembly seems to strengthen my point. It does not look worse than
>>>the "hand tuned" assembly code from (former versions of) Crafty.
>
>[snipped function, that is more suited to more populated BITBOARDS]
>
>
>>I can at least answer that question. _Every_ time I have compared the asm to
>>any C implementation of those functions, the asm has won _every_ time.
>
>Did you try the C-code I posted? It is clear that your generic PopCnt() is
>slower on 32-bit architectures. I just tried, with Gcc under Linux for Crafty
>19.10, and the C-code I posted was faster - 839 kn/s vs. 830 kn/s for the Crafty
>bench command. Numbers seem to be very reproducible (I did 3 runs for either).
>
>For example with the code as downloaded just 10 minutes ago from your ftp:
>
>Total nodes: 74698851
>Raw nodes per second: 829987
>
>And with the following change (commented out the inline assembly for PopCnt and
>used my C-code):
>
>#if 0
>int static __inline__ PopCnt(BITBOARD word)
>{
>/* r0=result, %1=tmp, %2=first input, %3=second input */
>  long dummy, dummy2;
>
>asm("        xorl    %0, %0"     "\n\t"
>    "        testl   %2, %2"     "\n\t"
>    "        jz      2f"         "\n\t"
>    "1:      leal    -1(%2), %1" "\n\t"
>    "        incl    %0"         "\n\t"
>    "        andl    %1, %2"     "\n\t"
>    "        jnz     1b"         "\n\t"
>    "2:      testl   %3, %3"     "\n\t"
>    "        jz      4f"         "\n\t"
>    "3:      leal    -1(%3), %1" "\n\t"
>    "        incl    %0"         "\n\t"
>    "        andl    %1, %3"     "\n\t"
>    "        jnz     3b"         "\n\t"
>    "4:"                         "\n\t"
>  : "=&q" (dummy), "=&q" (dummy2)
>  : "q" ((int) (word>>32)), "q" ((int) word)
>  : "cc");
>  return (dummy);
>}
>#else
>int static __inline__ PopCnt(BITBOARD a)
>{
>  unsigned long w;
>  int n = 0;
>  w = *(unsigned long *)&a;
>  if (w)
>    do
>      n++;
>    while ((w &= w-1) != 0);
>  w = *(((unsigned long *)&a)+1);
>  if (w)
>    do
>      n++;
>    while ((w &= w-1) != 0);
>  return n;
>}
>#endif
>
>I got:
>
>Total nodes: 74698851
>Raw nodes per second: 839312
>
>
>
>>In
>>Crafty. I'll mention that for PopCnt() my bitboards are sparsely populated when
>>I do a PopCnt(). I have _never_ found any C code that will out-perform the old
>>X86.s code.
>
>But it is rather obvious, that your generic PopCnt will be slower on 32-bit
>architectures - not? In the assembly, you work on 2 32-bit words, while in C you
>work always on 64-bit words. The C-code I suggested will also work on 2 32 bit
>words and will produce essentially your inline assembly. But has the advantage,
>that it will run unchanged very efficiently on many platforms Crafty supports
>(independent of the specific architecture, compiler, ...). On the 64-bit
>platforms, of course the loop on 64-bit words will be faster. BTW. I think my C
>code will often (or even typically) use one register less.
>
>Regards,
>Dieter

How are you testing?
I.e., when I use Intel's compiler with PGO, the inline asm is faster here. Not significantly, but still faster...
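
For reference, here is a minimal sketch of the two approaches being compared above: the sparse "clear the lowest set bit" loop applied to two 32-bit halves, and the same loop run directly on the 64-bit word. It is not the Crafty source. The names popcnt32_sparse, PopCntHalves and PopCnt64 are made up for illustration, and BITBOARD is assumed to be a 64-bit unsigned type. The halves are taken with shifts rather than with the pointer casts in the quoted code, since those casts assume unsigned long is 32 bits wide (true on the 32-bit Linux builds discussed here, not on LP64 systems).

```c
#include <stdint.h>

/* Assumption: stand-in for Crafty's BITBOARD, taken to be 64 bits wide. */
typedef uint64_t BITBOARD;

/* Count set bits in one 32-bit word. Each pass clears the lowest set
   bit, so the loop runs once per set bit and is cheap for sparse words. */
static inline int popcnt32_sparse(uint32_t w)
{
    int n = 0;
    while (w != 0) {
        w &= w - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Two-halves variant (hypothetical name): low word, then high word.
   Shifts avoid depending on byte order or on sizeof(unsigned long). */
static inline int PopCntHalves(BITBOARD a)
{
    return popcnt32_sparse((uint32_t)a)
         + popcnt32_sparse((uint32_t)(a >> 32));
}

/* Generic variant (hypothetical name): the same loop on the full
   64-bit word, which a 32-bit target has to emulate with register pairs. */
static inline int PopCnt64(BITBOARD a)
{
    int n = 0;
    while (a != 0) {
        a &= a - 1;
        n++;
    }
    return n;
}
```

Both functions return the same count for every input. Which one is faster depends on the target, the compiler and its options, which is why the measurements in this thread differ.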
Last modified: Thu, 15 Apr 21 08:11:13 -0700