Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question for the Crafty/Compiler experts

Author: Robert Hyatt

Date: 17:45:32 02/18/04

Go up one level in this thread


On February 18, 2004 at 15:46:11, Dieter Buerssner wrote:

>On February 17, 2004 at 20:47:37, Robert Hyatt wrote:
>
>>On February 17, 2004 at 19:18:46, Dieter Buerssner wrote:
>>
>>>On February 17, 2004 at 16:10:10, Dann Corbit wrote:
>>>
>>>How did you create the source? Just the ouutput (more or less) of the pre
>>>processor run? It looks like it, but you still have the #include(s), which would
>>>not be in that output.
>>>
>>>Anyway, your assembly seems to strengthen my point. It does not look worse, than
>>>the "hand tuned" assembly code from (former versions of) Crafty.
>
>[snipped function, that is more suited to more populated BITBOARDS]
>
>
>>I can at least answer that question.  _Every_ time I have compared the asm to
>>any C implementation of those functions, the asm has won _every_ time.
>
>Did you try tghe C-code I posted? It is clear, that your generic PopCnt() is
>slower on 32-bit architectures.I just tried, with Gcc under Linux for Crafty
>19.10 and the C-code I posted was faster - 839 kn/s vs. 830 kn/s for the Crafty
>bench command. Numbers seem to be very reproducable (I did 3 runs for either).
>
>For example with the code as downloaded just 10 minutes ago from your ftp:
>
>Total nodes: 74698851
>Raw nodes per second: 829987
>
>And with the following change (commented out the inline assembly for PopCnt and
>used my C-code):
>
>#if 0
>int static __inline__ PopCnt(BITBOARD word)
>{
>/*  r0=result, %1=tmp, %2=first input, %3=second input */
>  long      dummy, dummy2;
>
>asm("        xorl    %0, %0"                    "\n\t"
>    "        testl   %2, %2"                    "\n\t"
>    "        jz      2f"                        "\n\t"
>    "1:      leal    -1(%2), %1"                "\n\t"
>    "        incl    %0"                        "\n\t"
>    "        andl    %1, %2"                    "\n\t"
>    "        jnz     1b"                        "\n\t"
>    "2:      testl   %3, %3"                    "\n\t"
>    "        jz      4f"                        "\n\t"
>    "3:      leal    -1(%3), %1"                "\n\t"
>    "        incl    %0"                        "\n\t"
>    "        andl    %1, %3"                    "\n\t"
>    "        jnz     3b"                        "\n\t"
>    "4:"                                        "\n\t"
>  : "=&q" (dummy), "=&q" (dummy2)
>  : "q" ((int) (word>>32)), "q" ((int) word)
>  : "cc");
>  return (dummy);
>}
>#else
>int static __inline__ PopCnt(BITBOARD a)
>{
>  unsigned long w;
>  int n = 0;
>  w = *(unsigned long *)&a;
>  if (w)
>    do
>      n++;
>    while ((w &= w-1) != 0);
>  w = *(((unsigned long *)&a)+1);
>  if (w)
>    do
>      n++;
>    while ((w &= w-1) != 0);
>  return n;
>}
>#endif
>
>I got:
>
>Total nodes: 74698851
>Raw nodes per second: 839312
>
>
>
>>In
>>Crafty.  I'll mention that for PopCnt() my bitboards are sparsely populated when
>>I do a PopCnt().  I have _never_ found any C code that will out-perform the old
>>X86.s code.
>
>But it is rather obvious, that your generic PopCnt will be slower on 32-bit
>architectures - not? In the assembly, you work on 2 32-bit words, while in C you
>work always on 64-bit words. The C-code I suggested will also work on 2 32 bit
>words and will produce essentially your inline assembly. But has the advantage,
>that it will run unchanged very efficiently on many platforms Crafty supports
>(independent of the specific architecture, compiler, ...). On the 64-bit
>platforms, of course the loop on 64-bit words will be faster. BTW. I think my C
>code will often (or even typically) use one register less.
>
>Regards,
>Dieter


How are you testing?  IE when I use intel's compiler, with PGO, the inline is
faster here.  Not significantly, but still faster...




This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.