Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question for the Crafty/Compiler experts

Author: Dieter Buerssner

Date: 12:46:11 02/18/04

Go up one level in this thread


On February 17, 2004 at 20:47:37, Robert Hyatt wrote:

>On February 17, 2004 at 19:18:46, Dieter Buerssner wrote:
>
>>On February 17, 2004 at 16:10:10, Dann Corbit wrote:
>>
>>How did you create the source? Just the ouutput (more or less) of the pre
>>processor run? It looks like it, but you still have the #include(s), which would
>>not be in that output.
>>
>>Anyway, your assembly seems to strengthen my point. It does not look worse, than
>>the "hand tuned" assembly code from (former versions of) Crafty.

[snipped function, that is more suited to more populated BITBOARDS]


>I can at least answer that question.  _Every_ time I have compared the asm to
>any C implementation of those functions, the asm has won _every_ time.

Did you try tghe C-code I posted? It is clear, that your generic PopCnt() is
slower on 32-bit architectures.I just tried, with Gcc under Linux for Crafty
19.10 and the C-code I posted was faster - 839 kn/s vs. 830 kn/s for the Crafty
bench command. Numbers seem to be very reproducable (I did 3 runs for either).

For example with the code as downloaded just 10 minutes ago from your ftp:

Total nodes: 74698851
Raw nodes per second: 829987

And with the following change (commented out the inline assembly for PopCnt and
used my C-code):

#if 0
int static __inline__ PopCnt(BITBOARD word)
{
/*  r0=result, %1=tmp, %2=first input, %3=second input */
  long      dummy, dummy2;

asm("        xorl    %0, %0"                    "\n\t"
    "        testl   %2, %2"                    "\n\t"
    "        jz      2f"                        "\n\t"
    "1:      leal    -1(%2), %1"                "\n\t"
    "        incl    %0"                        "\n\t"
    "        andl    %1, %2"                    "\n\t"
    "        jnz     1b"                        "\n\t"
    "2:      testl   %3, %3"                    "\n\t"
    "        jz      4f"                        "\n\t"
    "3:      leal    -1(%3), %1"                "\n\t"
    "        incl    %0"                        "\n\t"
    "        andl    %1, %3"                    "\n\t"
    "        jnz     3b"                        "\n\t"
    "4:"                                        "\n\t"
  : "=&q" (dummy), "=&q" (dummy2)
  : "q" ((int) (word>>32)), "q" ((int) word)
  : "cc");
  return (dummy);
}
#else
int static __inline__ PopCnt(BITBOARD a)
{
  unsigned long w;
  int n = 0;
  w = *(unsigned long *)&a;
  if (w)
    do
      n++;
    while ((w &= w-1) != 0);
  w = *(((unsigned long *)&a)+1);
  if (w)
    do
      n++;
    while ((w &= w-1) != 0);
  return n;
}
#endif

I got:

Total nodes: 74698851
Raw nodes per second: 839312



>In
>Crafty.  I'll mention that for PopCnt() my bitboards are sparsely populated when
>I do a PopCnt().  I have _never_ found any C code that will out-perform the old
>X86.s code.

But it is rather obvious, that your generic PopCnt will be slower on 32-bit
architectures - not? In the assembly, you work on 2 32-bit words, while in C you
work always on 64-bit words. The C-code I suggested will also work on 2 32 bit
words and will produce essentially your inline assembly. But has the advantage,
that it will run unchanged very efficiently on many platforms Crafty supports
(independent of the specific architecture, compiler, ...). On the 64-bit
platforms, of course the loop on 64-bit words will be faster. BTW. I think my C
code will often (or even typically) use one register less.

Regards,
Dieter






This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.