Computer Chess Club Archives



Subject: Re: Simple quad-opteron test

Author: Eugene Nalimov

Date: 13:48:56 12/04/03



On December 04, 2003 at 09:36:44, Robert Hyatt wrote:

>On December 04, 2003 at 00:36:17, Eugene Nalimov wrote:
>
>>On December 03, 2003 at 23:58:52, Robert Hyatt wrote:
>>
>>>On December 03, 2003 at 22:51:43, Sune Fischer wrote:
>>>
>>>>On December 03, 2003 at 16:58:17, Russell Reagan wrote:
>>>>
>>>>>On December 03, 2003 at 16:35:46, Slater Wold wrote:
>>>>>
>>>>>>What's the speedup between 1, 2, and 4 CPUs?
>>>>>
>>>>>After they (Bob and Eugene) did the NUMA stuff for Windows, 4 cpus was like a
>>>>>3.84x speedup.
>>>>>
>>>>>>Any idea on the speedup of going
>>>>>>to 64-bit?
>>>>>
>>>>>Clock for clock, Crafty is about 60% faster on 64-bit hardware. IE a 2GHz
>>>>>Opteron would run Crafty about 60% faster than a 2GHz 32-bit Athlon. Gian Carlo
>>>>>reported that Sjeng ran 70% faster, clock for clock.
>>>>
>>>>The Opteron has lots of improvements other than the 64 bit thing, so it is still
>>>>not exactly known what is contributing where for Crafty.
>>>>
>>>>I suspect Crafty would get a good speedup on a 32-bit Athlon too if it had 1 MB
>>>>cache and more registers, this should somehow be factored out.
>>>>
>>>>Granted that's not easy to do, but if/when we manage to take a handful of
>>>>bitboard programs and compare their speedup to a handful of non-bitboard
>>>>programs, then we might get a better impression of how much the 64-bit thing is
>>>>an issue overall.
>>>>
>>>>It is also possible that the first generation of chess programs and compilers
>>>>won't be optimal. First tests are often 'worst case' scenarios.
>>>>
>>>>-S.
>>>
>>>
>>>The thing that was most revealing was the 32 vs 64 bit stuff.  Things like
>>>FirstOne() are a bit messy on 32 bit machines.  On the Opteron it is dirt
>>>simple:
>>>
>>>int static __inline__ FirstOne(long word) {
>>>        long dummy, dummy2;
>>>        asm (
>>>            "          bsrq    %0, %1"              "\n\t"
>>>            "          jnz     1f"                  "\n\t"
>>>            "          movq    $-1, %1"             "\n\t"
>>>            "1:        movq    $63, %0"             "\n\t"
>>>            "          subq    %1, %0"              "\n\t"
>>>            : "=&r" (dummy), "=&r" (dummy2)
>>>            : "0" ((long) (word))
>>>            : "cc");
>>>        return (dummy);
>>>}
>>>
>>>bsrq is bsr for 64 bits.  I use the "safe" version that does a test to see
>>>whether any bits were set.  If they were, I skip the move of -1 into the
>>>register and leave that register as set by bsrq.  The 32 bit version is more
>>>than twice as long.  I will get rid of the jump with a cmovq later, but I just
>>>didn't feel like fooling with it after I initially got it working.
>>
>>Here conditional move will be slower.
>>
>>Thanks,
>>Eugene
>
>
>OK, I'll bite. What's the explanation?  :)

There are several reasons why CMOV on x86 does not give the performance boost
everyone is expecting. I'll list all of them here, though some may not be
applicable to your particular case:

(1) Good branch predictors in modern CPUs, especially now that you have inlined
your small function and have a separate branch for each former call site. With
high probability the bit pattern you are testing will be exactly the same as it
was the previous time at that particular address, meaning that the branch will
be easy to predict.
(2) Extra data dependencies.
(3) There is no "CMOV reg, immed" on x86. That means you have to load the
immediate into another register first, causing extra data dependencies and using
an extra register (the latter is especially noticeable on x86, where you have
only 7 available registers).
(4) There is no 8-bit CMOV instruction.
(5) CMOV is not a fully conditional instruction. "CMOV reg, mem" accesses the
memory even when the condition is false -- meaning that you can get an access
violation (i.e. you cannot replace an arbitrary branch/move pair with a
conditional move).
(6) With CMOV you are increasing the amount of executed code, especially in the
case where you have profile-guided optimizations. Let's look at your example.
You currently have

   jnz L         // 2 bytes
   mov ecx, 63   // 5 bytes
L:
    ...

(7 bytes total).

With profile-guided optimizations you'll have

    jz Not_L     // 6 bytes
L:
    ...

// Dead code segment
    ...
Not_L:
    mov ecx, 63
    jmp L
    ...


You are executing slightly less code in the normal case (6 bytes instead of 7).
Moreover, on the Opteron/Athlon64 a branch that was never taken is executed as a
NOP, not polluting the branch predictor tables.

Now let's assume that you used conditional move instead:

    mov ebx, -1    // 5 bytes
    cmov ecx, ebx  // 3 bytes

I.e. you are executing 8 bytes instead of 6 or 7. When you start using CMOVs
everywhere those bytes rapidly add up, resulting in worse L1 I-cache
utilization.

Thanks,
Eugene





Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.