Computer Chess Club Archives



Subject: Re: Simple quad-opteron test

Author: Robert Hyatt

Date: 09:27:00 12/04/03



On December 04, 2003 at 12:14:23, Gerd Isenberg wrote:

>On December 04, 2003 at 12:06:35, Gerd Isenberg wrote:
>
>>On December 04, 2003 at 11:26:59, Robert Hyatt wrote:
>>
>>>On December 04, 2003 at 10:38:42, Gerd Isenberg wrote:
>>>
>>>>On December 04, 2003 at 09:36:44, Robert Hyatt wrote:
>>>>
>>>>>On December 04, 2003 at 00:36:17, Eugene Nalimov wrote:
>>>>>
>>>>>>On December 03, 2003 at 23:58:52, Robert Hyatt wrote:
>>>>>>
>>>>>>>On December 03, 2003 at 22:51:43, Sune Fischer wrote:
>>>>>>>
>>>>>>>>On December 03, 2003 at 16:58:17, Russell Reagan wrote:
>>>>>>>>
>>>>>>>>>On December 03, 2003 at 16:35:46, Slater Wold wrote:
>>>>>>>>>
>>>>>>>>>>What's the speedup between 1, 2, and 4 CPUs?
>>>>>>>>>
>>>>>>>>>After they (Bob and Eugene) did the NUMA stuff for Windows, 4 cpus was like a
>>>>>>>>>3.84x speedup.
>>>>>>>>>
>>>>>>>>>>Any idea on the speedup of going
>>>>>>>>>>to 64-bit?
>>>>>>>>>
>>>>>>>>>Clock for clock, Crafty is about 60% faster on 64-bit hardware, i.e. a 2GHz
>>>>>>>>>Opteron would run Crafty about 60% faster than a 2GHz 32-bit Athlon. Gian Carlo
>>>>>>>>>reported that Sjeng ran 70% faster, clock for clock.
>>>>>>>>
>>>>>>>>The Opteron has lots of improvements other than the 64 bit thing, so it is still
>>>>>>>>not exactly known what is contributing where for Crafty.
>>>>>>>>
>>>>>>>>I suspect Crafty would get a good speedup on a 32-bit Athlon too if it had 1 MB
>>>>>>>>cache and more registers, this should somehow be factored out.
>>>>>>>>
>>>>>>>>Granted that's not easy to do, but if/when we manage to take a handful of
>>>>>>>>bitboard programs and compare their speedup to a handful of non-bitboard
>>>>>>>>programs, then we might get a better impression of how much the 64-bit thing
>>>>>>>>matters overall.
>>>>>>>>
>>>>>>>>It is also possible that the first generation chess programs and compilers won't
>>>>>>>>be optimal. First tests are often 'worst case' scenarios.
>>>>>>>>
>>>>>>>>-S.
>>>>>>>
>>>>>>>
>>>>>>>The thing that was most revealing was the 32 vs 64 bit stuff.  Things like
>>>>>>>FirstOne() are a bit messy on 32 bit machines.  On the Opteron it is dirt
>>>>>>>simple:
>>>>>>>
>>>>>>>int static __inline__ FirstOne(long word) {
>>>>>>>        long dummy, dummy2;
>>>>>>>        asm (
>>>>>>>            "          bsrq    %0, %1"              "\n\t"
>>>>>>>            "          jnz     1f"                  "\n\t"
>>>>>>>            "          movq    $-1, %1"             "\n\t"
>>>>>>>            "1:        movq    $63, %0"             "\n\t"
>>>>>>>            "          subq    %1, %0"              "\n\t"
>>>>>>>            : "=r&" (dummy), "=r&" (dummy2)
>>>>>>>            : "0" ((long) (word))
>>>>>>>            : "cc");
>>>>>>>        return (dummy);
>>>>>>>}
>>>>>>>
>>>>>>>bsrq is bsr for 64 bits.  I use the "safe" version that tests whether any
>>>>>>>bit was set.  If one was, the jnz skips the move of -1 to the result register
>>>>>>>and leaves that register as set by bsrq.  The 32 bit version is more than
>>>>>>>twice as long.  I will get rid of the jump with a cmovq later, but I just
>>>>>>>didn't feel like fooling with it after I initially got it working.
>>>>>>
>>>>>>Here conditional move will be slower.
>>>>>>
>>>>>>Thanks,
>>>>>>Eugene
>>>>>
>>>>>
>>>>>OK, I'll bite. What's the explanation?  :)
>>>>
>>>>~100% correct branch prediction, because it's most often or always used with
>>>>non-empty sets? I guess CMOV with today's branch prediction heuristics only pays
>>>>off if conditions are really random.
>>>
>>>
>>>OK.  correct.  I don't call these with zero masks intentionally, loops that
>>>call to extract bits are generally in a while(mask) { x=FirstOne(mask); etc }
>>>type loop.
>>>
>>>However, what makes CMOV slower than a correctly predicted branch, other than
>>>the register dependency with the next instruction which needs the result of
>>>the CMOV?
>>
>>higher latency of 4 cycles, one instruction less if correctly predicted.
>
>Oops, wrong: reg/reg is one cycle too, 4 is reg/mem!


OK.  If I interpret that correctly, if I occasionally call this with a
zero value, the cmov might be better due to avoiding the mis-predicted
branch.  But if it is always non-zero, then the jnz is better.
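To make that trade-off concrete, here is a rough C sketch of the two variants. It is not Crafty's code: the names first_one_branch, first_one_select and count_bits_via_loop are made up for illustration, and it assumes GCC or Clang, since it uses the __builtin_clzll builtin in place of the inline bsrq. For a non-zero 64-bit word, 63 - bsr(word) equals the leading-zero count, so both functions return the same values as the asm version, including 64 for an empty word. Whether the compiler actually emits a cmov for the second variant is up to the compiler and flags, so check the generated assembly before timing anything.

```c
#include <stdint.h>

/* Branching variant, analogous to the bsrq/jnz sequence: the zero test is a
   real branch, which is cheap when the word is almost never zero. */
static inline int first_one_branch(uint64_t word) {
    if (word == 0)
        return 64;                  /* same result as 63 - (-1) in the asm */
    return __builtin_clzll(word);   /* equals 63 - bsr(word) for word != 0 */
}

/* Select variant: compute unconditionally, then pick the result.
   word | 1 keeps the builtin's argument non-zero (clz of 0 is undefined)
   and does not change the answer for any non-zero word. */
static inline int first_one_select(uint64_t word) {
    int n = __builtin_clzll(word | 1);
    return word ? n : 64;           /* may compile to cmov; not guaranteed */
}

/* The usual extraction loop: the mask is never zero inside the body, so the
   zero test in first_one_branch is always predicted the same way. */
static int count_bits_via_loop(uint64_t mask) {
    int n = 0;
    while (mask) {
        int idx = first_one_branch(mask);   /* 0 = most significant bit */
        mask &= ~(1ULL << (63 - idx));      /* clear the bit just found */
        n++;
    }
    return n;
}
```

In a while(mask) loop like the last function, the branch version never sees a zero argument, which is the case where the jnz should win. The select version only has a chance of paying off for callers that pass zero unpredictably.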


>
>>
>>	bsr    rdx, rax
>>	jnz    1f		; 1 cycle
>>	mov    edx, -1		; skipped
>>1:	mov    eax, 63		; 1 cycle
>>	sub    eax, edx		; 1 cycle
>>
>>
>>	bsr    rdx, rax
>>	mov    eax, -1		; 1 cycle
>>	cmovz  edx, eax		; 1 cycle
>>	mov    eax, 63		; 1 cycle
>>	sub    eax, edx		; 1 cycle




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.