Author: Gerd Isenberg
Date: 09:14:23 12/04/03
On December 04, 2003 at 12:06:35, Gerd Isenberg wrote:
>On December 04, 2003 at 11:26:59, Robert Hyatt wrote:
>
>>On December 04, 2003 at 10:38:42, Gerd Isenberg wrote:
>>
>>>On December 04, 2003 at 09:36:44, Robert Hyatt wrote:
>>>
>>>>On December 04, 2003 at 00:36:17, Eugene Nalimov wrote:
>>>>
>>>>>On December 03, 2003 at 23:58:52, Robert Hyatt wrote:
>>>>>
>>>>>>On December 03, 2003 at 22:51:43, Sune Fischer wrote:
>>>>>>
>>>>>>>On December 03, 2003 at 16:58:17, Russell Reagan wrote:
>>>>>>>
>>>>>>>>On December 03, 2003 at 16:35:46, Slater Wold wrote:
>>>>>>>>
>>>>>>>>>What's the speedup between 1, 2, and 4 CPUs?
>>>>>>>>
>>>>>>>>After they (Bob and Eugene) did the NUMA stuff for Windows, 4 cpus was like a
>>>>>>>>3.84x speedup.
>>>>>>>>
>>>>>>>>>Any idea on the speedup of going
>>>>>>>>>to 64-bit?
>>>>>>>>
>>>>>>>>Clock for clock, Crafty is about 60% faster on 64-bit hardware. I.e., a 2GHz
>>>>>>>>Opteron would run Crafty about 60% faster than a 2GHz 32-bit Athlon. Gian Carlo
>>>>>>>>reported that Sjeng ran 70% faster, clock for clock.
>>>>>>>
>>>>>>>The Opteron has lots of improvements other than the 64 bit thing, so it is still
>>>>>>>not exactly known what is contributing where for Crafty.
>>>>>>>
>>>>>>>I suspect Crafty would get a good speedup on a 32-bit Athlon too if it had 1 MB
>>>>>>>cache and more registers, this should somehow be factored out.
>>>>>>>
>>>>>>>Granted that's not easy to do, but if/when we manage to take a handful of
>>>>>>>bitboard programs and compare their speedup to a handful of non-bitboard
>>>>>>>programs, then we might get a better impression of how much the 64-bit thing
>>>>>>>matters overall.
>>>>>>>
>>>>>>>It is also possible that the first generation chess programs and compilers won't
>>>>>>>be optimal. First tests are often 'worst case' scenarios.
>>>>>>>
>>>>>>>-S.
>>>>>>
>>>>>>
>>>>>>The thing that was most revealing was the 32 vs 64 bit stuff. Things like
>>>>>>FirstOne() are a bit messy on 32 bit machines. On the Opteron it is dirt
>>>>>>simple:
>>>>>>
>>>>>>static __inline__ int FirstOne(long word) {
>>>>>>  long dummy, dummy2;
>>>>>>  asm (
>>>>>>    "    bsrq  %0, %1"   "\n\t"
>>>>>>    "    jnz   1f"       "\n\t"
>>>>>>    "    movq  $-1, %1"  "\n\t"
>>>>>>    "1:  movq  $63, %0"  "\n\t"
>>>>>>    "    subq  %1, %0"   "\n\t"
>>>>>>    : "=&r" (dummy), "=&r" (dummy2)
>>>>>>    : "0" ((long) (word))
>>>>>>    : "cc");
>>>>>>  return (dummy);
>>>>>>}
>>>>>>
>>>>>>bsrq is bsr for 64 bits. I use the "safe" version that does a test to see
>>>>>>if no bits were set. If at least one bit was set, I skip the move of -1 to
>>>>>>a register and leave that register as set by bsrq. The 32 bit version is
>>>>>>more than twice as long. I will get rid of the jump with a cmovq later, but
>>>>>>I just didn't feel like fooling with it after I initially got it working.
>>>>>
>>>>>Here conditional move will be slower.
>>>>>
>>>>>Thanks,
>>>>>Eugene
>>>>
>>>>
>>>>OK, I'll bite. What's the explanation? :)
>>>
>>>~100% correct branch prediction, because it's most often or always used with
>>>non-empty sets? I guess CMOV with today's branch prediction heuristics only pays
>>>off if conditions are really random.
>>
>>
>>OK. correct. I don't call these with zero masks intentionally, loops that
>>call to extract bits are generally in a while(mask) { x=FirstOne(mask); etc }
>>type loop.
>>
>>However, what makes CMOV slower than a correctly predicted branch, other than
>>the register dependency with the next instruction which needs the result of
>>the CMOV?
>
>Higher latency of 4 cycles, and one instruction less if the branch is correctly predicted.
Oops, wrong: reg/reg CMOV is one cycle too; 4 cycles is reg/mem!
>
> bsr rdx, rax
> jnz 1f ; 1 cycle
> mov edx, -1 ; skipped
>1: mov eax, 63 ; 1 cycle
> sub eax, edx ; 1 cycle
>
>
> bsr rdx, rax
> mov eax, -1 ; 1 cycle
> cmovz edx, eax ; 1 cycle
> mov eax, 63 ; 1 cycle
> sub eax, edx ; 1 cycle
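
For comparison, here is a rough portable C sketch of the two sequences above. It is only an illustration, not Crafty's code: it assumes GCC or Clang (for __builtin_clzll), and the names first_one_branch, first_one_select, count_bits_by_extraction and first_one_selfcheck are made up for this example. Both variants return 63 minus the index of the highest set bit, and 64 for an empty word, like the FirstOne() quoted earlier. Whether the compiler really emits a branch for the first and a CMOV for the second is up to the compiler and its flags.

```c
#include <assert.h>
#include <stdint.h>

/* Branching variant, analogous to the bsr/jnz sequence: the empty-word
   test is a branch that is almost never taken inside while (mask) loops. */
static inline int first_one_branch(uint64_t word) {
    if (word == 0)
        return 64;                      /* 63 - (-1) */
    return __builtin_clzll(word);       /* equals 63 - bsr(word) */
}

/* Branch-free variant, analogous to the bsr/cmovz sequence: always compute
   an index, then select -1 when the word was empty. */
static inline int first_one_select(uint64_t word) {
    /* word | 1 is never zero, so the builtin is always well defined */
    int idx = 63 - __builtin_clzll(word | 1);
    idx = word ? idx : -1;              /* a compiler may turn this into CMOV */
    return 63 - idx;
}

/* Typical caller: extract bits one at a time until the mask is empty.
   FirstOne is never handed a zero word here, which is why the branch
   in first_one_branch predicts so well. */
static int count_bits_by_extraction(uint64_t mask) {
    int n = 0;
    while (mask) {
        int sq = first_one_branch(mask);
        mask &= ~(UINT64_C(1) << (63 - sq));  /* clear the bit just found */
        n++;
    }
    return n;
}

/* Small self-check: both variants must agree, including on the empty word. */
static void first_one_selfcheck(void) {
    const uint64_t samples[] = {
        0, 1, 2, UINT64_C(0x8000000000000000), UINT64_C(0x00FF00FF00FF00FF)
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(first_one_branch(samples[i]) == first_one_select(samples[i]));
}
```

Calling first_one_selfcheck() once at startup checks that both variants agree on these samples. To see which one is faster on a given machine, time count_bits_by_extraction with each variant plugged in.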
Last modified: Thu, 15 Apr 21 08:11:13 -0700