Author: Gerd Isenberg
Date: 08:47:35 08/08/02
Go up one level in this thread
On August 08, 2002 at 08:37:48, Gian-Carlo Pascutto wrote:
>On August 08, 2002 at 08:11:13, Vincent Diepeveen wrote:
>
>
>>>This is very short, but in tests with my Athlon XP, it is no faster than the
>>>version with branches and checks for zeroed bitmaps.
>>
>>that's of course a dumb test to do. you should try each time a random
>>chosen bitmap.
>
>I think you misread what he wrote. He didn't test with zero bitmaps.
>
>The problem of the short routine is that bsf is very slow on the Athlon,
>and it can only be handled by the vector decoder.
>
>In a bitboard program, typical case is to have an empty half-bitboard, and
>avoiding the bsf is faster even if it's occasionally mispredicted.
>
>--
>GCP
I can confim this and was very surprised, that this (introduced by Tim and
slighly modified) was significant faster in average ...
// precondition: bb not null
__forceinline unsigned int BitSearchAndReset(BitBoard &bb)
{
__asm
{
xor edx, edx
mov ebx, [bb]
mov eax, edx
inc edx
bsf ecx, [ebx]
jnz found
bsf ecx, [ebx + 4]
lea ebx, [ebx + 4]
xor eax, 32
found:
shl edx, cl
xor eax, ecx
xor [ebx], edx
}
}
... than this short one, ...
// precondition: bb not null
__forceinline unsigned int BitSearchAndReset(BitBoard &bb)
{
__asm
{
mov edx, [bb]
bsf eax, [edx+4]
xor eax, 32
bsf eax, [edx]
btr [edx],eax
}
}
... even if it used 4 registers instead of two. Three vector path instructions
in a row seem to be real performance killers on athlon.
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.