Computer Chess Club Archives



Subject: Re: Fast 3DNow! BitScan, one more faster

Author: Gerd Isenberg

Date: 02:08:28 12/02/02



On December 01, 2002 at 21:04:54, Arshad F. Syed wrote:

>I am not much of an expert on this, since I am just starting to get into
>assembly programming. But I was curious about the dword used here, which implies
>to me you are doing 32-bit stuff. Wouldn't it make life simpler to just get an
>Itanium processor and switch to 64-bit code, which would cut down the cycles
>used.
>
>Regards,
>Arshad

Hi Arshad,

Yes, the PI2FD instruction performs two parallel conversions from 32-bit integer
to 32-bit float. This 3DNow! instruction will also be available on AMD's Hammer.

Hammer also has the 64-bit CVTSI2SD and CVTSI2SS instructions (already part of
the P4's SSE/SSE2). They may be a good alternative if bsf/bsr turn out to be as
relatively slow on Hammer as on Athlons (vector-path instructions that block all
other pipes, with a huge execution latency), even though only a single
bsf reg64,reg64 would be needed.

CVTSI2SD xmm, reg/mem64 F2 0F 2A /r converts a quadword integer in a
general-purpose register or 64-bit memory location to a double-precision
floating-point value in the destination XMM register.

CVTSI2SS xmm, reg/mem64 F3 0F 2A /r converts a quadword integer in a
general-purpose register or 64-bit memory location to a single-precision
floating-point value in the destination XMM register.

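int getBitIndexOnHammer(BitBoard singleBit)
{
	__asm
	{
		mov	 rax, [singleBit]
		CVTSI2SS xmm0, rax	; 2^n as float, biased exponent = 0x7f+n
		movd	 eax, xmm0	; a single-precision float is only 32 bits wide
		shr	 eax, 23	; sign bit and biased exponent
		sub	 eax, 0x7f	; remove the exponent bias
		and	 eax, 0x3f	; drop the sign (bit 63 converts to -2^63)
	}				; result is left in eax
}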
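
For anyone who wants to try the exponent trick without a Hammer, here is a minimal C sketch of the same idea. The function name bit_index_via_float is made up for illustration, and the code assumes IEEE 754 single-precision floats and a two's-complement conversion from uint64_t to int64_t. It is a model of what the assembly above computes, not a measured replacement for it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable model of the CVTSI2SS routine: convert the
   single-bit board to float and read the bit index from the exponent.
   Assumes IEEE 754 binary32 floats and two's-complement integers. */
static int bit_index_via_float(uint64_t single_bit)
{
    /* Precondition: exactly one bit is set. */
    assert(single_bit != 0 && (single_bit & (single_bit - 1)) == 0);

    /* Signed conversion, like CVTSI2SS: bit 63 becomes -2^63. */
    float f = (float)(int64_t)single_bit;

    /* Reinterpret the float's bit pattern without aliasing problems. */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* Shift out the 23 mantissa bits, remove the bias 0x7f, and mask
       to 6 bits so the sign bit of the bit-63 case is discarded. */
    return (int)(((bits >> 23) - 0x7f) & 0x3f);
}
```

A power of two is always exactly representable as a float, so no rounding can disturb the exponent.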

I find it amazing that, at least when doing two bitscans in parallel, the
popCount approach, and even more so a fast integer/float conversion (2.5 times
as fast if you have already computed b & -b), clearly outperforms the bsf pair
for 64 bits on Athlon.
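
To make the (b & -b) step concrete, here is a small C sketch that isolates the lowest set bit and then takes its index from a double-precision conversion, which is roughly what the CVTSI2SD variant would do. The names lowest_bit_index and collect_bit_indices are invented for this example, and the code assumes IEEE 754 doubles and two's-complement integers. It says nothing about speed; it only shows the arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: index of the lowest set bit of a nonzero board,
   using b & -b followed by an integer-to-double conversion. */
static int lowest_bit_index(uint64_t b)
{
    assert(b != 0);

    /* b & -b keeps only the lowest set bit. */
    uint64_t single = b & ((uint64_t)0 - b);

    /* Signed conversion, like CVTSI2SD: bit 63 becomes -2^63. */
    double d = (double)(int64_t)single;

    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    /* Doubles have 52 mantissa bits and an exponent bias of 0x3ff. */
    return (int)(((bits >> 52) - 0x3ff) & 0x3f);
}

/* Hypothetical usage: write the indices of all set bits to out[],
   lowest first, and return how many were written. */
static int collect_bit_indices(uint64_t b, int out[64])
{
    int n = 0;
    while (b != 0) {
        out[n++] = lowest_bit_index(b);
        b &= b - 1;   /* clear the bit that was just reported */
    }
    return n;
}
```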

Gerd

>
>
>On December 01, 2002 at 17:05:06, Gerd Isenberg wrote:
>
>>Oops, something shorter and faster:
>>
>>int getBitIndex(BitBoard singleBit)
>>{
>>	__asm
>>	{
>>		pxor	mm2, mm2	; 0
>>		movd		mm0, [singleBit]
>>		punpckldq	mm0, [singleBit+4]
>>		pcmpeqd	mm6, mm6	; -1
>>		pxor	mm7, mm7	; 0
>>		pcmpeqd	mm2, mm0	; ~mask of the nonzero dword
>>		PI2FD	mm1, mm0	; 3f8..,400..
>>		pxor	mm2, mm6	; mask of the nonzero dword
>>		psrlq	mm6, 63		; 01
>>		psrld	mm1, 23		; 3f8 to 7f
>>		psrld	mm2, 25		; 7f mask
>>		psllq	mm6, 32+5	; 20:00
>>		psubd	mm1, mm2	; - 7f mask
>>		por	mm1, mm6	; + 32 in high dword
>>		pand	mm1, mm2	; & 7f mask
>>		psadbw	mm1, mm7	; add all bytes
>>		movd	eax, mm1
>>	}
>>}
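
The data flow of the quoted 3DNow! routine can be followed in plain C by treating the two dwords as two lanes. This is only a scalar model under the assumption of IEEE 754 floats and two's-complement integers; the name bit_index_two_dwords is hypothetical, and the loop stands in for what the MMX registers do in parallel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the quoted routine: each loop pass
   plays the role of one 32-bit lane of the MMX registers. */
static int bit_index_two_dwords(uint64_t single_bit)
{
    assert(single_bit != 0 && (single_bit & (single_bit - 1)) == 0);

    uint32_t dword[2] = { (uint32_t)single_bit,
                          (uint32_t)(single_bit >> 32) };
    uint32_t sum = 0;

    for (int i = 0; i < 2; ++i) {
        /* pcmpeqd / pxor / psrld 25: 0x7f in the nonzero lane, else 0. */
        uint32_t mask = (dword[i] != 0) ? 0x7f : 0;

        /* PI2FD: signed 32-bit integer to float (bit 31 gives -2^31). */
        float f = (float)(int32_t)dword[i];
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);

        uint32_t lane = (bits >> 23) - mask;  /* psrld 23, psubd      */
        if (i == 1)
            lane |= 0x20;                     /* por: +32, high dword */
        lane &= mask;                         /* pand: drop sign/zero */

        sum += lane;                          /* psadbw adds the lanes */
    }
    return (int)sum;
}
```

The final mask is what makes the zero lane contribute nothing and removes the sign bit that appears when bit 31 of a dword is the one that is set.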




Last modified: Thu, 15 Apr 21 08:11:13 -0700
