Computer Chess Club Archives

Subject: Re: Fast 3DNow! BitScan, one more faster

Author: Robert Hyatt
Date: 07:24:10 12/02/02
On December 02, 2002 at 05:08:28, Gerd Isenberg wrote:

>On December 01, 2002 at 21:04:54, Arshad F. Syed wrote:
>
>>I am not much of an expert on this, since I am just starting to get into
>>assembly programming. But I was curious about the dword used here, which implies
>>to me you are doing 32-bit stuff. Wouldn't it make life simpler to just get an
>>Itanium processor and switch to 64-bit code, which would cut down the cycles
>>used.
>>
>>Regards,
>>Arshad
>
>Hi Arshad,
>
>Yes, the PI2FD instruction does two parallel 32-bit integer to 32-bit float
>conversions. This 3DNow! instruction will also be available on AMD's Hammer.
>
>Hammer also has the 64-bit CVTSI2SD and CVTSI2SS instructions available (their
>32-bit forms are already part of the P4's SSE/SSE2), which may be a good
>alternative if bsf/bsr is as relatively slow on Hammer as on Athlons (vector
>path instruction, blocking all other pipes, huge execution latency), even if
>only one bsf reg64,reg64 is necessary.
>
>CVTSI2SD xmm, reg/mem64 F2 0F 2A /r converts a quadword integer in a
>general-purpose register or 64-bit memory location to a double-precision
>floating-point value in the destination XMM register.
>
>CVTSI2SS xmm, reg/mem64 F3 0F 2A /r converts a quadword integer in a
>general-purpose register or 64-bit memory location to a single-precision
>floating-point value in the destination XMM register.
>
>int getBitIndexOnHammer(BitBoard singleBit)
>{
>	__asm
>	{
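>		mov	rax, [singleBit]
>		CVTSI2SS	xmm0, rax	; signed 64-bit integer to float
>		movd	eax, xmm0	; float bits sit in the low dword; upper rax is zeroed
>		shr	rax, 23		; biased exponent (sign lands in bit 8)
>		sub	rax, 0x7f	; remove the exponent bias
>		and	rax, 0x3f	; keep the 6-bit index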
>	}
>}
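
For readers without a Hammer at hand, the same exponent trick can be sketched in portable C. This is only an illustration, not Gerd's code: the name `bit_index_via_float` is made up here, and it assumes IEEE-754 single precision and a two's complement `int64_t` conversion (so that bit 63 becomes -2^63, as with the signed CVTSI2SS).

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t BitBoard;

/* Hypothetical helper: index of the only set bit, via the float exponent.
   Assumes IEEE-754 float and that singleBit has exactly one bit set. */
int bit_index_via_float(BitBoard singleBit)
{
    /* Signed conversion, mirroring CVTSI2SS; a power of two converts exactly. */
    float f = (float)(int64_t)singleBit;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* reinterpret the float as raw bits */
    /* shr 23: biased exponent, with the sign bit landing in bit 8
       sub 0x7f: remove the bias
       and 0x3f: drop the sign bit, keep the 6-bit index */
    return (int)(((bits >> 23) - 0x7f) & 0x3f);
}
```

The final mask is what makes bit 63 work: the sign bit of -2^63 ends up in bit 8 after the shift and is simply discarded.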
>
>I find it amazing that, at least for two parallel bitscans, the popCount
>approach, and even more so a fast integer/float conversion (2.5 times as fast
>if you have already done (b&-b)), clearly outperforms the bsf pair for 64 bits
>on the Athlon.
>
>Gerd
>
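
As a rough illustration of the (b&-b) step and the popCount approach mentioned in the quoted post, here is a small C sketch. The function names are hypothetical, the popcount is a generic SWAR version rather than anything from the thread, and deriving the index as popcount(lsb - 1) is an assumption about what "the popCount approach" refers to.

```c
#include <stdint.h>

typedef uint64_t BitBoard;

/* Isolate the least significant one bit (b & -b); gives 0 for an empty board. */
BitBoard lowest_bit(BitBoard b)
{
    return b & (0 - b);
}

/* Generic SWAR population count (not taken from the original posts). */
int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int)((x * 0x0101010101010101ULL) >> 56);
}

/* Index of the lowest set bit: lsb - 1 has exactly 'index' one bits. */
int bit_index_via_popcount(BitBoard b)
{
    return popcount64(lowest_bit(b) - 1);
}
```

For an empty board `lowest_bit(b) - 1` wraps to all ones, so the sketch returns 64 there; callers would normally test for zero first.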

Certainly means that AMD is not taking the "crypto" guys very seriously.  :)
They are the ones that led to all the cute stuff on the Crays: a single
instruction to count leading zeros (find first one bit is what it does)
and the popcnt instruction.
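
For anyone who has not seen a leading-zero count, a portable C sketch of what such an instruction computes might look like the following. The name `leading_zeros64` is made up for illustration, and this binary-search version only shows the result, not how any hardware does it.

```c
#include <stdint.h>

/* Hypothetical helper: number of zero bits above the highest one bit.
   Returns 64 for x == 0. */
int leading_zeros64(uint64_t x)
{
    int n = 0;
    if (x == 0) return 64;
    /* Halve the search window each step, shifting the value left
       whenever the upper part of the window is empty. */
    if ((x >> 32) == 0) { n += 32; x <<= 32; }
    if ((x >> 48) == 0) { n += 16; x <<= 16; }
    if ((x >> 56) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 60) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 62) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 63) == 0) { n += 1; }
    return n;
}
```

The index of the highest one bit is then `63 - leading_zeros64(x)`, which is the "find first one bit" use mentioned above.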

Leaving out a fast bsf/bsr means the crypto guys are not going to like the
performance...  Of course, they may not consider NSA-type folks as their
primary market.  Although those types spend a _bunch_ of "black money" yearly
on their wild machines...



>>
>>
>>On December 01, 2002 at 17:05:06, Gerd Isenberg wrote:
>>
>>>Oops, something shorter and faster:
>>>
>>>int getBitIndex(BitBoard singleBit)
>>>{
>>>	__asm
>>>	{
>>>		pxor	mm2, mm2	; 0
>>>		movd		mm0, [singleBit]
>>>		punpckldq	mm0, [singleBit+4]
>>>		pcmpeqd	mm6, mm6	; -1
>>>		pxor	mm7, mm7	; 0
>>>		pcmpeqd	mm2, mm0	; ~mask of the nonzero dword
>>>		PI2FD	mm1, mm0	; 3f8..,400..
>>>		pxor	mm2, mm6	; mask of the nonzero dword
>>>		psrlq	mm6, 63		; 01
>>>		psrld	mm1, 23		; 3f8 to 7f
>>>		psrld	mm2, 25		; 7f mask
>>>		psllq	mm6, 32+5	; 20:00
>>>		psubd	mm1, mm2	; - 7f mask
>>>		por	mm1, mm6	; + 32 in high dword
>>>		pand	mm1, mm2	; & 7f mask
>>>		psadbw	mm1, mm7	; add all bytes
>>>		movd	eax, mm1
>>>	}
>>>}
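
To follow what the MMX/3DNow! routine above does lane by lane, here is a scalar C sketch of the same steps. It is a reading aid under stated assumptions, not a replacement: the function names are hypothetical, IEEE-754 floats and two's complement conversion to `int32_t` are assumed, and the two dwords are handled as separate variables instead of one 64-bit register.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t BitBoard;

/* Raw bits of (float)v, standing in for one lane of PI2FD. */
static uint32_t float_bits_of_int32(int32_t v)
{
    float f = (float)v;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical scalar walk-through; singleBit must have exactly one bit set. */
int bit_index_two_dwords(BitBoard singleBit)
{
    uint32_t lo = (uint32_t)singleBit;
    uint32_t hi = (uint32_t)(singleBit >> 32);

    /* pcmpeqd / pxor / psrld 25: 0x7f in the lane holding the bit, 0 in the other */
    uint32_t maskLo = lo ? 0x7f : 0;
    uint32_t maskHi = hi ? 0x7f : 0;

    /* PI2FD then psrld 23: biased exponent per lane (0 for the empty lane) */
    uint32_t expLo = float_bits_of_int32((int32_t)lo) >> 23;
    uint32_t expHi = float_bits_of_int32((int32_t)hi) >> 23;

    /* psubd removes the bias, por adds 32 in the high lane, pand masks */
    uint32_t idxLo = (expLo - maskLo) & maskLo;
    uint32_t idxHi = ((expHi - maskHi) | 0x20) & maskHi;

    /* psadbw sums all bytes; one lane is 0 and the other is below 0x80,
       so a plain addition gives the same result here. */
    return (int)(idxLo + idxHi);
}
```

The 0x7f mask also takes care of bits 31 and 63, where the signed conversion yields a negative float and the sign bit would otherwise leak into the result.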
Last modified: Thu, 15 Apr 21 08:11:13 -0700