Author: Robert Hyatt
Date: 07:24:10 12/02/02
On December 02, 2002 at 05:08:28, Gerd Isenberg wrote:

>On December 01, 2002 at 21:04:54, Arshad F. Syed wrote:
>
>>I am not much of an expert on this, since I am just starting to get into
>>assembly programming. But I was curious about the dword used here, which implies
>>to me you are doing 32-bit stuff. Wouldn't it make life simpler to just get an
>>Itanium processor and switch to 64-bit code, which would cut down the cycles
>>used.
>>
>>Regards,
>>Arshad
>
>Hi Arshad,
>
>Yes, the PI2FD instruction does two parallel 32-bit integer to 32-bit float
>conversions. This 3DNow! instruction will also be available with AMD's Hammer.
>
>Hammer also has the 64-bit CVTSI2SD or CVTSI2SS instruction available (already
>a member of P4's SSE), which may be a good alternative if bsf/bsr is also as
>relatively slow on Hammer as on Athlons (vector path instruction, blocking all
>other pipes, huge execution latency), even if there is only one bsf reg64,reg64
>necessary.
>
>CVTSI2SD xmm, reg/mem64 F2 0F 2A /r converts a quadword integer in a
>general-purpose register or 64-bit memory location to a double-precision
>floating-point value in the destination XMM register.
>
>CVTSI2SS xmm, reg/mem64 F3 0F 2A /r converts a quadword integer in a
>general-purpose register or 64-bit memory location to a single-precision
>floating-point value in the destination XMM register.
>
>int getBitIndexOnHammer(BitBoard singleBit)
>{
>    __asm
>    {
>        mov      rax, [singleBit]
>        CVTSI2SS xmm0, rax
>        movq?    rax, xmm0
>        shr      rax, 23
>        sub      rax, 0x7f
>        and      rax, 0x3f
>    }
>}
>
>I find it amazing that, at least for doing two parallel bitscans, the popCount
>approach, or even more so a fast integer/float conversion (2.5 times as fast, if
>you have already done (b&-b)), clearly outperforms the bsf pair for 64 bits on
>Athlon.
>
>Gerd
>

Certainly means that AMD is not taking the "crypto" guys very seriously. :)
They are the ones that led to all the cute stuff on the Crays: a single
instruction to count leading zeros (which is what find-first-one-bit does) and
the popcnt instruction. Leaving out a fast bsf/bsr means the crypto guys are
not going to like the performance... Of course, AMD may not consider NSA-type
folks their primary market, although those types spend a _bunch_ of "black
money" yearly on their wild machines...

>>
>>
>>On December 01, 2002 at 17:05:06, Gerd Isenberg wrote:
>>
>>>oops, something shorter and faster:
>>>
>>>int getBitIndex(BitBoard singleBit)
>>>{
>>>    __asm
>>>    {
>>>        pxor      mm2, mm2          ; 0
>>>        movd      mm0, [singleBit]
>>>        punpckldq mm0, [singleBit+4]
>>>        pcmpeqd   mm6, mm6          ; -1
>>>        pxor      mm7, mm7          ; 0
>>>        pcmpeqd   mm2, mm0          ; ~mask of the nonzero dword
>>>        PI2FD     mm1, mm0          ; 3f8..,400..
>>>        pxor      mm2, mm6          ; mask of the nonzero dword
>>>        psrlq     mm6, 63           ; 01
>>>        psrld     mm1, 23           ; 3f8 to 7f
>>>        psrld     mm2, 25           ; 7f mask
>>>        psllq     mm6, 32+5         ; 20:00
>>>        psubd     mm1, mm2          ; - 7f mask
>>>        por       mm1, mm6          ; + 32 in high dword
>>>        pand      mm1, mm2          ; & 7f mask
>>>        psadbw    mm1, mm7          ; add all bytes
>>>        movd      eax, mm1
>>>    }
>>>}
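
For anyone who wants to try the conversion trick without an assembler, here is a minimal portable C sketch of the idea behind the quoted Hammer routine: convert a single-bit value to a float and read the bit index out of the exponent field. The names bit_index_via_float and lowest_bit_index are hypothetical, not from the thread, and the code assumes IEEE 754 single-precision floats and a nonzero argument. It converts the value as unsigned, so it masks the 8-bit exponent instead of using the "and 0x3f" of the signed CVTSI2SS version. Whether it beats bsf on a given CPU would have to be measured.

```c
#include <stdint.h>
#include <string.h>

/* Index (0..63) of the only set bit in single_bit.
   Assumption: exactly one bit is set and float is IEEE 754 binary32.
   A power of two converts to float exactly, so the biased exponent
   field holds 127 + index. */
int bit_index_via_float(uint64_t single_bit)
{
    float f = (float)single_bit;          /* unsigned conversion, exact here */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       /* reinterpret without aliasing issues */
    return (int)((bits >> 23) & 0xFF) - 127;
}

/* Index of the least significant set bit of a nonzero bitboard.
   b & (~b + 1) is the unsigned form of the (b & -b) isolation step. */
int lowest_bit_index(uint64_t b)
{
    return bit_index_via_float(b & (~b + 1));
}
```

The isolation step matters: converting a value with several bits set would round to 24 significant bits, and the exponent would then reflect the highest set bit (or even the one above it after rounding), not the lowest.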
Last modified: Thu, 15 Apr 21 08:11:13 -0700