Author: Gerd Isenberg
Date: 13:20:49 02/03/04
On February 03, 2004 at 15:59:43, Vincent Diepeveen wrote:

>On February 03, 2004 at 15:02:42, Gerd Isenberg wrote:
>
>>On February 03, 2004 at 11:45:20, Vincent Diepeveen wrote:
>>
>>>On February 03, 2004 at 03:13:29, Gerd Isenberg wrote:
>>>
>>>>On February 03, 2004 at 01:03:29, Jay Urbanski wrote:
>>>>
>>>>>On February 02, 2004 at 22:41:19, Robert Hyatt wrote:
>>>>>
>>>>>>On February 02, 2004 at 20:06:29, David Rasmussen wrote:
>>>>>>
>>>>>>>Does the Opteron have firstBit, lastBit and popCount instructions? Or at least
>>>>>>>something that makes calculating them easier than on x86-32?
>>>>>>>
>>>>>>>/David
>>>>>>
>>>>>>
>>>>>>Has the same BSF/BSR instructions, but no popcnt that I have found. Note
>>>>>>that BSF/BSR work on 64 bit values if you want. I have inline asm to do
>>>>>>all three for gcc if you are interested.
>>>>>
>>>>>I understand there is a popcount instruction. I also understand it's
>>>>>undocumented.
>>>>
>>>>Do you have any opcode or further hints?
>>>>That would be great - a 4 cycle vector path popcount ;-)
>>>
>>>And deadslow.
>>
>>Yes Vincent, if it exists, 4 is quite too optimistic. I guess it is more in a
>>range of 10-40 cycles, bsf is 9. And doing up to four popcounts in parallel as i
>>often do with MMX and/or general purpose is probably faster than using 4
>>deadslow vector path instructions in a row.
>>
>>My current SSE2 favourite is:
>>
>>MASKMOVDQU xmmreg1, xmmreg2 66h 0Fh F7h VectorPath ~ 43 cycles latency ;-)
>>(implements a masked conditional write of up to 16 bytes).
>>
>>But a very interesting SSE2 instruction for eval purposes is:
>>
>>PMADDWD Packed Multiply Words and Add Doublewords
>>Eight 16*16 muls and four 32-bit adds in 4 cycles (double dispatch as most SSE2
>>instructions):
>>
>>c0 = a0*b0 + a1*b1
>>c1 = a2*b2 + a3*b3
>>c2 = a4*b4 + a5*b5
>>c3 = a6*b6 + a7*b7
>
>Is each 16 bits word a real word, so signed integer [-32768..32767]?
>Or is it unsigned integer?

Yes, signed 16-bit words:

There is only one case in which the result of the multiplication and addition will not fit in a signed 32-bit destination. If all four of the 16-bit source operands used to produce a 32-bit multiply-add result have the value 8000h, the 32-bit result is 8000_0000h, which is incorrect.

The problem with SSE is the final horizontal add, which requires some shuffling. But if you have huge eval vectors you want to multiply with game-state-dependent weighting vectors...

>
>If you want to write a graphics program it no doubt for a certain application
>will be great there to have something that is similar to that. However what you
>need for real graphics software is 32 bits floats.
>

Of course you have the default 4 * 32-bit float or 2 * 64-bit double SIMD representation of the 16 xmm registers, with these redundant instructions for each of the three data types. Under Windows for AMD64, SSE becomes the default float/double register file instead of x87, which, shared with MMX, is not saved/restored during context switches and is therefore not usable.

>That's more interesting for graphics software.
>
>Yes it's great instructions for those who want to write their software in those
>worlds.
>
>However anyone who is in the integer world and writing game tree searching
>products must be liking to waste time to SSE* and stuff like PNI (Prescott New
>Instructions) :)
>
>Did you see diep tested at prescott already? www.aceshardware.com

Ahh no, will have a look. A Diep benchmark on different hardware?

>
>Hopefully you optimized your program a little so that it is 10 times faster in
>nps than diep. DIEP will be around 500k nps at 8 processors.

Oops, no chance yet, I still play with my MMX one on Athlon32 - only 300-500K but with huge eval ;-)

>
>See you there.
>
>>See you in Paderborn!
>>
>>Gerd
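
For readers who want to see the PMADDWD arithmetic discussed above spelled out, here is a minimal sketch in plain C. It is only a scalar model of the instruction, not the SIMD code from anyone's engine: the names pmaddwd_model and weighted_eval are made up for illustration, and the 64-bit accumulator is an assumption of this sketch. On real SSE2 hardware the same lane arithmetic should be available through the _mm_madd_epi16 intrinsic from <emmintrin.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one PMADDWD: eight signed 16x16 products, added pairwise
   into four 32-bit lanes (c0 = a0*b0 + a1*b1, ..., c3 = a6*b6 + a7*b7). */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t c[4])
{
    for (int i = 0; i < 4; ++i) {
        /* Form the exact sum in 64 bits, then keep only the low 32 bits,
           which models the wraparound in the all-8000h case. */
        int64_t exact = (int64_t)a[2 * i] * b[2 * i]
                      + (int64_t)a[2 * i + 1] * b[2 * i + 1];
        uint32_t low = (uint32_t)exact;
        memcpy(&c[i], &low, sizeof low);
    }
}

/* Hypothetical helper: dot product of an eval feature vector with a
   weighting vector, eight words per step. n must be a multiple of 8. */
int64_t weighted_eval(const int16_t *features, const int16_t *weights, size_t n)
{
    assert(n % 8 == 0);
    int64_t total = 0;
    for (size_t k = 0; k < n; k += 8) {
        int32_t lanes[4];
        pmaddwd_model(features + k, weights + k, lanes);
        /* The "horizontal add": on SSE2 this step needs shuffles,
           here it is just four scalar additions. */
        total += (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
    return total;
}
```

Calling pmaddwd_model with a0 = a1 = b0 = b1 = 8000h (that is, -32768) gives an exact sum of 2^31, which does not fit a signed 32-bit lane and shows up as 8000_0000h, the single overflow case quoted above. Summing the lanes in a 64-bit total is a convenience of the model; a SIMD version would more likely keep 32-bit lane sums and do one horizontal add at the end.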
Last modified: Thu, 15 Apr 21 08:11:13 -0700