Computer Chess Club Archives

Subject: Re: Opteron Instruction Set

Author: Gerd Isenberg
Date: 13:20:49 02/03/04
On February 03, 2004 at 15:59:43, Vincent Diepeveen wrote:

>On February 03, 2004 at 15:02:42, Gerd Isenberg wrote:
>
>>On February 03, 2004 at 11:45:20, Vincent Diepeveen wrote:
>>
>>>On February 03, 2004 at 03:13:29, Gerd Isenberg wrote:
>>>
>>>>On February 03, 2004 at 01:03:29, Jay Urbanski wrote:
>>>>
>>>>>On February 02, 2004 at 22:41:19, Robert Hyatt wrote:
>>>>>
>>>>>>On February 02, 2004 at 20:06:29, David Rasmussen wrote:
>>>>>>
>>>>>>>Does the Opteron have firstBit, lastBit and popCount instructions? Or at least
>>>>>>>something that makes calculating them easier than on x86-32?
>>>>>>>
>>>>>>>/David
>>>>>>
>>>>>>
>>>>>>Has the same BSF/BSR instructions, but no popcnt that I have found.  Note
>>>>>>that BSF/BSR work on 64 bit values if you want.  I have inline asm to do
>>>>>>all three for gcc if you are interested.
>>>>>
>>>>>I understand there is a popcount instruction.  I also understand it's
>>>>>undocumented.
>>>>
>>>>Do you have any opcode or further hints?
>>>>That would be great - a 4 cycle vector path popcount ;-)
>>>
>>>And deadslow.
>>
>>Yes Vincent, if it exists, 4 is quite too optimistic. I guess it is more in a
>>range of 10-40 cycles, bsf is 9. And doing up to four popcounts in parallel as i
>>often do with MMX and/or general purpose is probably faster than using 4
>>deadslow vector path instructions in a row.
>>
>>My current SSE2 favourite is:
>>
>>MASKMOVDQU xmmreg1, xmmreg2 66h 0Fh F7h VectorPath ~ 43 cycles latency ;-)
>>(implements a masked conditional write of up to 16 bytes).
>>
>>But a very interesting SSE2 instruction for eval purposes is:
>>
>>PMADDWD Packed Multiply Words and Add Doublewords
>>Eight 16*16 muls and four 32-bit adds in 4 cycles (double dispatch as most SSE2
>>instructions):
>>
>>c0 = a0*b0 + a1*b1
>>c1 = a2*b2 + a3*b3
>>c2 = a4*b4 + a5*b5
>>c3 = a6*b6 + a7*b7
>
>Is each 16 bits word a real word, so signed integer [-32768..32767]?
>Or is it unsigned integer?

Yes, signed 16-bit words.

There is only one case in which the result of the multiplication and addition
will not fit in a signed 32-bit destination. If all four of the 16-bit source
operands used to produce a 32-bit multiply-add result have the value 8000h, the
32-bit result is 8000_0000h, which is incorrect.
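
To make the overflow case concrete, here is a small scalar model of what PMADDWD computes. It is only a sketch of the semantics, not how one would call the instruction; the name pmaddwd_model is made up for this illustration.

```c
#include <stdint.h>

/* Scalar model of PMADDWD (hypothetical helper name, illustration only).
   a and b each hold eight signed 16-bit words; c receives the four
   32-bit sums of adjacent products, as in c0 = a0*b0 + a1*b1. */
static void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t c[4])
{
    for (int i = 0; i < 4; ++i) {
        /* Each 16*16 product always fits in a signed 32-bit value. */
        int32_t lo = (int32_t)a[2 * i]     * b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Add as unsigned so the wraparound is well defined in C.
           The conversion back to int32_t assumes the usual
           two's-complement behaviour (true for gcc, clang, msvc). */
        c[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}
```

With all four source words of one pair set to 8000h (-32768), both products are 4000_0000h and their sum wraps to 8000_0000h, a negative number, although the true result is +2^31.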

The problem with SSE is the final horizontal add, which requires some shuffling.
But if you have huge eval vectors that you want to multiply with game-state-dependent
weighting vectors...
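
A rough sketch of such a weighted eval sum with SSE2 intrinsics, including the final horizontal add via two shuffles. The function name eval_dot, the array layout and the requirement that n is a multiple of 8 are assumptions for this example only; the scalar branch is just a fallback for compilers without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: dot product of n signed 16-bit eval features
   with n signed 16-bit weights. Assumes n is a multiple of 8 and that
   the running sums do not overflow 32 bits. */
static int32_t eval_dot(const int16_t *feat, const int16_t *weight, size_t n)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 8) {
        __m128i f = _mm_loadu_si128((const __m128i *)(feat + i));
        __m128i w = _mm_loadu_si128((const __m128i *)(weight + i));
        /* PMADDWD: eight 16*16 muls, pairwise added into four dwords. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(f, w));
    }
    /* Horizontal add: swap the 64-bit halves and add, then swap the
       adjacent dwords and add. Lane 0 then holds the total. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
#else
    /* Portable fallback with the same result. */
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)feat[i] * weight[i];
    return sum;
#endif
}
```

The shuffling is only paid once per vector pair, after the loop, so its cost shrinks relative to the multiply-adds as the vectors get longer.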

>
>If you want to write a graphics program it no doubt for a certain application
>will be great there to have something that is similar to that. However what you
>need for real graphics software is 32 bits floats.
>

Of course, the 16 xmm registers also have the default SIMD representations of
4 * 32-bit float or 2 * 64-bit double, with these redundant instructions for
each of the three data types.

Under Windows for AMD64, SSE becomes the default float/double register file
instead of x87, which is shared with MMX, is not saved/restored during context
switches and is therefore not usable.


>That's more interesting for graphics software.
>
>Yes it's great instructions for those who want to write their software in those
>worlds.
>
>However anyone who is in the integer world and writing game tree searching
>products must be liking to waste time to SSE* and stuff like PNI (Prescott New
>Instructions) :)
>
>Did you see diep tested at prescott already? www.aceshardware.com

ahh no, will have a look. A Diep benchmark on different hardware?

>
>Hopefully you optimized your program a little so that it is 10 times faster in
>nps than diep. DIEP will be around 500k nps at 8 processors.

Oops, no chance yet. I still play with my MMX one on a 32-bit Athlon - only
300-500K but with huge eval ;-)


>
>See you there.
>
>>See you in Paderborn!
>>
>>Gerd
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.