Author: Gerd Isenberg
Date: 11:54:13 01/14/03
Go up one level in this thread
On January 14, 2003 at 12:52:58, Matt Taylor wrote:
>On January 13, 2003 at 20:17:36, Russell Reagan wrote:
>
>>On January 13, 2003 at 18:30:05, Matt Taylor wrote:
>>
>>>I think the real bottleneck would be the misjudgement of the speed of MMX. It is
>>>not as fast to respond as the integer units, though it maintains similar
>>>throughput. Using MMX for 64-bit arithmetic is not worthwhile as the same
>>>operations are available from the integer unit with lower setup costs. The only
>>>advantages include a minor gain in parallelism in hand-tweaked code and
>>>additional register space.
>>
>>Apparently if you use MMX correctly, it will be significantly faster than the
>>corresponding routine written in C (if it relies on 64-bit operations). The
>>primary example that comes to mind is that Gerd uses MMX in IsiChess to do
>>64-bit operations in the KoggeStone algorithms. He said it gave him a small
>>speed increase. Compare that with the same routines written in C, and the C
>>routines will be significantly slower. I know this because I wrote a program
>>using those routines in C and it got about 70 knps (compare with Crafty
>>300-500knps), and all it did was alpha-beta, material + mobility eval, and
>>nothing else. I tried several bitboard implementations, and the common factor in
>>the slow ones was the C KoggeStone attack generation. Gerd didn't experience
>>such a significant speed hit when he used his MMX routines. So it does appear
>>that there is a misjudgement of the speed of using MMX, but I'm not sure whether
>>it is an underestimation or overestimation.
>
>MMX is probably faster than straight C in some cases, but if you write the
>64-bit stuff in assembly using the main integer instructions, it will almost
>always be faster.
Hi Matt,
No - not on 32 bit hardware, or do you already mention Athlon64?
Fill-Stuff like KoggeStone and dumb7fill is one of these cases. It gains a lot
from available 64-bit or better 128-bit registers and doing parallel fills in
several directions.
One may use MMX- or hammers gp-registers for propagator computation (based on
empty squares) and SSE2 for a doubled generator, eg. for black/white or to get
two disjoint attack sets for sliders simultaniously.
>The latency of an ALU instruction
>(bitwise/arithmatic/conditional) is 1, and it has been ever since the 486. The
>latency for similar arithmatic MMX instructions on my Athlon is 2 clocks, and on
>a Pentium 4 it is 2 or worse. On the same processors, you can do 64-bit
>operations usually in 1 clock.
Yes, but my current getQueenAttacks, based on KoggeStone, dumb7fill and x^x-2
has an effective instruction latency of 0.5 cycles (up to four MMX-instruction
per time!).
>
>The only advantage to MMX is the extra registers you now have access to, but in
>my experiences code rarely saturates more than one of the 3 instruction sets
>(integer, FP, vector). Furthermore, movement of data between MMX registers and
>integers is horrifically slow, and if you mix with floating-point, you have to
>execute another slow instruction -- emms.
>
Yes, but chess engines don't use floating point instructions so much, i guess.
The vector path movd is really slow, i'll hope this becomes better with hammer.
Currently i use a final movq via aligned memory.
>I think greater performance can be achieved in hand-tweaked, purely-integer
>assembly. Unfortunately I do not have time right now to prove that theory, but
>if I ever get a chance, I will be sure to post some code.
>
>-Matt
Ok, try this one. My first, not optimal MMX-approach:
Cheers,
Gerd
------------------------------------------------------------------------------
BitBoard CNode::getQueenAttacks (UINT sq) const
{
__asm
{
pcmpeqd mm6, mm6 ; 0xffffffffffffffff
pxor mm1, mm1 ; 0x0000000000000000
mov ecx, [this]
psubd mm1, mm6 ; 0x0000000100000001
movd mm3, [sq]
psrlq mm1, 32
psllq mm1, mm3
pxor mm6, [ecx].m_Inc.m_PieceBB ; occupied -> empty
}
// input: mm1 queens
// mm6 emptySquares
// output: mm0 queenAttacks
//============================
__asm
{ // up/down Kogge-Stone
movq mm2, mm1 ; generator upg
movq mm4, mm1 ; generator dng
movq mm5, mm6 ; propagator upp
movq mm7, mm6 ; propagator dnp
psllq mm2, 8 ; upg << 8
psrlq mm4, 8 ; dng >> 8
psllq mm5, 8 ; upp << 8
psrlq mm7, 8 ; dnp >> 8
pand mm2, mm6 ; upp & (upg << 8)
pand mm4, mm6 ; dnp & (dng >> 8)
por mm2, mm1 ; upg |= upp & (upg << 8)
por mm4, mm1 ; dng |= dnp & (dng >> 8)
pand mm5, mm6 ; upp &= upp << 8
pand mm7, mm6 ; dnp &= dnp >> 8
movq mm0, mm2 ; upg
movq mm3, mm4 ; dng
psllq mm2, 16 ; upg << 16
psrlq mm4, 16 ; dng >> 16
pand mm2, mm5 ; upp & (upg << 16)
pand mm4, mm7 ; dnp & (dng >> 16)
por mm2, mm0 ; upg |= upp & (upg << 16)
por mm4, mm3 ; dng |= dnp & (dng >> 16)
movq mm0, mm5 ; upp
movq mm3, mm7 ; dnp
psllq mm5, 16 ; upp << 16
psrlq mm7, 16 ; dnp >> 16
pand mm5, mm0 ; upp &= p << 16
movq mm0, mm2 ; upg
pand mm7, mm3 ; dnp &= p >> 16
movq mm3, mm4 ; dng
psrlq mm4, 32 ; dng >> 32
psllq mm0, 32 ; upg << 32
pand mm0, mm5 ; upp & (upg << 32)
// right with x^(x-2)
pcmpeqd mm5, mm5 ; 0xffffffffffffffff -1
pand mm4, mm7 ; dnp & (dng >> 32)
pcmpeqd mm7, mm7 ; 0xffffffffffffffff -1
por mm0, mm2 ; upg |= upp & (upg << 32)
pxor mm5, mm6 ; not empty ==> occupied
por mm4, mm3 ; dng |= dnp & (dng >> 32)
por mm5, mm1 ; force queens as subset of occupied
psllq mm0, 8 ; final shift up ==> queenAttacks
psrlq mm4, 8 ; final shift dn
movq mm3, mm5 ; occupied
psubb mm5, mm1 ; occupied - rooks
psubb mm5, mm1 ; occupied - 2*rooks
por mm0, mm4 ; queenAttacks |= down
pxor mm4, mm4 ; 0x0000000000000000
psubb mm4, mm7 ; 0x0101010101010101 0-(-1)
pxor mm5, mm3 ; right := occupied ^ (occupied - 2*rooks)
// diagonals and left dumb7fill
movq mm2, mm1 ; leftup
psubb mm7, mm4 ; 0xfefefefefefefefe notA 0xff-0x01
por mm0, mm5 ; queenAttacks |= right
movq mm5, mm7 ; 0xfefefefefefefefe notA
pand mm2, mm7 ; clear left occupied or a file
psrlq mm5, 1 ; 0x7f7f7f7f7f7f7f7f notH
movq mm3, mm2 ; leftdown
pand mm1, mm5 ; clear right occupied or h file
pand mm7, mm6 ; to clear left occupied or a-file
pand mm6, mm5 ; to clear right occupied or h-file
movq mm4, mm1 ; rightdown
movq mm5, mm2 ; left
// 1. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
pand mm1, mm6 ; clear rightup occupied or h file
pand mm4, mm6 ; clear rightdown occupied or h file
pand mm2, mm7 ; clear leftup occupied or a file
pand mm3, mm7 ; clear leftdown occupied or a file
pand mm5, mm7 ; clear left occupied or a file
// 2. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
pand mm1, mm6 ; clear rightup occupied or h file
pand mm4, mm6 ; clear rightdown occupied or h file
pand mm2, mm7 ; clear leftup occupied or a file
pand mm3, mm7 ; clear leftdown occupied or a file
pand mm5, mm7 ; clear left occupied or a file
// 3. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
pand mm1, mm6 ; clear rightup occupied or h file
pand mm4, mm6 ; clear rightdown occupied or h file
pand mm2, mm7 ; clear leftup occupied or a file
pand mm3, mm7 ; clear leftdown occupied or a file
pand mm5, mm7 ; clear left occupied or a file
// 4. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
pand mm1, mm6 ; clear rightup occupied or h file
pand mm4, mm6 ; clear rightdown occupied or h file
pand mm2, mm7 ; clear leftup occupied or a file
pand mm3, mm7 ; clear leftdown occupied or a file
pand mm5, mm7 ; clear left occupied or a file
// 5. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
pand mm1, mm6 ; clear rightup occupied or h file
pand mm4, mm6 ; clear rightdown occupied or h file
pand mm2, mm7 ; clear leftup occupied or a file
pand mm3, mm7 ; clear leftdown occupied or a file
pand mm5, mm7 ; clear left occupied or a file
// 6. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
pand mm1, mm6 ; clear rightup occupied or h file
pand mm4, mm6 ; clear rightdown occupied or h file
pand mm2, mm7 ; clear leftup occupied or a file
pand mm3, mm7 ; clear leftdown occupied or a file
pand mm5, mm7 ; clear left occupied or a file
// 7. fill diagonals and left
psllq mm1, 9 ; rightup
psrlq mm4, 7 ; rightdown
psllq mm2, 7 ; leftup
psrlq mm3, 9 ; leftdown
psrlq mm5, 1 ; left
por mm0, mm1 ; queenAttacks |= rightup
por mm0, mm4 ; queenAttacks |= rightdown
por mm0, mm2 ; queenAttacks |= leftup
por mm0, mm3 ; queenAttacks |= leftdown
por mm0, mm5 ; queenAttacks |= left
}
__asm
{
pswapd mm1, mm0
movd eax, mm0
movd edx, mm1
}
}
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.