Author: Gerd Isenberg
Date: 11:54:13 01/14/03
On January 14, 2003 at 12:52:58, Matt Taylor wrote:

>On January 13, 2003 at 20:17:36, Russell Reagan wrote:
>
>>On January 13, 2003 at 18:30:05, Matt Taylor wrote:
>>
>>>I think the real bottleneck would be the misjudgement of the speed of MMX. It is
>>>not as fast to respond as the integer units, though it maintains similar
>>>throughput. Using MMX for 64-bit arithmetic is not worthwhile as the same
>>>operations are available from the integer unit with lower setup costs. The only
>>>advantages include a minor gain in parallelism in hand-tweaked code and
>>>additional register space.
>>
>>Apparently if you use MMX correctly, it will be significantly faster than the
>>corresponding routine written in C (if it relies on 64-bit operations). The
>>primary example that comes to mind is that Gerd uses MMX in IsiChess to do
>>64-bit operations in the KoggeStone algorithms. He said it gave him a small
>>speed increase. Compare that with the same routines written in C, and the C
>>routines will be significantly slower. I know this because I wrote a program
>>using those routines in C and it got about 70 knps (compare with Crafty
>>300-500knps), and all it did was alpha-beta, material + mobility eval, and
>>nothing else. I tried several bitboard implementations, and the common factor in
>>the slow ones was the C KoggeStone attack generation. Gerd didn't experience
>>such a significant speed hit when he used his MMX routines. So it does appear
>>that there is a misjudgement of the speed of using MMX, but I'm not sure whether
>>it is an underestimation or overestimation.
>
>MMX is probably faster than straight C in some cases, but if you write the
>64-bit stuff in assembly using the main integer instructions, it will almost
>always be faster.

Hi Matt,

No, not on 32-bit hardware. Or are you already talking about Athlon64?
Fill stuff like KoggeStone and dumb7fill is one of the cases where MMX is faster.
It gains a lot from the available 64-bit, or better 128-bit, registers and from
doing parallel fills in several directions. One may use MMX or Hammer's GP
registers for the propagator computation (based on empty squares) and SSE2 for a
doubled generator, e.g. for black/white, or to get two disjoint attack sets for
sliders simultaneously.

>The latency of an ALU instruction
>(bitwise/arithmetic/conditional) is 1, and it has been ever since the 486. The
>latency for similar arithmetic MMX instructions on my Athlon is 2 clocks, and on
>a Pentium 4 it is 2 or worse. On the same processors, you can do 64-bit
>operations usually in 1 clock.

Yes, but my current getQueenAttacks, based on KoggeStone, dumb7fill and x^(x-2),
has an effective instruction latency of 0.5 cycles (up to four MMX instructions
at a time!).

>
>The only advantage to MMX is the extra registers you now have access to, but in
>my experiences code rarely saturates more than one of the 3 instruction sets
>(integer, FP, vector). Furthermore, movement of data between MMX registers and
>integers is horrifically slow, and if you mix with floating-point, you have to
>execute another slow instruction -- emms.
>

Yes, but chess engines don't use floating-point instructions that much, I guess.
The vector-path movd is really slow; I hope this gets better with Hammer.
Currently I use a final movq via aligned memory.

>I think greater performance can be achieved in hand-tweaked, purely-integer
>assembly. Unfortunately I do not have time right now to prove that theory, but
>if I ever get a chance, I will be sure to post some code.
>
>-Matt

Ok, try this one.
My first, not optimal MMX-approach:

Cheers,
Gerd

------------------------------------------------------------------------------

BitBoard CNode::getQueenAttacks (UINT sq) const
{
    __asm
    {
        pcmpeqd mm6, mm6    ; 0xffffffffffffffff
        pxor    mm1, mm1    ; 0x0000000000000000
        mov     ecx, [this]
        psubd   mm1, mm6    ; 0x0000000100000001
        movd    mm3, [sq]
        psrlq   mm1, 32
        psllq   mm1, mm3
        pxor    mm6, [ecx].m_Inc.m_PieceBB ; occupied -> empty
    }
    // input:  mm1 queens
    //         mm6 emptySquares
    // output: mm0 queenAttacks
    //============================
    __asm
    {
        // up/down Kogge-Stone
        movq    mm2, mm1    ; generator upg
        movq    mm4, mm1    ; generator dng
        movq    mm5, mm6    ; propagator upp
        movq    mm7, mm6    ; propagator dnp
        psllq   mm2, 8      ; upg << 8
        psrlq   mm4, 8      ; dng >> 8
        psllq   mm5, 8      ; upp << 8
        psrlq   mm7, 8      ; dnp >> 8
        pand    mm2, mm6    ; upp & (upg << 8)
        pand    mm4, mm6    ; dnp & (dng >> 8)
        por     mm2, mm1    ; upg |= upp & (upg << 8)
        por     mm4, mm1    ; dng |= dnp & (dng >> 8)
        pand    mm5, mm6    ; upp &= upp << 8
        pand    mm7, mm6    ; dnp &= dnp >> 8
        movq    mm0, mm2    ; upg
        movq    mm3, mm4    ; dng
        psllq   mm2, 16     ; upg << 16
        psrlq   mm4, 16     ; dng >> 16
        pand    mm2, mm5    ; upp & (upg << 16)
        pand    mm4, mm7    ; dnp & (dng >> 16)
        por     mm2, mm0    ; upg |= upp & (upg << 16)
        por     mm4, mm3    ; dng |= dnp & (dng >> 16)
        movq    mm0, mm5    ; upp
        movq    mm3, mm7    ; dnp
        psllq   mm5, 16     ; upp << 16
        psrlq   mm7, 16     ; dnp >> 16
        pand    mm5, mm0    ; upp &= p << 16
        movq    mm0, mm2    ; upg
        pand    mm7, mm3    ; dnp &= p >> 16
        movq    mm3, mm4    ; dng
        psrlq   mm4, 32     ; dng >> 32
        psllq   mm0, 32     ; upg << 32
        pand    mm0, mm5    ; upp & (upg << 32)
        // right with x^(x-2)
        pcmpeqd mm5, mm5    ; 0xffffffffffffffff -1
        pand    mm4, mm7    ; dnp & (dng >> 32)
        pcmpeqd mm7, mm7    ; 0xffffffffffffffff -1
        por     mm0, mm2    ; upg |= upp & (upg << 32)
        pxor    mm5, mm6    ; not empty ==> occupied
        por     mm4, mm3    ; dng |= dnp & (dng >> 32)
        por     mm5, mm1    ; force queens as subset of occupied
        psllq   mm0, 8      ; final shift up ==> queenAttacks
        psrlq   mm4, 8      ; final shift dn
        movq    mm3, mm5    ; occupied
        psubb   mm5, mm1    ; occupied - rooks
        psubb   mm5, mm1    ; occupied - 2*rooks
        por     mm0, mm4    ; queenAttacks |= down
        pxor    mm4, mm4    ; 0x0000000000000000
        psubb   mm4, mm7    ; 0x0101010101010101 0-(-1)
        pxor    mm5, mm3    ; right := occupied ^ (occupied - 2*rooks)
        // diagonals and left dumb7fill
        movq    mm2, mm1    ; leftup
        psubb   mm7, mm4    ; 0xfefefefefefefefe notA 0xff-0x01
        por     mm0, mm5    ; queenAttacks |= right
        movq    mm5, mm7    ; 0xfefefefefefefefe notA
        pand    mm2, mm7    ; clear left occupied or a file
        psrlq   mm5, 1      ; 0x7f7f7f7f7f7f7f7f notH
        movq    mm3, mm2    ; leftdown
        pand    mm1, mm5    ; clear right occupied or h file
        pand    mm7, mm6    ; to clear left occupied or a-file
        pand    mm6, mm5    ; to clear right occupied or h-file
        movq    mm4, mm1    ; rightdown
        movq    mm5, mm2    ; left
        // 1. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
        pand    mm1, mm6    ; clear rightup occupied or h file
        pand    mm4, mm6    ; clear rightdown occupied or h file
        pand    mm2, mm7    ; clear leftup occupied or a file
        pand    mm3, mm7    ; clear leftdown occupied or a file
        pand    mm5, mm7    ; clear left occupied or a file
        // 2. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
        pand    mm1, mm6    ; clear rightup occupied or h file
        pand    mm4, mm6    ; clear rightdown occupied or h file
        pand    mm2, mm7    ; clear leftup occupied or a file
        pand    mm3, mm7    ; clear leftdown occupied or a file
        pand    mm5, mm7    ; clear left occupied or a file
        // 3. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
        pand    mm1, mm6    ; clear rightup occupied or h file
        pand    mm4, mm6    ; clear rightdown occupied or h file
        pand    mm2, mm7    ; clear leftup occupied or a file
        pand    mm3, mm7    ; clear leftdown occupied or a file
        pand    mm5, mm7    ; clear left occupied or a file
        // 4. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
        pand    mm1, mm6    ; clear rightup occupied or h file
        pand    mm4, mm6    ; clear rightdown occupied or h file
        pand    mm2, mm7    ; clear leftup occupied or a file
        pand    mm3, mm7    ; clear leftdown occupied or a file
        pand    mm5, mm7    ; clear left occupied or a file
        // 5. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
        pand    mm1, mm6    ; clear rightup occupied or h file
        pand    mm4, mm6    ; clear rightdown occupied or h file
        pand    mm2, mm7    ; clear leftup occupied or a file
        pand    mm3, mm7    ; clear leftdown occupied or a file
        pand    mm5, mm7    ; clear left occupied or a file
        // 6. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
        pand    mm1, mm6    ; clear rightup occupied or h file
        pand    mm4, mm6    ; clear rightdown occupied or h file
        pand    mm2, mm7    ; clear leftup occupied or a file
        pand    mm3, mm7    ; clear leftdown occupied or a file
        pand    mm5, mm7    ; clear left occupied or a file
        // 7. fill diagonals and left
        psllq   mm1, 9      ; rightup
        psrlq   mm4, 7      ; rightdown
        psllq   mm2, 7      ; leftup
        psrlq   mm3, 9      ; leftdown
        psrlq   mm5, 1      ; left
        por     mm0, mm1    ; queenAttacks |= rightup
        por     mm0, mm4    ; queenAttacks |= rightdown
        por     mm0, mm2    ; queenAttacks |= leftup
        por     mm0, mm3    ; queenAttacks |= leftdown
        por     mm0, mm5    ; queenAttacks |= left
    }
    __asm
    {
        pswapd  mm1, mm0
        movd    eax, mm0
        movd    edx, mm1
    }
}
Last modified: Thu, 15 Apr 21 08:11:13 -0700