Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Reverse Bitboards

Author: Gerd Isenberg

Date: 11:54:13 01/14/03

Go up one level in this thread


On January 14, 2003 at 12:52:58, Matt Taylor wrote:

>On January 13, 2003 at 20:17:36, Russell Reagan wrote:
>
>>On January 13, 2003 at 18:30:05, Matt Taylor wrote:
>>
>>>I think the real bottleneck would be the misjudgement of the speed of MMX. It is
>>>not as fast to respond as the integer units, though it maintains similar
>>>throughput. Using MMX for 64-bit arithmetic is not worthwhile as the same
>>>operations are available from the integer unit with lower setup costs. The only
>>>advantages include a minor gain in parallelism in hand-tweaked code and
>>>additional register space.
>>
>>Apparently if you use MMX correctly, it will be significantly faster than the
>>corresponding routine written in C (if it relies on 64-bit operations). The
>>primary example that comes to mind is that Gerd uses MMX in IsiChess to do
>>64-bit operations in the KoggeStone algorithms. He said it gave him a small
>>speed increase. Compare that with the same routines written in C, and the C
>>routines will be significantly slower. I know this because I wrote a program
>>using those routines in C and it got about 70 knps (compare with Crafty
>>300-500knps), and all it did was alpha-beta, material + mobility eval, and
>>nothing else. I tried several bitboard implementations, and the common factor in
>>the slow ones was the C KoggeStone attack generation. Gerd didn't experience
>>such a significant speed hit when he used his MMX routines. So it does appear
>>that there is a misjudgement of the speed of using MMX, but I'm not sure whether
>>it is an underestimation or overestimation.
>
>MMX is probably faster than straight C in some cases, but if you write the
>64-bit stuff in assembly using the main integer instructions, it will almost
>always be faster.

Hi Matt,

No - not on 32 bit hardware, or do you already mention Athlon64?
Fill-Stuff like KoggeStone and dumb7fill is one of these cases. It gains a lot
from available 64-bit or better 128-bit registers and doing parallel fills in
several directions.

One may use MMX- or hammers gp-registers for propagator computation (based on
empty squares) and SSE2 for a doubled generator, eg. for black/white or to get
two disjoint attack sets for sliders simultaniously.


>The latency of an ALU instruction
>(bitwise/arithmatic/conditional) is 1, and it has been ever since the 486. The
>latency for similar arithmatic MMX instructions on my Athlon is 2 clocks, and on
>a Pentium 4 it is 2 or worse. On the same processors, you can do 64-bit
>operations usually in 1 clock.


Yes, but my current getQueenAttacks, based on KoggeStone, dumb7fill and x^x-2
has an effective instruction latency of 0.5 cycles (up to four MMX-instruction
per time!).

>
>The only advantage to MMX is the extra registers you now have access to, but in
>my experiences code rarely saturates more than one of the 3 instruction sets
>(integer, FP, vector). Furthermore, movement of data between MMX registers and
>integers is horrifically slow, and if you mix with floating-point, you have to
>execute another slow instruction -- emms.
>

Yes, but chess engines don't use floating point instructions so much, i guess.
The vector path movd is really slow, i'll hope this becomes better with hammer.
Currently i use a final movq via aligned memory.


>I think greater performance can be achieved in hand-tweaked, purely-integer
>assembly. Unfortunately I do not have time right now to prove that theory, but
>if I ever get a chance, I will be sure to post some code.
>
>-Matt

Ok, try this one. My first, not optimal MMX-approach:

Cheers,
Gerd

------------------------------------------------------------------------------

BitBoard CNode::getQueenAttacks (UINT sq) const
{
	__asm
	{
		pcmpeqd mm6, mm6	; 0xffffffffffffffff
		pxor	mm1, mm1	; 0x0000000000000000
		mov	ecx, [this]
		psubd	mm1, mm6	; 0x0000000100000001
		movd	mm3, [sq]
		psrlq	mm1, 32
		psllq	mm1, mm3
		pxor	mm6, [ecx].m_Inc.m_PieceBB	; occupied -> empty
	}
	// input:   mm1 queens
	//	    mm6 emptySquares
	// output:  mm0 queenAttacks
	//============================
	__asm
	{	// up/down Kogge-Stone
		movq	mm2, mm1	; generator upg
		movq	mm4, mm1	; generator dng
		movq	mm5, mm6	; propagator upp
		movq	mm7, mm6	; propagator dnp
		psllq	mm2, 8		; upg << 8
		psrlq	mm4, 8		; dng >> 8
		psllq	mm5, 8		; upp << 8
		psrlq	mm7, 8		; dnp >> 8
		pand	mm2, mm6	; upp & (upg << 8)
		pand	mm4, mm6	; dnp & (dng >> 8)
		por	mm2, mm1	; upg |= upp & (upg << 8)
		por	mm4, mm1	; dng |= dnp & (dng >> 8)
		pand	mm5, mm6	; upp &= upp << 8
		pand	mm7, mm6	; dnp &= dnp >> 8
		movq	mm0, mm2	; upg
		movq	mm3, mm4	; dng
		psllq	mm2, 16		; upg << 16
		psrlq	mm4, 16		; dng >> 16
		pand	mm2, mm5	; upp & (upg << 16)
		pand	mm4, mm7	; dnp & (dng >> 16)
		por	mm2, mm0	; upg |= upp & (upg << 16)
		por	mm4, mm3	; dng |= dnp & (dng >> 16)
		movq	mm0, mm5	; upp
		movq	mm3, mm7	; dnp
		psllq	mm5, 16		; upp << 16
		psrlq	mm7, 16		; dnp >> 16
		pand	mm5, mm0	; upp &= p << 16
		movq	mm0, mm2	; upg
		pand	mm7, mm3	; dnp &= p >> 16
		movq	mm3, mm4	; dng
		psrlq	mm4, 32		; dng >> 32
		psllq	mm0, 32		; upg << 32
		pand	mm0, mm5	; upp & (upg << 32)
		// right with x^(x-2)
		pcmpeqd mm5, mm5	; 0xffffffffffffffff -1
		pand	mm4, mm7	; dnp & (dng >> 32)
		pcmpeqd mm7, mm7	; 0xffffffffffffffff -1
		por	mm0, mm2	; upg |= upp & (upg << 32)
		pxor	mm5, mm6	; not empty ==> occupied
		por	mm4, mm3	; dng |= dnp & (dng >> 32)
		por	mm5, mm1	; force queens as subset of occupied
		psllq	mm0, 8		; final shift up ==> queenAttacks
		psrlq	mm4, 8		; final shift dn
		movq	mm3, mm5	; occupied
		psubb	mm5, mm1	; occupied -   rooks
		psubb	mm5, mm1	; occupied - 2*rooks
		por	mm0, mm4	; queenAttacks |= down
		pxor	mm4, mm4	; 0x0000000000000000
		psubb	mm4, mm7	; 0x0101010101010101 0-(-1)
		pxor	mm5, mm3    ; right := occupied ^ (occupied - 2*rooks)
		// diagonals and left dumb7fill
		movq	mm2, mm1	; leftup
		psubb	mm7, mm4	; 0xfefefefefefefefe notA 0xff-0x01
		por	mm0, mm5	; queenAttacks |= right
		movq	mm5, mm7	; 0xfefefefefefefefe notA
		pand	mm2, mm7	; clear left occupied or a file
		psrlq	mm5, 1		; 0x7f7f7f7f7f7f7f7f notH
		movq	mm3, mm2	; leftdown
		pand	mm1, mm5	; clear right occupied or h file
		pand	mm7, mm6	; to clear left occupied or a-file
		pand	mm6, mm5	; to clear right occupied or h-file
		movq	mm4, mm1	; rightdown
		movq	mm5, mm2	; left
		// 1. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
		pand	mm1, mm6	; clear rightup occupied or h file
		pand	mm4, mm6	; clear rightdown occupied or h file
		pand	mm2, mm7	; clear leftup occupied or a file
		pand	mm3, mm7	; clear leftdown occupied or a file
		pand	mm5, mm7	; clear left occupied or a file
		// 2. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
		pand	mm1, mm6	; clear rightup occupied or h file
		pand	mm4, mm6	; clear rightdown occupied or h file
		pand	mm2, mm7	; clear leftup occupied or a file
		pand	mm3, mm7	; clear leftdown occupied or a file
		pand	mm5, mm7	; clear left occupied or a file
		// 3. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
		pand	mm1, mm6	; clear rightup occupied or h file
		pand	mm4, mm6	; clear rightdown occupied or h file
		pand	mm2, mm7	; clear leftup occupied or a file
		pand	mm3, mm7	; clear leftdown occupied or a file
		pand	mm5, mm7	; clear left occupied or a file
		// 4. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
		pand	mm1, mm6	; clear rightup occupied or h file
		pand	mm4, mm6	; clear rightdown occupied or h file
		pand	mm2, mm7	; clear leftup occupied or a file
		pand	mm3, mm7	; clear leftdown occupied or a file
		pand	mm5, mm7	; clear left occupied or a file
		// 5. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
		pand	mm1, mm6	; clear rightup occupied or h file
		pand	mm4, mm6	; clear rightdown occupied or h file
		pand	mm2, mm7	; clear leftup occupied or a file
		pand	mm3, mm7	; clear leftdown occupied or a file
		pand	mm5, mm7	; clear left occupied or a file
		// 6. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
		pand	mm1, mm6	; clear rightup occupied or h file
		pand	mm4, mm6	; clear rightdown occupied or h file
		pand	mm2, mm7	; clear leftup occupied or a file
		pand	mm3, mm7	; clear leftdown occupied or a file
		pand	mm5, mm7	; clear left occupied or a file
		// 7. fill diagonals and left
		psllq	mm1, 9		; rightup
		psrlq	mm4, 7		; rightdown
		psllq	mm2, 7		; leftup
		psrlq	mm3, 9		; leftdown
		psrlq	mm5, 1		; left
		por	mm0, mm1	; queenAttacks |= rightup
		por	mm0, mm4	; queenAttacks |= rightdown
		por	mm0, mm2	; queenAttacks |= leftup
		por	mm0, mm3	; queenAttacks |= leftdown
		por	mm0, mm5	; queenAttacks |= left
	}
	__asm
	{
		pswapd	mm1, mm0
		movd	eax, mm0
		movd	edx, mm1
	}
}




This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.