Author: David Rasmussen
Date: 04:38:36 01/21/03
What I meant was that I was hoping for x86 (Athlon XP, primarily) versions of _all_ or most of the simple inline functions below, since it seems that MSVC and Intel generate horrible code (function calls for shifting etc.!) for these fundamental functions. They still manage to be a lot faster than gcc, Borland and Sun for some reason. For example: (see below)

On January 19, 2003 at 14:12:35, Gerd Isenberg wrote:

>>
>>(well these are some constants, but maybe they're interesting to some)
>>const BitBoard lightSquares = 0x55aa55aa | (BitBoard(0x55aa55aa) << 32);
>>const BitBoard darkSquares = ~lightSquares;
>>const BitBoard center = 0x18000000 | BitBoard(0x00000018) << 32;
>>
>>//INLINE BitBoard Mask(Square square) { return BitBoard(1) << square; }

This code is two loads (to load the bitboard) and a function call(!) on MSVC and Intel. So, what is a fast assembly function to do the same? I am sure it can be done faster.

>>INLINE BitBoard Mask(Square square) { return mask[square]; }

>>INLINE BitBoard RankMask(Rank rank) { return rankMask[rank]; }

rankMask is (as expected) a mask of all 1's on the relevant rank, so it's 11111111 shifted rank*8 bits to the left, if rank is zero-indexed. This can probably be done faster than a memory lookup too, if it's not left in the hands of MSVC and Intel, which would probably do the shift with a function call. So, again: a faster assembly function should be possible. Please help, assembly programmers!

>>INLINE BitBoard FileMask(File file) { return fileMask[file]; }

I don't know if this can be done fast. It's two shifts of two 32-bit words. Help?

>>INLINE void Set(BitBoard& bitboard, Square square)
>>{ bitboard |= Mask(square); }

etc.

>>INLINE void Clear(BitBoard& bitboard, Square square)
>>{ bitboard &= ~Mask(square); }

etc.

>>INLINE void Toggle(BitBoard& bitboard, BitBoard mask)
>>{ bitboard ^= mask; }
>>

etc.
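For what it's worth, the 64-bit `BitBoard(1) << square` that the compilers turn into a shift-helper call can be written as two explicit 32-bit halves in portable C++ — a sketch of the idea only, with names of my own choosing, not code from the engine:

```cpp
#include <cstdint>

typedef uint64_t BitBoard;

// Sketch: Mask/RankMask without a table lookup or a compiler
// shift-helper call. The 64-bit shift is split into two 32-bit
// halves, which is exactly the work a hand-written x86 routine
// would do in two registers. (Function names are mine.)
inline BitBoard MaskSketch(int square) {
    uint32_t lo = (uint32_t)(square < 32) << (square & 31);  // low dword bit
    uint32_t hi = (uint32_t)(square >= 32) << (square & 31); // high dword bit
    return (BitBoard)lo | ((BitBoard)hi << 32);
}

inline BitBoard RankMaskSketch(int rank) {
    return (BitBoard)0xFF << (rank * 8);  // all eight squares of one rank
}
```

Whether this beats the `mask[square]` lookup depends on cache pressure; the point is only that no call and no branch is needed.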
>>INLINE Square Rotate45Left(Square square) { return rotate45Left[square]; }
>>INLINE Square Rotate45Right(Square square) { return rotate45Right[square]; }
>>INLINE Square Rotate90Left(Square square) { return rotate90Left[square]; }
>>INLINE Square UnRotate45Left(Square square) { return unrotate45Left[square]; }

Maybe something can be done for these too that is faster than a memory lookup? In general, I would like assembly functions for all of the inline functions above and below that are faster than their originals on Intel and MSVC. Pretty please? Mask() is the most used of these, so a fast formulation of that would gain some speed for sure.

>>INLINE Square UnRotate45Right(Square square) { return unrotate45Right[square]; }
>>INLINE Square UnRotate90Left(Square square) { return unrotate90Left[square]; }
>>
>>INLINE void SetOccupied(Position& pos, Square square, Color color)
>>{
>>  Set(pos.occupied[color],square);
>>  Set(pos.occupiedRotate90Left,Rotate90Left(square));
>>  Set(pos.occupiedRotate45Left,Rotate45Left(square));
>>  Set(pos.occupiedRotate45Right,Rotate45Right(square));
>>}
>>
>>INLINE void ClearOccupied(Position &pos, Square square, Color color)
>>{
>>  Clear(pos.occupied[color],square);
>>  Clear(pos.occupiedRotate90Left,Rotate90Left(square));
>>  Clear(pos.occupiedRotate45Left,Rotate45Left(square));
>>  Clear(pos.occupiedRotate45Right,Rotate45Right(square));
>>}
>>
>>INLINE int RankShift(Square square) { return rankShift[square]; }
>>INLINE int FileShift(Square square) { return fileShift[square]; }
>>
>>INLINE int DiagonalShiftA1H8(Square square)
>>{ return diagonalShiftA1H8[square]; }
>>INLINE int DiagonalShiftH1A8(Square square)
>>{ return diagonalShiftH1A8[square]; }
>>
>>INLINE BitBoard Occupied(const Position &pos)
>>{ return pos.occupied[WHITE] | pos.occupied[BLACK]; }
>>
>>INLINE BitBoard FileAttack(const Position& pos, Square square)
>>{
>>  return fileAttack
>>    [square]
>>    [int((pos.occupiedRotate90Left >> FileShift(square)) & 0xFF)];
>>}
>>
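On the 90-degree case: if one assumes the usual square = rank*8 + file encoding, a counterclockwise quarter-turn of the square index is just a few shifts and masks, so it could in principle replace the table. The convention here is assumed for illustration and need not match the actual rotate90Left[] table:

```cpp
// Arithmetic 90-degree counterclockwise rotation of a square index,
// assuming square = rank*8 + file and the mapping
// (rank, file) -> (file, 7 - rank). This convention is assumed for
// illustration only; the engine's rotate90Left[] table may differ.
inline int Rotate90LeftSketch(int square) {
    int rank = square >> 3;
    int file = square & 7;
    return (file << 3) | (7 - rank);
}
```

The 45-degree rotations are not this regular (diagonal lengths vary), which is why they are usually left as table lookups.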
>>INLINE BitBoard RankAttack(const Position& pos, Square square)
>>{
>>  return rankAttack
>>    [square]
>>    [int((Occupied(pos) >> RankShift(square)) & 0xFF)];
>>}
>>
>>INLINE BitBoard DiagonalAttackA1H8(const Position& pos, Square square)
>>{
>>  return diagonalAttackA1H8
>>    [square]
>>    [int((pos.occupiedRotate45Right >>
>>          DiagonalShiftA1H8(square)) & 0xFF)];
>>}
>>
>>INLINE BitBoard DiagonalAttackH1A8(const Position& pos, Square square)
>>{
>>  return diagonalAttackH1A8
>>    [square]
>>    [int((pos.occupiedRotate45Left >>
>>          DiagonalShiftH1A8(square)) & 0xFF)];
>>}
>>
>>INLINE BitBoard PawnAttack(Square square, Color color)
>>{ return pawnAttack[color][square]; }
>>
>>INLINE BitBoard KnightAttack(Square square)
>>{ return knightAttack[square]; }
>>
>>INLINE BitBoard BishopAttack(const Position &pos, Square square)
>>{ return DiagonalAttackA1H8(pos,square) | DiagonalAttackH1A8(pos,square); }
>>
>>INLINE BitBoard RookAttack(const Position& pos, Square square)
>>{ return RankAttack(pos,square) | FileAttack(pos,square); }
>>
>>INLINE BitBoard QueenAttack(const Position& pos, Square square)
>>{ return RookAttack(pos,square) | BishopAttack(pos,square); }
>>
>>INLINE BitBoard KingAttack(Square square)
>>{ return kingAttack[square]; }
>>

>Hi David,
>
>that's all fine. I used rotated with a 64*64 instead of a 64*256 array, because the
>outer occupied states don't matter; the inner six bits are enough.
>
>
>>I don't remember where I got these, I probably stole them or copied them from
>>discussions on CCC.
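On Gerd's 64*64 point: the two edge squares of a line never change the attack set on that line (a slider either reaches the edge square or is blocked before it, regardless of what stands on the edge), so only the six inner occupancy bits need to index the table. A sketch of that index computation, with names of my own:

```cpp
#include <cstdint>

typedef uint64_t BitBoard;

// Index for a 64*64 attack table instead of 64*256: skip the first
// square of the line and keep only the six inner occupancy bits.
// lineShift brings the line's first square to bit 0 (name assumed
// for illustration, analogous to RankShift/FileShift in the post).
inline int InnerSixBits(BitBoard occupied, int lineShift) {
    return (int)((occupied >> (lineShift + 1)) & 63);
}
```

This shrinks each attack table from 64*256 to 64*64 entries, which matters for L1/L2 cache footprint.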
Maybe they can be even faster:

>>
>>INLINE int FirstBit(const BitBoard bitboard)
>>{
>>  __asm
>>  {
>>    mov ecx,dword ptr [bitboard+4]
>>    mov ebx,dword ptr [bitboard]
>>    bsf eax,ecx
>>    add eax,32
>>    bsf eax,ebx
>>  }
>>}
>>
>>INLINE int LastBit(const BitBoard bitboard)
>>{
>>  __asm
>>  {
>>    mov ecx,dword ptr [bitboard]
>>    mov ebx,dword ptr [bitboard+4]
>>    bsr eax,ecx
>>    sub eax,32
>>    bsr eax,ebx
>>    add eax,32
>>  }
>>}

>Why the register loads? You can do bsf/bsr directly with memory source operands:
>
>  bsf eax,[bitboard+4]
>  xor eax, 32
>  bsf eax,[bitboard]
>

>>
>>INLINE int PopCount(BitBoard a) // MMX
>>{
>>  static const __int64 C55 = 0x5555555555555555;
>>  static const __int64 C33 = 0x3333333333333333;
>>  static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
>>
>>  __asm {
>>    movd mm0, word ptr a;
>>    punpckldq mm0, word ptr a + 4;
>>    movq mm1, mm0;
>>    psrld mm0, 1;
>>    pand mm0, [C55];
>>    psubd mm1, mm0;
>>    movq mm0, mm1;
>>    psrld mm1, 2;
>>    pand mm0, [C33];
>>    pand mm1, [C33];
>>    paddd mm0, mm1;
>>    movq mm1, mm0;
>>    psrld mm0, 4;
>>    paddd mm0, mm1;
>>    pand mm0, [C0F];
>>    pxor mm1, mm1;
>>    psadbw mm0, mm1;
>>    movd eax, mm0;
>>    emms; // femms for Athlon is faster
>// or skip emms entirely, if you don't use float
>>  }
>>}

>I found this modified one slightly faster (saves a few bytes):
>
>Regards,
>Gerd
>
>---------------------------------------------------------------
>
>struct SBitCountConsts
>{
>  BitBoard C55;
>  BitBoard C33;
>  BitBoard C0F;
>  ...
>};
>extern const SBitCountConsts BitCountConsts;
>
>__forceinline
>int PopCount (BitBoard bb)
>{
>  __asm
>  {
>    movd mm0, word ptr bb
>    punpckldq mm0, word ptr bb + 4
>    lea eax, [BitCountConsts]
>    movq mm1, mm0
>    psrld mm0, 1
>    pand mm0, [eax].C55
>    psubd mm1, mm0
>    movq mm0, mm1
>    psrld mm1, 2
>    pand mm0, [eax].C33
>    pand mm1, [eax].C33
>    paddd mm0, mm1
>    movq mm1, mm0
>    psrld mm0, 4
>    paddd mm0, mm1
>    pand mm0, [eax].C0F
>    pxor mm1, mm1
>    psadbw mm0, mm1
>    movd eax, mm0
>  }
>}

/David
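For reference, the same C55/C33/C0F reduction the MMX routine performs can be written as portable 64-bit SWAR arithmetic, with no MMX state to clean up with emms/femms (a sketch for comparison, not the engine's code):

```cpp
#include <cstdint>

typedef uint64_t BitBoard;

// Portable SWAR population count using the same 0x55/0x33/0x0F mask
// cascade as the MMX version above. The final multiply sums the
// per-byte counts into the top byte, playing the role of psadbw.
inline int PopCountSketch(BitBoard b) {
    b = b - ((b >> 1) & 0x5555555555555555ULL);                        // 2-bit counts
    b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL); // 4-bit counts
    b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                        // byte counts
    return (int)((b * 0x0101010101010101ULL) >> 56);                   // sum of bytes
}
```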
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.