Author: David Rasmussen
Date: 04:38:36 01/21/03
What I meant was that I was hoping for x86 (Athlon XP, primarily) versions of _all_ or most of the simple inline functions below, since it seems that MSVC and Intel generate horrible code (function calls for shifting etc.!) for these fundamental functions. They still manage to be a lot faster than gcc, Borland and Sun for some reason. For example: (see below)

On January 19, 2003 at 14:12:35, Gerd Isenberg wrote:

>>
>>(well these are some constants, but maybe they're interesting to some)
>>const BitBoard lightSquares = 0x55aa55aa | (BitBoard(0x55aa55aa) << 32);
>>const BitBoard darkSquares = ~lightSquares;
>>const BitBoard center = 0x18000000 | BitBoard(0x00000018) << 32;
>>
>>//INLINE BitBoard Mask(Square square) { return BitBoard(1) << square; }

This code is two loads (to load the bitboard) and a function call(!) on MSVC and Intel. So, what is a fast assembly function to do the same? I am sure it can be done faster.

>>INLINE BitBoard Mask(Square square) { return mask[square]; }

>>INLINE BitBoard RankMask(Rank rank) { return rankMask[rank]; }

rankMask is (as expected) a mask of all 1's on the relevant rank, so it's 11111111 shifted rank*8 bits to the left, if rank is zero-indexed. This can probably be done faster than a memory lookup too, if it's not left in the hands of MSVC and Intel, which would probably do the shift with a function call. So, again: a faster assembly function should be possible. Please help, assembly programmers!

>>INLINE BitBoard FileMask(File file) { return fileMask[file]; }

I don't know if this can be done fast. It's two shifts of two 32-bit words. Help?

>>INLINE void Set(BitBoard& bitboard, Square square)
>>{ bitboard |= Mask(square); }

etc.

>>INLINE void Clear(BitBoard& bitboard, Square square)
>>{ bitboard &= ~Mask(square); }

etc.

>>INLINE void Toggle(BitBoard& bitboard, BitBoard mask)
>>{ bitboard ^= mask; }
>>

etc.
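For what it's worth, the 64-bit `BitBoard(1) << square` that the compilers turn into a shift-helper call can be written as two explicit 32-bit halves in portable C++ — a sketch of the idea only, with names of my own choosing, not code from the engine:

```cpp
#include <cstdint>

typedef uint64_t BitBoard;

// Sketch: Mask/RankMask without a table lookup or a compiler
// shift-helper call. The 64-bit shift is split into two 32-bit
// halves, which is exactly the work a hand-written x86 routine
// would do in two registers. (Function names are mine.)
inline BitBoard MaskSketch(int square) {
    uint32_t lo = (uint32_t)(square < 32) << (square & 31);  // low dword bit
    uint32_t hi = (uint32_t)(square >= 32) << (square & 31); // high dword bit
    return (BitBoard)lo | ((BitBoard)hi << 32);
}

inline BitBoard RankMaskSketch(int rank) {
    return (BitBoard)0xFF << (rank * 8);  // all eight squares of one rank
}
```

Whether this beats the `mask[square]` lookup depends on cache pressure; the point is only that no call and no branch is needed.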
>>INLINE Square Rotate45Left(Square square) { return rotate45Left[square]; }
>>INLINE Square Rotate45Right(Square square) { return rotate45Right[square]; }
>>INLINE Square Rotate90Left(Square square) { return rotate90Left[square]; }
>>INLINE Square UnRotate45Left(Square square) { return unrotate45Left[square]; }

Maybe something can be done for these too that is faster than a memory lookup? In general, I would like assembly functions for all of the inline functions above and below that are faster than their originals on Intel and MSVC. Pretty please? Mask() is the most used of these, so a fast formulation of that would gain some speed for sure.

>>INLINE Square UnRotate45Right(Square square) { return unrotate45Right[square]; }
>>INLINE Square UnRotate90Left(Square square) { return unrotate90Left[square]; }
>>
>>INLINE void SetOccupied(Position& pos, Square square, Color color)
>>{
>>  Set(pos.occupied[color],square);
>>  Set(pos.occupiedRotate90Left,Rotate90Left(square));
>>  Set(pos.occupiedRotate45Left,Rotate45Left(square));
>>  Set(pos.occupiedRotate45Right,Rotate45Right(square));
>>}
>>
>>INLINE void ClearOccupied(Position &pos, Square square, Color color)
>>{
>>  Clear(pos.occupied[color],square);
>>  Clear(pos.occupiedRotate90Left,Rotate90Left(square));
>>  Clear(pos.occupiedRotate45Left,Rotate45Left(square));
>>  Clear(pos.occupiedRotate45Right,Rotate45Right(square));
>>}
>>
>>INLINE int RankShift(Square square) { return rankShift[square]; }
>>INLINE int FileShift(Square square) { return fileShift[square]; }
>>
>>INLINE int DiagonalShiftA1H8(Square square)
>>{ return diagonalShiftA1H8[square]; }
>>INLINE int DiagonalShiftH1A8(Square square)
>>{ return diagonalShiftH1A8[square]; }
>>
>>INLINE BitBoard Occupied(const Position &pos)
>>{ return pos.occupied[WHITE] | pos.occupied[BLACK]; }
>>
>>INLINE BitBoard FileAttack(const Position& pos, Square square)
>>{
>>  return fileAttack
>>    [square]
>>    [int((pos.occupiedRotate90Left >> FileShift(square)) & 0xFF)];
>>}
>>
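On the 90-degree case: if one assumes the usual square = rank*8 + file encoding, a counterclockwise quarter-turn of the square index is just a few shifts and masks, so it could in principle replace the table. The convention here is assumed for illustration and need not match the actual rotate90Left[] table:

```cpp
// Arithmetic 90-degree counterclockwise rotation of a square index,
// assuming square = rank*8 + file and the mapping
// (rank, file) -> (file, 7 - rank). This convention is assumed for
// illustration only; the engine's rotate90Left[] table may differ.
inline int Rotate90LeftSketch(int square) {
    int rank = square >> 3;
    int file = square & 7;
    return (file << 3) | (7 - rank);
}
```

The 45-degree rotations are not this regular (diagonal lengths vary), which is why they are usually left as table lookups.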
>>INLINE BitBoard RankAttack(const Position& pos, Square square)
>>{
>>  return rankAttack
>>    [square]
>>    [int((Occupied(pos) >> RankShift(square)) & 0xFF)];
>>}
>>
>>INLINE BitBoard DiagonalAttackA1H8(const Position& pos, Square square)
>>{
>>  return diagonalAttackA1H8
>>    [square]
>>    [int((pos.occupiedRotate45Right >>
>>          DiagonalShiftA1H8(square)) & 0xFF)];
>>}
>>
>>INLINE BitBoard DiagonalAttackH1A8(const Position& pos, Square square)
>>{
>>  return diagonalAttackH1A8
>>    [square]
>>    [int((pos.occupiedRotate45Left >>
>>          DiagonalShiftH1A8(square)) & 0xFF)];
>>}
>>
>>INLINE BitBoard PawnAttack(Square square, Color color)
>>{ return pawnAttack[color][square]; }
>>
>>INLINE BitBoard KnightAttack(Square square)
>>{ return knightAttack[square]; }
>>
>>INLINE BitBoard BishopAttack(const Position &pos, Square square)
>>{ return DiagonalAttackA1H8(pos,square) | DiagonalAttackH1A8(pos,square); }
>>
>>INLINE BitBoard RookAttack(const Position& pos, Square square)
>>{ return RankAttack(pos,square) | FileAttack(pos,square); }
>>
>>INLINE BitBoard QueenAttack(const Position& pos, Square square)
>>{ return RookAttack(pos,square) | BishopAttack(pos,square); }
>>
>>INLINE BitBoard KingAttack(Square square)
>>{ return kingAttack[square]; }
>>

>Hi David,
>
>that's all fine. I used rotated with a 64*64 instead of a 64*256 array, because the
>outer occupied states don't matter; the inner six bits are enough.
>
>
>>I don't remember where I got these, I probably stole them or copied them from
>>discussions on CCC.
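On Gerd's 64*64 point: the two edge squares of a line never change the attack set on that line (a slider either reaches the edge square or is blocked before it, regardless of what stands on the edge), so only the six inner occupancy bits need to index the table. A sketch of that index computation, with names of my own:

```cpp
#include <cstdint>

typedef uint64_t BitBoard;

// Index for a 64*64 attack table instead of 64*256: skip the first
// square of the line and keep only the six inner occupancy bits.
// lineShift brings the line's first square to bit 0 (name assumed
// for illustration, analogous to RankShift/FileShift in the post).
inline int InnerSixBits(BitBoard occupied, int lineShift) {
    return (int)((occupied >> (lineShift + 1)) & 63);
}
```

This shrinks each attack table from 64*256 to 64*64 entries, which matters for L1/L2 cache footprint.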
Maybe they can be even faster:

>>
>>INLINE int FirstBit(const BitBoard bitboard)
>>{
>>  __asm
>>  {
>>    mov ecx,dword ptr [bitboard+4]
>>    mov ebx,dword ptr [bitboard]
>>    bsf eax,ecx
>>    add eax,32
>>    bsf eax,ebx
>>  }
>>}
>>
>>INLINE int LastBit(const BitBoard bitboard)
>>{
>>  __asm
>>  {
>>    mov ecx,dword ptr [bitboard]
>>    mov ebx,dword ptr [bitboard+4]
>>    bsr eax,ecx
>>    sub eax,32
>>    bsr eax,ebx
>>    add eax,32
>>  }
>>}

>Why the register loads? You can do bsf/bsr directly with memory source operands:
>
>  bsf eax,[bitboard+4]
>  xor eax, 32
>  bsf eax,[bitboard]
>

>>
>>INLINE int PopCount(BitBoard a) // MMX
>>{
>>  static const __int64 C55 = 0x5555555555555555;
>>  static const __int64 C33 = 0x3333333333333333;
>>  static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
>>
>>  __asm {
>>    movd mm0, word ptr a;
>>    punpckldq mm0, word ptr a + 4;
>>    movq mm1, mm0;
>>    psrld mm0, 1;
>>    pand mm0, [C55];
>>    psubd mm1, mm0;
>>    movq mm0, mm1;
>>    psrld mm1, 2;
>>    pand mm0, [C33];
>>    pand mm1, [C33];
>>    paddd mm0, mm1;
>>    movq mm1, mm0;
>>    psrld mm0, 4;
>>    paddd mm0, mm1;
>>    pand mm0, [C0F];
>>    pxor mm1, mm1;
>>    psadbw mm0, mm1;
>>    movd eax, mm0;
>>    emms; // femms for Athlon is faster
>// or skip emms entirely, if you don't use float
>>  }
>>}

>I found this modified one slightly faster (saves a few bytes):
>
>Regards,
>Gerd
>
>---------------------------------------------------------------
>
>struct SBitCountConsts
>{
>  BitBoard C55;
>  BitBoard C33;
>  BitBoard C0F;
>  ...
>};
>extern const SBitCountConsts BitCountConsts;
>
>__forceinline
>int PopCount (BitBoard bb)
>{
>  __asm
>  {
>    movd mm0, word ptr bb
>    punpckldq mm0, word ptr bb + 4
>    lea eax, [BitCountConsts]
>    movq mm1, mm0
>    psrld mm0, 1
>    pand mm0, [eax].C55
>    psubd mm1, mm0
>    movq mm0, mm1
>    psrld mm1, 2
>    pand mm0, [eax].C33
>    pand mm1, [eax].C33
>    paddd mm0, mm1
>    movq mm1, mm0
>    psrld mm0, 4
>    paddd mm0, mm1
>    pand mm0, [eax].C0F
>    pxor mm1, mm1
>    psadbw mm0, mm1
>    movd eax, mm0
>  }
>}

/David
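For reference, the same C55/C33/C0F reduction the MMX routine performs can be written as portable 64-bit SWAR arithmetic, with no MMX state to clean up with emms/femms (a sketch for comparison, not the engine's code):

```cpp
#include <cstdint>

typedef uint64_t BitBoard;

// Portable SWAR population count using the same 0x55/0x33/0x0F mask
// cascade as the MMX version above. The final multiply sums the
// per-byte counts into the top byte, playing the role of psadbw.
inline int PopCountSketch(BitBoard b) {
    b = b - ((b >> 1) & 0x5555555555555555ULL);                        // 2-bit counts
    b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL); // 4-bit counts
    b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                        // byte counts
    return (int)((b * 0x0101010101010101ULL) >> 56);                   // sum of bytes
}
```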
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.