Computer Chess Club Archives


Subject: Re: Assembly Programmers Challenge! (repost and clarification)

Author: Gerd Isenberg

Date: 15:22:00 01/21/03


On January 21, 2003 at 15:27:26, David Rasmussen wrote:

>What I was hoping for was x86 (Athlon XP, primarily) functions for _all_
>or most of the simple inline functions below, since it seems that MSVC and Intel
>generate horrible code (function calls for shifting etc.!) for these
>fundamental functions. They still manage to be a lot faster than gcc, Borland
>and Sun for some reason.
>
>For example: (see below)
>
>On January 19, 2003 at 14:12:35, Gerd Isenberg wrote:
>
>>>
>>>(well these are some constants, but maybe they're interesting to some)
>>>const BitBoard lightSquares = 0x55aa55aa | (BitBoard(0x55aa55aa) << 32);
>>>const BitBoard darkSquares = ~lightSquares;
>>>const BitBoard center = 0x18000000 | BitBoard(0x00000018) << 32;
>>>
>>>//INLINE BitBoard Mask(Square square) { return BitBoard(1) << square; }
>
>This code is two loads (loads the bitboard) and a function call(!) on MSVC and
>Intel. So, what is a fast assembly function to do the same? I am sure it can be
>done faster.

Hi David,

yes, see
http://www.talkchess.com/forums/1/message.html?278421

With MSC there is no way to skip the unnecessary store/load prefix with inlined
asm functions, so small lookup tables in C are probably the fastest.
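
A minimal sketch of the lookup-table approach, assuming BitBoard is an unsigned 64-bit integer and Square is a plain int in 0..63 (neither typedef is shown in this thread). The table name mask matches the quoted Mask() code; InitMask is a hypothetical helper name:

```cpp
#include <cstdint>

typedef std::uint64_t BitBoard;  // assumption: 64-bit unsigned integer
typedef int Square;              // assumption: square index 0..63

static BitBoard mask[64];        // one single-bit board per square

// Hypothetical helper: fill the table once at program start.
void InitMask()
{
    for (int sq = 0; sq < 64; ++sq)
        mask[sq] = BitBoard(1) << sq;
}

// The lookup replaces the 64-bit shift with a single indexed load.
inline BitBoard Mask(Square square) { return mask[square]; }
```

The table costs 512 bytes, so it should stay cache-resident in a search that touches it constantly.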


>
>>>INLINE BitBoard Mask(Square square) { return mask[square]; }
>>>INLINE BitBoard RankMask(Rank rank) { return rankMask[rank]; }
>
>rankmask is (as expected) a mask of all 1's at the relevant rank, so it's
>11111111 shifted rank*8 times to the left, if rank is zero-indexed. This can
>probably be done faster than a memory lookup too, if it's not put in the hands
>of MSVC and Intel, which would probably just do the shift with a function call.
>So, again: A faster assembly function should be possible. Please help, assembly
>programmers!
>
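
For comparison with the table lookup, the shift described above can be sketched in C++ as follows. This assumes BitBoard is an unsigned 64-bit integer and Rank is a zero-indexed int; RankMaskShift is a hypothetical name, and whether it beats the lookup depends on how the compiler handles 64-bit shifts on a 32-bit target:

```cpp
#include <cstdint>

typedef std::uint64_t BitBoard;  // assumption: 64-bit unsigned integer
typedef int Rank;                // assumption: zero-indexed rank, 0..7

// Eight set bits (one full rank) moved up by rank * 8 positions.
inline BitBoard RankMaskShift(Rank rank)
{
    return BitBoard(0xFF) << (rank * 8);
}
```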


>Pretty please?

Either learn assembler yourself or wait for Hammer ;-)


>
<snip>

>I'm sure it's fine, but what I would like is x86 assembly functions for these
>instead of C++ functions since the compilers I've tried generate lousy code.
>I'm asking all you assembly programmers to help me (and others making bitboard
>programs on x86), because I'm not much of an assembly programmer myself.

See the link above.

>
>For example:
>
>>>I don't remember where I got these, I probably stole them or copied them from
>>>discussions on CCC. Maybe they can be even faster:
>>>
INLINE int FirstBit(const BitBoard bitboard)
{
	__asm
	{
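		bsf		eax,[bitboard+4]	; lowest set bit of the high dword (0..31)
		xor		eax, 32			; same as +32, since bit 5 is clear
		bsf		eax,[bitboard]		; overwrites eax only if the low dword is nonzero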
	}
}

INLINE int LastBit(const BitBoard bitboard)
{
	__asm
	{
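		bsr eax,[bitboard]		; highest set bit of the low dword (0..31)
		sub eax,32			; pre-compensate for the final add
		bsr eax,[bitboard+4]		; overwrites eax only if the high dword is nonzero
		add eax,32			; high-dword index + 32, or the low-dword index restored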
	}
}
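
Both routines depend on bsf/bsr leaving the destination register untouched when the source is zero. As far as I know AMD documents that behaviour, while Intel's manuals call the destination undefined in that case, so treat it as an assumption about the target CPU. The intended logic, written as a portable C++ sketch (the ...Portable and Bsf32/Bsr32 names are hypothetical, BitBoard is assumed to be an unsigned 64-bit integer, and the argument must be nonzero):

```cpp
#include <cstdint>

typedef std::uint64_t BitBoard;  // assumption: 64-bit unsigned integer

// Index of the lowest set bit of a nonzero 32-bit word (stands in for bsf).
static int Bsf32(std::uint32_t x)
{
    int i = 0;
    while (!(x & 1)) { x >>= 1; ++i; }
    return i;
}

// Index of the highest set bit of a nonzero 32-bit word (stands in for bsr).
static int Bsr32(std::uint32_t x)
{
    int i = 0;
    while (x >>= 1) ++i;
    return i;
}

// Precondition: bitboard != 0. Prefer the low half, else high half + 32.
inline int FirstBitPortable(BitBoard bitboard)
{
    const std::uint32_t lo = std::uint32_t(bitboard);
    const std::uint32_t hi = std::uint32_t(bitboard >> 32);
    return lo ? Bsf32(lo) : Bsf32(hi) + 32;
}

// Precondition: bitboard != 0. Prefer the high half + 32, else the low half.
inline int LastBitPortable(BitBoard bitboard)
{
    const std::uint32_t lo = std::uint32_t(bitboard);
    const std::uint32_t hi = std::uint32_t(bitboard >> 32);
    return hi ? Bsr32(hi) + 32 : Bsr32(lo);
}
```

The loops are only there to keep the sketch compiler-independent; they are useful as a reference for checking the asm versions, not as a replacement for them.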
>>
<snip>
>
>I have no idea. I didn't make those functions, I just stole them or borrowed
>them. So if they can be optimized, then please do! But it's not only these
>assembly language functions that I want you to look at, I want functions for
>some or all of the above inline functions that are now in C++.
>

Simply replace them.

>>
>>
>>>
>>>INLINE int PopCount(BitBoard a) // MMX
>>>{
>>>    static const __int64 C55 = 0x5555555555555555;
>>>    static const __int64 C33 = 0x3333333333333333;
>>>    static const __int64 C0F = 0x0F0F0F0F0F0F0F0F;
>>>
>>>    __asm {
>>>        movd            mm0, word ptr a;
>>>        punpckldq       mm0, word ptr a + 4;
>>>        movq            mm1, mm0;
>>>        psrld           mm0, 1;
>>>        pand            mm0, [C55];
>>>        psubd           mm1, mm0;
>>>        movq            mm0, mm1;
>>>        psrld           mm1, 2;
>>>        pand            mm0, [C33];
>>>        pand            mm1, [C33];
>>>        paddd           mm0, mm1;
>>>        movq            mm1, mm0;
>>>        psrld           mm0, 4;
>>>        paddd           mm0, mm1;
>>>        pand            mm0, [C0F];
>>>        pxor            mm1, mm1;
>>>        psadbw mm0, mm1;
>>>        movd            eax, mm0;
>>>        emms;  femms for athlon is faster
>>//  or skip emms at all, if you don't use float
>>>    }
>>>}
>>
>>
>>I found this modified one slightly faster (saves a few bytes):
>>
>>Regards,
>>Gerd
>>
>>---------------------------------------------------------------
>>
>>struct SBitCountConsts
>>{
>>	BitBoard C55;
>>	BitBoard C33;
>>	BitBoard C0F;
>>        ...
>>};
>>extern const SBitCountConsts BitCountConsts;
>>
>>__forceinline
>>int PopCount (BitBoard bb)
>>{
>>	__asm
>>	{
>>		movd  mm0, word ptr bb
>>		punpckldq mm0, word ptr bb + 4
>>		lea   eax, [BitCountConsts]
>>		movq  mm1, mm0
>>		psrld mm0, 1
>>		pand  mm0, [eax].C55
>>		psubd mm1, mm0
>>		movq  mm0, mm1
>>		psrld mm1, 2
>>		pand  mm0, [eax].C33
>>		pand  mm1, [eax].C33
>>		paddd mm0, mm1
>>		movq  mm1, mm0
>>		psrld mm0, 4
>>		paddd mm0, mm1
>>		pand  mm0, [eax].C0F
>>		pxor  mm1, mm1
>>		psadbw mm0, mm1
>>		movd  eax, mm0
>>	}
>>}
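
For readers who do not follow the MMX, the same reduction can be sketched in plain C++. This assumes BitBoard is an unsigned 64-bit integer; PopCountPortable is a hypothetical name, and the final multiply is a swapped-in replacement for psadbw, not what the asm does:

```cpp
#include <cstdint>

typedef std::uint64_t BitBoard;  // assumption: 64-bit unsigned integer

inline int PopCountPortable(BitBoard a)
{
    const BitBoard C55 = 0x5555555555555555ULL;
    const BitBoard C33 = 0x3333333333333333ULL;
    const BitBoard C0F = 0x0F0F0F0F0F0F0F0FULL;

    a = a - ((a >> 1) & C55);          // 2-bit fields: count of each bit pair
    a = (a & C33) + ((a >> 2) & C33);  // 4-bit fields: count of each nibble
    a = (a + (a >> 4)) & C0F;          // 8-bit fields: count of each byte
    // Sum the eight byte counts into the top byte (psadbw in the MMX version).
    return int((a * 0x0101010101010101ULL) >> 56);
}
```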
>
>OK, I will try it (if I can understand how to use it)
>

The struct SBitCountConsts part is plain C. You simply have to initialize the
struct somewhere in a C file with the appropriate constants. For the rest, simply
replace the asm body.
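
A minimal sketch of that initialization, assuming BitBoard is an unsigned 64-bit integer and using the C55/C33/C0F values from the first PopCount version. The members elided with "..." in the struct above are not known and are left out here:

```cpp
#include <cstdint>

typedef std::uint64_t BitBoard;  // assumption: 64-bit unsigned integer

struct SBitCountConsts
{
    BitBoard C55;
    BitBoard C33;
    BitBoard C0F;
    // the further members elided in the post are omitted in this sketch
};

// Header: declaration seen by every file that inlines PopCount.
extern const SBitCountConsts BitCountConsts;

// Exactly one source file: the definition, in member order.
const SBitCountConsts BitCountConsts =
{
    0x5555555555555555ULL,  // C55: every other bit
    0x3333333333333333ULL,  // C33: every other bit pair
    0x0F0F0F0F0F0F0F0FULL   // C0F: low nibble of every byte
};
```

Keeping the three constants in one struct is what lets the asm address them all off a single lea.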

Gerd

>Please help me! I suck at assembler!
>
>/David




Last modified: Thu, 15 Apr 21 08:11:13 -0700
