Computer Chess Club Archives


Subject: Re: Help! Visual C++ intrinsics! (Nalimov, are you around here somewhere?)

Author: Gerd Isenberg

Date: 13:00:12 10/01/03


On October 01, 2003 at 14:36:25, Anthony Cozzie wrote:

>I personally (as the author of a bitboard engine) find MMX/SSE to offer a lot of
>possibilities.  Much of the time they are of little use, as the data is used
>immediately after computation, but if it is not, we can save quite a bit: 8
>extra registers, and more throughput as well.  Plus, I like the feeling of using
>all the resources of the machine ;)

Hi Anthony,

Absolutely the same for me.

>
>I have been experimenting with the use of intrinsics instead of inline assembly.

I have never tried MMX intrinsics so far, due to the "strange" intrinsic types.
Therefore I'm very interested in your results.


> I feel that intrinsics can offer several advantages over inline assembly:
>  1. Compiler can interleave other instructions with MMX code (big)
>  2. More portable: can typedef things, and in general only have to write it
>once (as opposed to 1 time for GCC/GAS/ATT, and 1 time for VC++/Intel)
>  3. On platforms without MMX, the code can easily be converted to the standard
>64 bit "fake integer" operations.
>
>In other words, a given piece of code can be written once instead of three
>times, and be faster to boot. Unfortunately, this is the theory.
>

Intrinsics are my "hope" for using SSE2 on AMD64 ;-)

I'm thinking about a Kogge-Stone C source generator, including SSE2 intrinsics,
that can produce (combined) attack routines for given source and target
structures. The generator could be adjustable to emit pure C++, pure SSE2
intrinsics, or properly scheduled intermediate forms, especially using SSE2 for
right rank attacks via eight bytewise x ^ (x - 2) operations.
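
A rough sketch of the bytewise rank-attack idea, under several assumptions that are not from the post: bit 0 of each byte is the a-file (so "right" means toward higher bits), there is at most one slider per rank, and the shorthand x ^ (x - 2) is read as the general form o ^ (o - 2r) with o the occupancy and r the slider bit. The function name east_rank_attacks and the macro RANK_ATTACKS_USE_SSE2 are hypothetical.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define RANK_ATTACKS_USE_SSE2 1
#endif

// occ:     all occupied squares
// sliders: rooks/queens, at most one per rank (assumption of this sketch)
// Returns the squares attacked toward higher bits within each rank,
// including the first blocker.
inline std::uint64_t east_rank_attacks(std::uint64_t occ, std::uint64_t sliders) {
    occ |= sliders;  // the slider itself must be part of the occupancy
#ifdef RANK_ATTACKS_USE_SSE2
    __m128i o = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&occ));
    __m128i r = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&sliders));
    __m128i two_r = _mm_add_epi8(r, r);  // bytewise 2r, an h-file slider wraps to 0
    __m128i att = _mm_xor_si128(o, _mm_sub_epi8(o, two_r));  // o ^ (o - 2r) per byte
    std::uint64_t out;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&out), att);
    return out;
#else
    // Portable fallback: bytewise subtraction inside one 64-bit word (SWAR),
    // arranged so that borrows never cross a byte boundary.
    const std::uint64_t H = 0x8080808080808080ULL;
    std::uint64_t two_r = (sliders & ~H) << 1;  // bytewise 2r
    std::uint64_t diff = ((occ | H) - (two_r & ~H)) ^ ((occ ^ ~two_r) & H);
    return occ ^ diff;
#endif
}
```

For example, with a slider on a1 and a blocker on e1 (occ = 0x11, sliders = 0x01) the result is 0x1E, that is b1 through e1. Ranks without a slider contribute nothing, because o ^ (o - 0) is zero.
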


>Problem 1: MSVC++ seems to do a horrible job at generating assembly
>
>First, I wrote some code that used 10 intrinsic variables in one function.  I
>had already figured out how to do this A) without any register spilling (1
>load/variable) and B) using a minimum of movq mm0, mm1 type stuff. (4
>instructions of this type). When I gave the intrinsic version to MSVC++, it
>generated about 20 of these wasteful instructions.  It also generated the
>following little gem:
>
>004021FF  por         mm5,mm0
>00402202  movq        mm0,mm5
>
>Which could easily be replaced by the _single instruction_ por mm0, mm5

Hmm...

>
>Problem 2: MSVC++ insists on moving the data into the intrinsic variables
>
>example:
>		bitboard_to_intrinsic(ibbxai, ibbxa);
>0040215E  mov         eax,dword ptr [esp+4]
>00402162  mov         dword ptr [esp+34h],ecx
>...
>
>Basically, any time I want to use their intrinsics, it means I have to first
>copy the data from the stack to another place on the stack, and then load it
>into the MMX registers, which obviously defeats the whole purpose of the
>optimization.  I have a suspicion that this is because the __m64 datatype is
>defined with __declspec(align(8)).  Has anyone tried intrinsics and run into
>these before?  Does GCC/Intel C do a better job?  It seems to me like this is
>_really_ bad code generation: anyone who is using intrinsics is going to be
>doing it for performance, and this clearly is slow. I am using MSVC 7.1.
>
>Anthony

That all sounds very strange.

Have you tried aligned unions (or structs of unions) of __int64 and __m64,
perhaps as globals or static class members, accessed via pointer to load
and store the MMX registers?
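
Something along these lines is what I have in mind, as an untested sketch. The names BitboardCell, AttackTables, g_tables and combine are made up for illustration, and std::uint64_t stands in for __int64. Reading one union member after writing another is formally undefined in standard C++; the sketch assumes a compiler that supports it as an extension, which GCC documents and MSVC is generally understood to allow. Where MMX is not detected, it falls back to plain 64-bit code.

```cpp
#include <cstdint>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define BB_HAVE_MMX 1
#endif

// One 8-byte aligned cell that can be viewed as an integer or as an MMX value.
union alignas(8) BitboardCell {
    std::uint64_t bits;
#ifdef BB_HAVE_MMX
    __m64 mmx;
#endif
};

// Hypothetical struct of unions, kept as a global so no stack copy is needed.
struct AttackTables {
    BitboardCell rooks;
    BitboardCell bishops;
    BitboardCell combined;
};

static AttackTables g_tables;

// Access via pointer: operands are read from and written to memory in place.
inline void combine(AttackTables* t) {
#ifdef BB_HAVE_MMX
    t->combined.mmx = _mm_or_si64(t->rooks.mmx, t->bishops.mmx);
    _mm_empty();  // leave MMX state before any later x87 floating-point code
#else
    t->combined.bits = t->rooks.bits | t->bishops.bits;
#endif
}
```

Whether this actually avoids the extra stack-to-stack copy under MSVC 7.1 is something only the generated assembly can tell.
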

Could there be some interaction with float (x87) code?

Hopefully these redundant copies and the lousy code generation are the result
of a skipped or not-yet-available optimization pass ;-)

Gerd



Last modified: Thu, 15 Apr 21 08:11:13 -0700
