Computer Chess Club Archives


Subject: Help! Visual C++ intrinsics! (Nalimov, are you around here somewhere?)

Author: Anthony Cozzie

Date: 11:36:25 10/01/03


As the author of a bitboard engine, I personally find that MMX/SSE offer a lot
of possibilities.  Much of the time they are of little use, because the data is
consumed immediately after it is computed, but when it is not, we can save quite
a bit: 8 extra registers, and more throughput as well.  Plus, I like the feeling
of using all the resources of the machine ;)

I have been experimenting with the use of intrinsics instead of inline assembly.
I feel that intrinsics can offer several advantages over inline assembly:
  1. The compiler can interleave other instructions with the MMX code (big)
  2. More portable: you can typedef things, and in general only have to write
the code once (as opposed to once for GCC/GAS/AT&T syntax and once for
VC++/Intel)
  3. On platforms without MMX, the code can easily be converted to the standard
64 bit "fake integer" operations.
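
Points 2 and 3 can be sketched roughly as follows. This is only an illustration, not code from my engine: USE_MMX is a hypothetical build switch, and bb_vec, the BB_* macros and bb_merge_clear are made-up names. The MMX branch uses only the standard <mmintrin.h> intrinsics; without the switch, the same source compiles to plain 64 bit integer operations.

```c
#include <stdint.h>

/* USE_MMX is a hypothetical build switch, not a standard macro. */
#if defined(USE_MMX)
#include <mmintrin.h>
typedef __m64 bb_vec;
/* _mm_set_pi32(high, low) builds a 64-bit value from two 32-bit halves. */
#define BB_FROM_U64(x)  _mm_set_pi32((int)((uint64_t)(x) >> 32), (int)(uint64_t)(x))
#define BB_LOW32(v)     ((uint32_t)_mm_cvtsi64_si32(v))
#define BB_TO_U64(v)    (((uint64_t)BB_LOW32(_mm_srli_si64((v), 32)) << 32) | BB_LOW32(v))
#define BB_OR(a, b)     _mm_or_si64((a), (b))
#define BB_ANDNOT(a, b) _mm_andnot_si64((a), (b))   /* (~a) & b */
#define BB_DONE()       _mm_empty()                 /* leave MMX state */
#else
typedef uint64_t bb_vec;
#define BB_FROM_U64(x)  ((uint64_t)(x))
#define BB_TO_U64(v)    (v)
#define BB_OR(a, b)     ((a) | (b))
#define BB_ANDNOT(a, b) (~(a) & (b))                /* (~a) & b */
#define BB_DONE()       ((void)0)
#endif

/* Hypothetical example routine: (occupied | extra) with the mask bits cleared.
   The body is written once and works with either definition of bb_vec. */
static uint64_t bb_merge_clear(uint64_t occupied, uint64_t extra, uint64_t mask)
{
    bb_vec v = BB_OR(BB_FROM_U64(occupied), BB_FROM_U64(extra));
    uint64_t result;

    v = BB_ANDNOT(BB_FROM_U64(mask), v);
    result = BB_TO_U64(v);
    BB_DONE();
    return result;
}
```

The conversions through BB_FROM_U64 and BB_TO_U64 are chosen for portability of the sketch, not for speed; how well any given compiler turns them into loads and stores is a separate question.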

In other words, a given piece of code can be written once instead of three
times, and be faster to boot. Unfortunately, that is only the theory.

Problem 1: MSVC++ seems to do a horrible job of generating assembly

First, I wrote a function that uses 10 intrinsic variables.  I had already
worked out by hand how to do this A) without any register spilling (1 load per
variable) and B) with a minimum of register-to-register moves of the
movq mm0, mm1 type (4 such instructions).  When I gave the intrinsic version to
MSVC++, it generated about 20 of these wasteful moves.  It also generated the
following little gem:

004021FF  por         mm5,mm0
00402202  movq        mm0,mm5

which could easily be replaced by the _single instruction_ por mm0, mm5
(assuming the value left in mm5 is not needed afterwards).
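
To make the comparison concrete, here is a toy model of the two registers in plain C. It is only a sketch: mmx_regs and both function names are made up for illustration, and the operand order follows the Intel-syntax listing above (destination first). Both versions leave the same value in mm0; they differ only in what ends up in mm5.

```c
#include <stdint.h>

/* Toy model of the two MMX registers involved (illustration only). */
typedef struct {
    uint64_t mm0;
    uint64_t mm5;
} mmx_regs;

/* The two-instruction sequence emitted by the compiler. */
static mmx_regs compiler_sequence(mmx_regs r)
{
    r.mm5 |= r.mm0;   /* por  mm5, mm0 */
    r.mm0 = r.mm5;    /* movq mm0, mm5 */
    return r;
}

/* The single-instruction replacement. */
static mmx_regs single_por(mmx_regs r)
{
    r.mm0 |= r.mm5;   /* por  mm0, mm5 */
    return r;
}
```

With mm0 = 0x0F and mm5 = 0xF0 on entry, both versions leave 0xFF in mm0. The compiler's sequence also overwrites mm5 with 0xFF, while the single por leaves mm5 at 0xF0, so the replacement is only valid when the old or new contents of mm5 do not matter afterwards.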

Problem 2: MSVC++ insists on moving the data into the intrinsic variables

example:
		bitboard_to_intrinsic(ibbxai, ibbxa);
0040215E  mov         eax,dword ptr [esp+4]
00402162  mov         dword ptr [esp+34h],ecx
...

Basically, any time I want to use their intrinsics, I first have to copy the
data from the stack to another place on the stack, and only then load it into
the MMX registers, which obviously defeats the whole purpose of the
optimization.  I have a suspicion that this is because the __m64 datatype is
defined with __declspec(align(8)).  Has anyone tried intrinsics and run into
these problems before?  Does GCC or Intel C do a better job?  It seems to me
like this is _really_ bad code generation: anyone who is using intrinsics is
going to be doing it for performance, and this clearly is slow. I am using
MSVC 7.1.

Anthony



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.