Author: Anthony Cozzie
Date: 11:36:25 10/01/03
I personally (as the author of a bitboard engine) find MMX/SSE to offer a lot of possibilities. Much of the time they are of little use, as the data is used immediately after computation, but if it is not, we can save quite a bit: 8 extra registers, and more throughput as well. Plus, I like the feeling of using all the resources of the machine ;)

I have been experimenting with the use of intrinsics instead of inline assembly. I feel that intrinsics can offer several advantages over inline assembly:

1. Compiler can interleave other instructions with MMX code (big)
2. More portable: can typedef things, and in general only have to write it once (as opposed to 1 time for GCC/GAS/ATT, and 1 time for VC++/Intel)
3. On platforms without MMX, the code can easily be converted to the standard 64 bit "fake integer" operations.

In other words, a given piece of code can be written once instead of three times, and be faster to boot.

Unfortunately, this is the theory.

Problem 1: MSVC++ seems to do a horrible job at generating assembly

First, I wrote some code that used 10 intrinsic variables in one function. I had already figured out how to do this A) without any register spilling (1 load/variable) and B) using a minimum of movq mm0, mm1 type stuff (4 instructions of this type). When I gave the intrinsic version to MSVC++, it generated about 20 of these wasteful instructions. It also generated the following little gem:

    004021FF por mm5,mm0
    00402202 movq mm0,mm5

Which could easily be replaced by the _single instruction_

    por mm0, mm5

Problem 2: MSVC++ insists on moving the data into the intrinsic variables

example: bitboard_to_intrinsic(ibbxai, ibbxa);

    0040215E mov eax,dword ptr [esp+4]
    00402162 mov dword ptr [esp+34h],ecx
    ...

Basically, any time I want to use their intrinsics, it means I have to first copy the data from the stack to another place on the stack, and then load it into the MMX registers, which obviously defeats the whole purpose of the optimization.
I have a suspicion that this is because the __m64 datatype is defined with __declspec(align(8)).

Has anyone tried intrinsics and run into these before? Does GCC/Intel C do a better job? It seems to me like this is _really_ bad code generation: anyone who is using intrinsics is going to be doing it for performance, and this clearly is slow. I am using MSVC 7.1.

Anthony
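
For reference, here is a minimal sketch of the "write it once" idea from points 2 and 3 above: one typedef and a handful of wrappers, with the MMX intrinsics behind a compile-time switch and plain 64-bit integer operations as the fallback. The names (USE_MMX, ibb, ibb_load, ibb_store, ibb_or, ibb_andnot, ibb_done, attack_targets) are made up for illustration and are not the ones from the code discussed above; the intrinsics themselves are the standard ones from mmintrin.h. It only shows the source-level structure and says nothing about what any particular compiler will generate from it.

```c
#include <assert.h>   /* only needed for the usage check shown after the block */
#include <stdint.h>

typedef uint64_t bitboard;

#if defined(USE_MMX)              /* hypothetical build switch */
#include <mmintrin.h>

typedef __m64 ibb;                /* "intrinsic bitboard" lives in an MMX register */

/* Build the 64-bit value from two 32-bit halves (works on 32-bit targets). */
static inline ibb ibb_load(bitboard b)
{
    return _mm_set_pi32((int)(uint32_t)(b >> 32), (int)(uint32_t)b);
}

/* Extract low half, then shift the high half down and extract it too. */
static inline bitboard ibb_store(ibb x)
{
    uint32_t lo = (uint32_t)_mm_cvtsi64_si32(x);
    uint32_t hi = (uint32_t)_mm_cvtsi64_si32(_mm_srli_si64(x, 32));
    return ((bitboard)hi << 32) | lo;
}

#define ibb_or(a, b)      _mm_or_si64((a), (b))
#define ibb_andnot(a, b)  _mm_andnot_si64((a), (b))   /* (~a) & b */
#define ibb_done()        _mm_empty()                 /* emms: release the FPU state */

#else                             /* portable path: plain 64-bit integers */

typedef bitboard ibb;

#define ibb_load(b)       (b)
#define ibb_store(x)      (x)
#define ibb_or(a, b)      ((a) | (b))
#define ibb_andnot(a, b)  (~(a) & (b))
#define ibb_done()        ((void)0)

#endif

/* Written once against the wrappers: squares attacked by either set
   that are not occupied by our own pieces. */
static bitboard attack_targets(bitboard att_a, bitboard att_b, bitboard own)
{
    ibb a = ibb_load(att_a);
    ibb b = ibb_load(att_b);
    ibb o = ibb_load(own);
    bitboard result = ibb_store(ibb_andnot(o, ibb_or(a, b)));
    ibb_done();
    return result;
}
```

With the fallback path, attack_targets(0xF0, 0x0F00000000000000, 0x30) gives 0x0F000000000000C0; the MMX path is meant to compute the same value.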
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.