Author: Gerd Isenberg
Date: 13:00:12 10/01/03
On October 01, 2003 at 14:36:25, Anthony Cozzie wrote:

>I personally (as the author of a bitboard engine) find MMX/SSE to offer a lot of
>possibilities. Much of the time they are of little use, as the data is used
>immediately after computation, but if it is not, we can save quite a bit: 8
>extra registers, and more throughput as well. Plus, I like the feeling of using
>all the resources of the machine ;)

Hi Anthony,
absolutely the same for me.

>
>I have been experimenting with the use of intrinsics instead of inline assembly.

I have never tried MMX intrinsics so far, due to the "strange" intrinsic types. Therefore I'm very interested in your results.

> I feel that intrinsics can offer several advantages over inline assembly:
> 1. Compiler can interleave other instructions with MMX code (big)
> 2. More portable: can typedef things, and in general only have to write it
>once (as opposed to 1 time for GCC/GAS/ATT, and 1 time for VC++/Intel)
> 3. On platforms without MMX, the code can easily be converted to the standard
>64 bit "fake integer" operations.
>
>In other words, a given piece of code can be written once instead of three
>times, and be faster to boot. Unfortunately, this is the theory.
>

Intrinsics are my "hope" for using SSE2 on AMD64 ;-)
I'm thinking about a Kogge-Stone C source generator, including SSE2 intrinsics, with which I can produce (combined) attack routines with some source and target structures. The generator may be adjustable to produce pure C++ or pure SSE2 intrinsics with properly scheduled intermediates, especially using SSE2 for right rank attacks via eight bytewise x ^ (x - 2) operations.

>Problem 1: MSVC++ seems to do a horrible job at generating assembly
>
>First, I wrote some code that used 10 intrinsic variables in one function. I
>had already figured out how to do this A) without any register spilling (1
>load/variable) and B) using a minimum of movq mm0, mm1 type stuff. (4
>instructions of this type). When I gave the intrinsic version to MSVC++, it
>generated about 20 of these wasteful instructions. It also generated the
>following little gem:
>
>004021FF por mm5,mm0
>00402202 movq mm0,mm5
>
>Which could easily be replaced by the _single instruction_ por mm0, mm5

Hmm...

>
>Problem 2: MSVC++ insists on moving the data into the intrinsic variables
>
>example:
> bitboard_to_intrinsic(ibbxai, ibbxa);
>0040215E mov eax,dword ptr [esp+4]
>00402162 mov dword ptr [esp+34h],ecx
>...
>
>Basically, any time I want to use their intrinsics, it means I have to first
>copy the data from the stack to another place on the stack, and then load it
>into the MMX registers, which obviously defeats the whole purpose of the
>optimization. I have a suspicion that this is because the __m64 datatype is
>defined with __declspec(align(8)). Has anyone tried intrinsics and run into
>these before? Does GCC/Intel C do a better job? It seems to me like this is
>_really_ bad code generation: anyone who is using intrinsics is going to be
>doing it for performance, and this clearly is slow. I am using MSVC 7.1.
>
>Anthony

That all sounds very strange. Have you tried aligned (structs of) unions of __int64 and __m64, probably as globals or static class members, accessed via pointer to load and store some MMX registers? Any float (x87) interactions?
Hopefully these redundant copies and the lousy code generation are the result of a skipped or not yet available optimization pass ;-)

Gerd
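
A minimal sketch of the bytewise rank-attack idea mentioned in the reply, under a few assumptions that are not stated in the post: the shorthand is read as o ^ (o - 2r), with o the occupancy and r the rook set; each byte of the 64-bit board holds one rank; and attacks run toward the higher bit of each byte (which board direction that is depends on the square mapping). The names Bitboard, rankAttacksScalar, rankAttacksSSE2 and rankAttacks are hypothetical. The same routine is written once with SSE2 intrinsics and once in plain C++, roughly in the spirit of the "write it once, fall back to integers" argument quoted above.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

typedef std::uint64_t Bitboard;   // hypothetical name: one byte per rank

// Portable reference: o ^ (o - 2r), computed one rank (byte) at a time.
inline Bitboard rankAttacksScalar(Bitboard occ, Bitboard rooks)
{
    assert((rooks & ~occ) == 0);  // assumption: sliders are part of the occupancy
    Bitboard result = 0;
    for (int rank = 0; rank < 8; ++rank) {
        unsigned o = static_cast<unsigned>(occ   >> (8 * rank)) & 0xFFu;
        unsigned r = static_cast<unsigned>(rooks >> (8 * rank)) & 0xFFu;
        unsigned a = (o ^ (o - 2u * r)) & 0xFFu;   // mask emulates 8-bit wraparound
        result |= static_cast<Bitboard>(a) << (8 * rank);
    }
    return result;
}

#if HAVE_SSE2
// SSE2 version: all eight ranks at once with bytewise add/sub.
inline Bitboard rankAttacksSSE2(Bitboard occ, Bitboard rooks)
{
    __m128i o  = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&occ));
    __m128i r  = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&rooks));
    __m128i r2 = _mm_add_epi8(r, r);                      // bytewise 2r
    __m128i a  = _mm_xor_si128(o, _mm_sub_epi8(o, r2));   // o ^ (o - 2r)
    Bitboard out;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&out), a);
    return out;
}
#endif

// Dispatcher: picks the SSE2 path when the compiler targets it.
inline Bitboard rankAttacks(Bitboard occ, Bitboard rooks)
{
#if HAVE_SSE2
    return rankAttacksSSE2(occ, rooks);
#else
    return rankAttacksScalar(occ, rooks);
#endif
}
```

For occ = 0x4401 and rooks = 0x0401, both paths give 0x78FE. A plain 64-bit subtraction would give 0x7FFE instead, because the borrow from the open first rank leaks into the second one. That is why the subtraction has to be bytewise.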
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.