Author: Vincent Diepeveen
Date: 19:07:42 06/04/02
On June 04, 2002 at 20:34:20, Robert Hyatt wrote:

>On June 04, 2002 at 19:05:43, Vincent Diepeveen wrote:
>
>>On June 04, 2002 at 15:34:38, Sean Mintz wrote:
>>
>>this is all useless. first you must put it to the mmx registers
>>and at the same time your program can't use floating point (which
>>for chessprograms won't be a problem but for my other software this
>>is a major problem). but that extra instruction to get it to mmx
>>then a context switch and another extra instruction to get it back to
>>the normal registers.
>>
>>That's a waste of time.
>
>Note that you only have to do this _once_. Then you can leave it alone
>since nobody is using the FP hardware in that process. When you context-switch
>to another application, you have thousands of times as much overhead
>in doing _other_ things (flushing cache, etc) as you do in resetting the FP
>to normal mode.

But for business applications where you DO use the FPU all the time,
this is straight hell to use within the same application.

>>In short it is only helpful if you program in assembly and see the
>>registers as something extra to use.
>
>Or if the compiler suddenly becomes "aware"...

No, it won't anytime soon. It will simply create more overhead using MMX.
It only gets useful when they add a bunch of instructions, enough that
you simply don't need to transfer results from the MMX registers to the
e?x registers.

>>As soon as you get to the point where you have mixed data issues,
>>such as something in eax which you want to use to combine with mm1
>>then you have major problems as you gotta use extra instruction to
>>get eax into mm1.
>
>That is ok. Compare the time to do that to the hundreds of clock cycles
>needed to get it from memory.

Not a chance. If you do not use the MM? registers, it is already in one
of the 44 renaming registers of the K7 rather than in memory.

>>Now suppose you manage to get them to run independently, the question
>>which is there then is, how do you parallel time this all?
>>
>>Because the mmx instructions can get executed at a different speed
>>than the normal register instructions.
>>
>>you don't want to already get a result from eax before the previous
>>instructions in eax have finished.
>>
>>you don't want to already zobrist hash this piece into mm1 before
>>you know for sure that the current hashing has been written to eax:edx
>>
>>Getting this all to work into a C program is *not* trivial.
>>
>>In fact it'll slow down one's program. We must wait till the hammer
>>to be able to do more useful things i fear.
>>
>>>I was talking w/ Aaron Gordon and he found some interesting stuff in the intel c
>>>compiler guide about ''intrinsics''.
>>>---------------
>>>''The major benefit of using intrinsics is that you now have access to key
>>>features that are not available using conventional coding practices. Intrinsics
>>>enable you to code with the syntax of C function calls and variables instead of
>>>assembly language. Most MMX™ technology, Streaming SIMD Extensions, and
>>>Streaming SIMD Extensions 2 intrinsics have a corresponding C intrinsic that
>>>implements that instruction directly. This frees you from managing registers and
>>>enables the compiler to optimize the instruction scheduling.
>>>
>>>The MMX technology and Streaming SIMD Extension instructions use the following
>>>new features:
>>>
>>>New Registers--Enable packed data of up to 128 bits in length for optimal SIMD
>>>processing.
>>>
>>>New Data Types--Enable packing of up to 16 elements of data in one register.''
>>>---------------
>>>Here are the data types:
>>>---------------
>>>''__m64 Data Type
>>>The __m64 data type is used to represent the contents of an MMX register, which
>>>is the register that is used by the MMX technology intrinsics. The __m64 data
>>>type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one
>>>64-bit value.
>>>
>>>__m128 Data Types
>>>The __m128 data type is used to represent the contents of a Streaming SIMD
>>>Extension register used by the Streaming SIMD Extension intrinsics. The __m128
>>>data type can hold four 32-bit floating values.
>>>
>>>The __m128d data type can hold two 64-bit floating-point values.
>>>
>>>The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two
>>>64-bit integer values.
>>>
>>>The compiler aligns __m128 local and global data to 16-byte boundaries on the
>>>stack. To align integer, float, or double arrays, you can use the declspec
>>>statement.''
>>>---------------
>>>Prototypes for these intrinsics and some related macros and constants are in the
>>>header file xmmintrin.h.
>>>
>>>I think it'd be interesting to see if any speedup can be achieved by using these
>>>data types. Can anyone run some tests to find out? It would seem to me that if
>>>we can hold 64 bit values (using __m64) then we should see a 2x speedup in some
>>>cases. Hope this helps some people.
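
To make the mixed-register point concrete, here is a minimal sketch of a zobrist-style key update written two ways: once in plain C, where the compiler keeps the 64-bit key in the normal registers, and once with intrinsics, where the operands have to be moved into a SIMD register and the result moved back out. The function names hash_toggle_plain and hash_toggle_simd are made up for illustration. The sketch assumes SSE2 and the emmintrin.h header rather than the MMX __m64 type discussed above, which sidesteps the shared-FPU-state problem; where SSE2 is not available it falls back to the plain XOR. It shows the shape of the transfer overhead only and makes no claim about timings on any particular CPU.

```c
#include <stdint.h>

/* Assumption: SSE2 is detected via the usual compiler macros. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_xor_si128, ... */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Plain C: the 64-bit key stays in general-purpose registers
   (a register pair on 32-bit x86). */
static uint64_t hash_toggle_plain(uint64_t key, uint64_t piece_key)
{
    return key ^ piece_key;
}

/* Intrinsics version: both operands are moved into SIMD registers,
   XORed there, and the low 64 bits are moved back out. */
static uint64_t hash_toggle_simd(uint64_t key, uint64_t piece_key)
{
#if HAVE_SSE2
    __m128i k = _mm_loadl_epi64((const __m128i *)&key);        /* in  */
    __m128i p = _mm_loadl_epi64((const __m128i *)&piece_key);  /* in  */
    __m128i r = _mm_xor_si128(k, p);                           /* the actual work */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, r);                      /* out */
    return out;
#else
    /* Fallback when SSE2 is not available: same result, no SIMD. */
    return key ^ piece_key;
#endif
}
```

Both functions return the same value for the same inputs, and toggling the same piece key twice gives back the original key. The difference is only in how many moves surround the single XOR, which is the overhead being argued about in this thread.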
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.