Author: Vincent Diepeveen
Date: 05:50:37 06/05/02
On June 05, 2002 at 00:18:02, Robert Hyatt wrote:

>On June 04, 2002 at 22:07:42, Vincent Diepeveen wrote:
>
>>On June 04, 2002 at 20:34:20, Robert Hyatt wrote:
>>
>>>On June 04, 2002 at 19:05:43, Vincent Diepeveen wrote:
>>>
>>>>On June 04, 2002 at 15:34:38, Sean Mintz wrote:
>>>>
>>>>this is all useless. first you must put it to the mmx registers
>>>>and at the same time your program can't use floating point (which
>>>>for chessprograms won't be a problem but for my other software this
>>>>is a major problem). but that extra instruction to get it to mmx
>>>>then a context switch and another extra instruction to get it back to
>>>>the normal registers.
>>>>
>>>>That's a waste of time.
>>>
>>>Note that you only have to do this _once_. Then you can leave it alone
>>>since nobody is using the FP hardware in that process. When you context-
>>>switch to another application, you have thousands of times as much overhead
>>>in doing _other_ things (flushing cache, etc) as you do in resetting the FP
>>>to normal mode.
>>
>>but for business applications where you DO have FPU all the time, this
>>is straight hell to use within the same application.
>
>OK.. are we talking about chess or business applications? Switching the
>FP from FP mode to SIMD mode within a chess program needs to be done
>one time. Who cares what happens when context-switching between a chess
>engine and a business application? The FP switch there can be totally
>ignored compared to the time required for the context switch.

it is too much effort to rewrite my program to assembly. i assume you
don't want to fully rewrite Crafty to assembly either. for business
applications that eat system time, however, no effort can be spared.

>Therefore, I don't follow what a business application discussion has
>to do with anything here???
>
>>
>>>
>>>>
>>>>In short it is only helpful if you program in assembly and see the
>>>>registers as something extra to use.
>>>
>>>Or if the compiler suddenly becomes "aware"...
>>
>>no it won't soon. it will simply create more overhead using MMX.
>
>How "more overhead"???
>
>The Cray has some quirks with its large numbers of different kinds of
>registers... yet moving things between registers is far faster than
>dealing with memory latency...
>
>>
>>it only get useful when they add a bunch of instructions. That much that
>>you simply don't need to transfer results from mmx to e?x registers.
>>
>>>
>>>>
>>>>As soon as you get to the point where you have mixed data issues,
>>>>such as something in eax which you want to use to combine with mm1
>>>>then you have major problems as you gotta use extra instruction to
>>>>get eax into mm1.
>>>
>>>That is ok. Compare the time to do that to the hundreds of clock cycles
>>>needed to get it from memory.
>>
>>Not a chance, it's already in one of the 44 renaming registers
>>of the K7 instead if you do not use the MM? registers.
>>
>
>Then why are you complaining about it??
>
>>>
>>>>
>>>>Now suppose you manage to get them to run independantly, the question
>>>>which is there then is, how do you parallel time this all?
>>>>
>>>>Because the mmx instructions can get executed at a different speed
>>>>than the normal register instructions.
>>>>
>>>>you don't want to already get a result from eax before the previous
>>>>instructions in eax have finished.
>>>>
>>>>you don't want to already zobrist hash this piece into mm1 before
>>>>you know sure that the current hashing has been written to eax:edx
>>>>
>>>>Getting this all to work into a C program is *not* trivial.
>>>>
>>>>In fact it'll slow down once program. We must wait till the hammer
>>>>to be able to do more useful things i fear.
>>>>
>>>>>I was talking w/ Aaron Gordon and he found some interesting stuff in the intel c
>>>>>compiler guide about ''intrinsics''.
>>>>>---------------
>>>>>''The major benefit of using intrinsics is that you now have access to key
>>>>>features that are not available using conventional coding practices. Intrinsics
>>>>>enable you to code with the syntax of C function calls and variables instead of
>>>>>assembly language. Most MMX(TM) technology, Streaming SIMD Extensions, and
>>>>>Streaming SIMD Extensions 2 intrinsics have a corresponding C intrinsic that
>>>>>implements that instruction directly. This frees you from managing registers and
>>>>>enables the compiler to optimize the instruction scheduling.
>>>>>
>>>>>The MMX technology and Streaming SIMD Extension instructions use the following
>>>>>new features:
>>>>>
>>>>>New Registers--Enable packed data of up to 128 bits in length for optimal SIMD
>>>>>processing.
>>>>>
>>>>>New Data Types--Enable packing of up to 16 elements of data in one register.''
>>>>>---------------
>>>>>Here are the data types:
>>>>>---------------
>>>>>''__m64 Data Type
>>>>>The __m64 data type is used to represent the contents of an MMX register, which
>>>>>is the register that is used by the MMX technology intrinsics. The __m64 data
>>>>>type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one
>>>>>64-bit value.
>>>>>
>>>>>__m128 Data Types
>>>>>The __m128 data type is used to represent the contents of a Streaming SIMD
>>>>>Extension register used by the Streaming SIMD Extension intrinsics. The __m128
>>>>>data type can hold four 32-bit floating values.
>>>>>
>>>>>The __m128d data type can hold two 64-bit floating-point values.
>>>>>
>>>>>The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two
>>>>>64-bit integer values.
>>>>>
>>>>>The compiler aligns __m128 local and global data to 16-byte boundaries on the
>>>>>stack. To align integer, float, or double arrays, you can use the declspec
>>>>>statement.''
>>>>>---------------
>>>>>Prototypes for these intrinsics and some related macros and constants are in the
>>>>>header file xmmintrin.h.
>>>>>
>>>>>I think it'd be interesting to see if any speedup can be achieved by using these
>>>>>data types. Can anyone run some tests to find out? It would seem to me that if
>>>>>we can hold 64 bit values (using __m64) then we should see a 2x speedup in some
>>>>>cases. Hope this helps some people.
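
For anyone who wants to run the test asked for above, here is a minimal sketch of the two things argued about in this thread: a 64-bit XOR (the kind used for zobrist hashing) done in a SIMD register, and the extra instructions needed to move a 32-bit value from a normal register into a SIMD register and back. It is only a sketch under assumptions: it uses the SSE2 __m128i type from emmintrin.h rather than __m64, because the SSE registers are separate from the x87/MMX registers and so no FP mode switch is involved; the function names xor64_simd, roundtrip32 and check_simd are made up for illustration and come from no chess program.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_xor_si128, ... */
#include <stdint.h>

/* Hypothetical helper: XOR two 64-bit keys in an SSE2 register.
   Note the loads and the store needed to get data in and out of
   the SIMD register; that is the transfer cost discussed above. */
static uint64_t xor64_simd(uint64_t a, uint64_t b)
{
    __m128i va = _mm_loadl_epi64((const __m128i *)&a); /* low 64 bits <- a */
    __m128i vb = _mm_loadl_epi64((const __m128i *)&b); /* low 64 bits <- b */
    __m128i vr = _mm_xor_si128(va, vb);                /* 128-bit XOR      */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, vr);             /* low 64 bits -> out */
    return out;
}

/* Hypothetical helper: move a 32-bit value from a general register
   into a SIMD register and straight back again. Each direction is
   one extra move instruction. */
static uint32_t roundtrip32(uint32_t x)
{
    __m128i v = _mm_cvtsi32_si128((int)x);   /* general register -> xmm */
    return (uint32_t)_mm_cvtsi128_si32(v);   /* xmm -> general register */
}

/* Sanity check: the SIMD path must agree with plain C. */
static void check_simd(void)
{
    uint64_t hash = 0x0123456789ABCDEFULL;
    uint64_t key  = 0xFEDCBA9876543210ULL;
    assert(xor64_simd(hash, key) == (hash ^ key));
    assert(roundtrip32(12345u) == 12345u);
}
```

Calling check_simd() from main only confirms the results match. To see whether there is any speedup you would still have to time xor64_simd against a plain `a ^ b` inside a real search loop, since the compiler may already keep the 64-bit value in registers.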
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.