Computer Chess Club Archives



Subject: Re: Intel compiler, SIMD, and Bitboards

Author: Robert Hyatt

Date: 21:18:02 06/04/02


On June 04, 2002 at 22:07:42, Vincent Diepeveen wrote:

>On June 04, 2002 at 20:34:20, Robert Hyatt wrote:
>
>>On June 04, 2002 at 19:05:43, Vincent Diepeveen wrote:
>>
>>>On June 04, 2002 at 15:34:38, Sean Mintz wrote:
>>>
>>>this is all useless. first you must move it into the mmx registers,
>>>and at the same time your program can't use floating point (which
>>>for chess programs won't be a problem, but for my other software it
>>>is a major problem). so you pay an extra instruction to get it into mmx,
>>>then a context switch, and another extra instruction to get it back to
>>>the normal registers.
>>>
>>>That's a waste of time.
>>
>>Note that you only have to do this _once_.  Then you can leave it alone
>>since nobody is using the FP hardware in that process.  When you context-
>>switch to another application, you have thousands of times as much overhead
>>in doing _other_ things (flushing cache, etc) as you do in resetting the FP
>>to normal mode.
>
>but for business applications where you DO use the FPU all the time, this
>is straight hell to use within the same application.

OK.. are we talking about chess or business applications?  Switching the
FP from FP mode to SIMD mode within a chess program needs to be done
one time.  Who cares what happens when context-switching between a chess
engine and a business application?  The FP switch there can be totally
ignored compared to the time required for the context switch.

Therefore, I don't follow what a business application discussion has
to do with anything here???



>
>>
>>
>>
>>
>>>
>>>In short it is only helpful if you program in assembly and see the
>>>registers as something extra to use.
>>
>>Or if the compiler suddenly becomes "aware"...
>
>no, it won't anytime soon. it will simply create more overhead using MMX.

How "more overhead"???

The Cray has some quirks with its large number of different kinds of
registers...  yet moving things between registers is far faster than
dealing with memory latency...



>
>it only gets useful when they add a bunch of instructions, enough that
>you simply don't need to transfer results from mmx to e?x registers.
>
>>
>>>
>>>As soon as you get to the point where you have mixed data issues,
>>>such as something in eax which you want to combine with mm1,
>>>then you have major problems, as you gotta use an extra instruction to
>>>get eax into mm1.
>>
>>That is ok.  Compare the time to do that to the hundreds of clock cycles
>>needed to get it from memory.
>
>Not a chance: if you do not use the MM? registers, it's already in one
>of the 44 rename registers of the K7.
>

Then why are you complaining about it??
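
(A sketch of the eax-to-mm1 style transfer being discussed, written with the
C intrinsics rather than assembly; the names are illustrative only.  Each
conversion compiles to a single MOVD, which is what gets compared against the
hundreds of cycles of a memory miss above.)

#include <mmintrin.h>

/* Fold a 32-bit value that lives in a general-purpose register into a
   64-bit MMX accumulator, e.g. while building a hash signature.
   The caller still has to issue _mm_empty() before any FP code runs. */
__m64 fold_in(__m64 acc, int gp_value)
{
    __m64 lo = _mm_cvtsi32_si64(gp_value);   /* MOVD mm, r32 */
    return _mm_xor_si64(acc, lo);            /* PXOR mm, mm  */
}

/* Pull the low 32 bits back out when the integer side needs them. */
int low_word(__m64 acc)
{
    return _mm_cvtsi64_si32(acc);            /* MOVD r32, mm */
}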



>>
>>>
>>>Now suppose you manage to get them to run independently; the question
>>>then is, how do you schedule all of this in parallel?
>>>
>>>Because the mmx instructions can get executed at a different speed
>>>than the normal register instructions.
>>>
>>>you don't want to read a result from eax before the previous
>>>instructions writing eax have finished.
>>>
>>>you don't want to zobrist hash this piece into mm1 before
>>>you know for sure that the current hash has been written to eax:edx.
>>>
>>>Getting all of this to work in a C program is *not* trivial.
>>>
>>>In fact it'll slow down one's program. We must wait for the Hammer
>>>to be able to do more useful things, I fear.
>>>
>>>>I was talking w/ Aaron Gordon and he found some interesting stuff in the Intel C
>>>>compiler guide about "intrinsics".
>>>>---------------
>>>>"The major benefit of using intrinsics is that you now have access to key
>>>>features that are not available using conventional coding practices. Intrinsics
>>>>enable you to code with the syntax of C function calls and variables instead of
>>>>assembly language. Most MMX(TM) technology, Streaming SIMD Extensions, and
>>>>Streaming SIMD Extensions 2 instructions have a corresponding C intrinsic that
>>>>implements that instruction directly. This frees you from managing registers and
>>>>enables the compiler to optimize the instruction scheduling.
>>>>
>>>>The MMX technology and Streaming SIMD Extension instructions use the following
>>>>new features:
>>>>
>>>>New Registers--Enable packed data of up to 128 bits in length for optimal SIMD
>>>>processing.
>>>>
>>>>New Data Types--Enable packing of up to 16 elements of data in one register."
>>>>---------------
>>>>Here are the data types:
>>>>---------------
>>>>"__m64 Data Type
>>>>The __m64 data type is used to represent the contents of an MMX register, which
>>>>is the register that is used by the MMX technology intrinsics. The __m64 data
>>>>type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one
>>>>64-bit value.
>>>>
>>>>__m128 Data Types
>>>>The __m128 data type is used to represent the contents of a Streaming SIMD
>>>>Extension register used by the Streaming SIMD Extension intrinsics. The __m128
>>>>data type can hold four 32-bit floating-point values.
>>>>
>>>>The __m128d data type can hold two 64-bit floating-point values.
>>>>
>>>>The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two
>>>>64-bit integer values.
>>>>
>>>>The compiler aligns __m128 local and global data to 16-byte boundaries on the
>>>>stack. To align integer, float, or double arrays, you can use the declspec
>>>>statement."
>>>>---------------
>>>>Prototypes for these intrinsics and some related macros and constants are in the
>>>>header file xmmintrin.h.
>>>>
>>>>I think it'd be interesting to see if any speedup can be achieved by using these
>>>>data types. Can anyone run some tests to find out? It would seem to me that if
>>>>we can hold 64 bit values (using __m64) then we should see a 2x speedup in some
>>>>cases. Hope this helps some people.
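
(To make the question above concrete, here is a rough test sketch assuming the
mmintrin.h interface quoted from the compiler guide; the array names and sizes
are made up, and whether the __m64 loop actually beats the plain 64-bit C
baseline would have to be measured.)

#include <mmintrin.h>

typedef unsigned long long bitboard;

#define NBOARDS 1024
static bitboard boards[NBOARDS];
static bitboard masks[NBOARDS];

/* Baseline: on 32-bit x86 the compiler splits each 64-bit AND into two
   32-bit operations. */
void and_all_c(void)
{
    int i;
    for (i = 0; i < NBOARDS; i++)
        boards[i] &= masks[i];
}

/* __m64 version: one PAND per bitboard. */
void and_all_mmx(void)
{
    __m64 *b = (__m64 *)boards;
    const __m64 *m = (const __m64 *)masks;
    int i;

    for (i = 0; i < NBOARDS; i++)
        b[i] = _mm_and_si64(b[i], m[i]);

    _mm_empty();   /* give the x87 register stack back before any FP code */
}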


