Computer Chess Club Archives

Subject: Re: Intel compiler, SIMD, and Bitboards

Author: Vincent Diepeveen
Date: 05:50:37 06/05/02
On June 05, 2002 at 00:18:02, Robert Hyatt wrote:

>On June 04, 2002 at 22:07:42, Vincent Diepeveen wrote:
>
>>On June 04, 2002 at 20:34:20, Robert Hyatt wrote:
>>
>>>On June 04, 2002 at 19:05:43, Vincent Diepeveen wrote:
>>>
>>>>On June 04, 2002 at 15:34:38, Sean Mintz wrote:
>>>>
>>>>this is all useless. first you must move it into the mmx registers,
>>>>and at the same time your program can't use floating point (which
>>>>for chess programs won't be a problem, but for my other software this
>>>>is a major problem). so that's an extra instruction to get it into mmx,
>>>>then a context switch, and another extra instruction to get it back to
>>>>the normal registers.
>>>>
>>>>That's a waste of time.
>>>
>>>Note that you only have to do this _once_.  Then you can leave it alone
>>>since nobody is using the FP hardware in that process.  When you context-
>>>switch to another application, you have thousands of times as much overhead
>>>in doing _other_ things (flushing cache, etc) as you do in resetting the FP
>>>to normal mode.
>>
>>but for business applications where you DO have FPU all the time, this
>>is straight hell to use within the same application.
>
>OK.. are we talking about chess or business applications?  Switching the
>FP from FP mode to SIMD mode within a chess program needs to be done
>one time.  Who cares what happens when context-switching between a chess
>engine and a business application?  The FP switch there can be totally
>ignored compared to the time required for the context switch.

it is too much effort to rewrite my program in assembly. i assume
you don't want to fully rewrite crafty in assembly either.

for business applications that eat system time, however, no effort
can be spared.
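
For reference, the round trip discussed above (general register to MMX, one operation, back again, then resetting the FP state) can be written with compiler intrinsics instead of assembly. This is only a minimal sketch under assumptions: and_via_mmx is a hypothetical name, the 64-bit move intrinsics used here are only available in x86-64 builds, and a plain C fallback is compiled elsewhere. It shows the mechanics and makes no claim about speed.

```c
#include <stdint.h>

#if defined(__MMX__) && defined(__x86_64__)
#include <mmintrin.h>   /* MMX intrinsics: __m64, _mm_and_si64, _mm_empty */
#endif

/* Hypothetical helper: AND two 64-bit values by way of the MMX registers. */
static uint64_t and_via_mmx(uint64_t x, uint64_t y)
{
#if defined(__MMX__) && defined(__x86_64__)
    __m64 mx = _mm_cvtsi64_m64((long long)x);   /* general register -> MMX */
    __m64 my = _mm_cvtsi64_m64((long long)y);   /* general register -> MMX */
    __m64 mr = _mm_and_si64(mx, my);            /* the actual 64-bit AND   */
    uint64_t r = (uint64_t)_mm_cvtm64_si64(mr); /* MMX -> general register */
    _mm_empty();   /* EMMS: make the aliased x87 registers usable for FP again */
    return r;
#else
    return x & y;  /* portable fallback when MMX intrinsics are unavailable */
#endif
}
```

Calling and_via_mmx(0xF0F0, 0xFF00) gives 0xF000 on either path. The two moves in and the one move out are the extra instructions being argued about; _mm_empty() is the reset step, which only matters if floating-point code runs afterwards.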

>Therefore, I don't follow what a business application discussion has
>to do with anything here???
>
>
>
>>
>>>
>>>
>>>
>>>
>>>>
>>>>In short it is only helpful if you program in assembly and see the
>>>>registers as something extra to use.
>>>
>>>Or if the compiler suddenly becomes "aware"...
>>
>>no it won't soon. it will simply create more overhead using MMX.
>
>How "more overhead"???
>
>The Cray has some quirks with its large numbers of different kinds of
>registers...  yet moving things between registers is far faster than
>dealing with memory latency...
>
>
>
>>
>>it only gets useful when they add a bunch of instructions. so many that
>>you simply don't need to transfer results from mmx to e?x registers.
>>
>>>
>>>>
>>>>As soon as you get to the point where you have mixed data issues,
>>>>such as something in eax which you want to use to combine with mm1
>>>>then you have major problems as you gotta use extra instruction to
>>>>get eax into mm1.
>>>
>>>That is ok.  Compare the time to do that to the hundreds of clock cycles
>>>needed to get it from memory.
>>
>>Not a chance, it's already in one of the 44 renaming registers
>>of the K7 instead if you do not use the MM? registers.
>>
>
>Then why are you complaining about it??
>
>
>
>>>
>>>>
>>>>Now suppose you manage to get them to run independently, the question
>>>>then is, how do you time all of this in parallel?
>>>>
>>>>Because the mmx instructions can get executed at a different speed
>>>>than the normal register instructions.
>>>>
>>>>you don't want to already get a result from eax before the previous
>>>>instructions in eax have finished.
>>>>
>>>>you don't want to already zobrist hash this piece into mm1 before
>>>>you know for sure that the current hashing has been written to eax:edx
>>>>
>>>>Getting this all to work into a C program is *not* trivial.
>>>>
>>>>In fact it'll slow down one's program. We must wait for the Hammer
>>>>to be able to do more useful things, i fear.
>>>>
>>>>>I was talking w/ Aaron Gordon and he found some interesting stuff in the intel c
>>>>>compiler guide about ''intrinsics''.
>>>>>---------------
>>>>>''The major benefit of using intrinsics is that you now have access to key
>>>>>features that are not available using conventional coding practices. Intrinsics
>>>>>enable you to code with the syntax of C function calls and variables instead of
>>>>>assembly language. Most MMX™ technology, Streaming SIMD Extensions, and
>>>>>Streaming SIMD Extensions 2 intrinsics have a corresponding C intrinsic that
>>>>>implements that instruction directly. This frees you from managing registers and
>>>>>enables the compiler to optimize the instruction scheduling.
>>>>>
>>>>>The MMX technology and Streaming SIMD Extension instructions use the following
>>>>>new features:
>>>>>
>>>>>New Registers--Enable packed data of up to 128 bits in length for optimal SIMD
>>>>>processing.
>>>>>
>>>>>New Data Types--Enable packing of up to 16 elements of data in one register.''
>>>>>---------------
>>>>>Here are the data types:
>>>>>---------------
>>>>>''__m64 Data Type
>>>>>The __m64 data type is used to represent the contents of an MMX register, which
>>>>>is the register that is used by the MMX technology intrinsics. The __m64 data
>>>>>type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one
>>>>>64-bit value.
>>>>>
>>>>>__m128 Data Types
>>>>>The __m128 data type is used to represent the contents of a Streaming SIMD
>>>>>Extension register used by the Streaming SIMD Extension intrinsics. The __m128
>>>>>data type can hold four 32-bit floating values.
>>>>>
>>>>>The __m128d data type can hold two 64-bit floating-point values.
>>>>>
>>>>>The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two
>>>>>64-bit integer values.
>>>>>
>>>>>The compiler aligns __m128 local and global data to 16-byte boundaries on the
>>>>>stack. To align integer, float, or double arrays, you can use the declspec
>>>>>statement.''
>>>>>---------------
>>>>>Prototypes for these intrinsics and some related macros and constants are in the
>>>>>header file xmmintrin.h.
>>>>>
>>>>>I think it'd be interesting to see if any speedup can be achieved by using these
>>>>>data types. Can anyone run some tests to find out? It would seem to me that if
>>>>>we can hold 64 bit values (using __m64) then we should see a 2x speedup in some
>>>>>cases. Hope this helps some people.
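
As a starting point for the kind of test asked for above, here is a minimal sketch of the __m128i type holding two 64-bit integer values, used to AND two pairs of bitboards in one operation. The name and_two_bitboards is hypothetical, the SSE2 integer intrinsics are taken from emmintrin.h, and a plain C fallback is compiled when SSE2 is not enabled. Whether this is faster than two ordinary 64-bit ANDs is exactly the open question and would have to be measured.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 integer intrinsics: __m128i, _mm_and_si128 */
#endif

/* Hypothetical helper: out[i] = a[i] & b[i] for i = 0 and 1,
   i.e. two 64-bit bitboards processed side by side. */
static void and_two_bitboards(const uint64_t a[2], const uint64_t b[2],
                              uint64_t out[2])
{
#if defined(__SSE2__)
    /* Unaligned loads, so the arrays need no special 16-byte alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_and_si128(va, vb));
#else
    /* Portable fallback: two ordinary 64-bit ANDs. */
    out[0] = a[0] & b[0];
    out[1] = a[1] & b[1];
#endif
}
```

With a = {0xFF00, 0xFFFFFFFFFFFFFFFF} and b = {0x0FF0, 0x8000000000000001} the result is {0x0F00, 0x8000000000000001}. The loads and stores go through memory here, so the cost of getting data into and out of the SIMD registers is part of whatever gets timed.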
Last modified: Thu, 15 Apr 21 08:11:13 -0700