Computer Chess Club Archives



Subject: Re: Intel compiler, SIMD, and Bitboards

Author: Vincent Diepeveen

Date: 19:07:42 06/04/02


On June 04, 2002 at 20:34:20, Robert Hyatt wrote:

>On June 04, 2002 at 19:05:43, Vincent Diepeveen wrote:
>
>>On June 04, 2002 at 15:34:38, Sean Mintz wrote:
>>
>>This is all useless. First you must move the data into the mmx registers,
>>and at the same time your program can't use floating point (which
>>for chess programs won't be a problem, but for my other software this
>>is a major problem). So you pay an extra instruction to get it into mmx,
>>then a context switch, and another extra instruction to get it back to
>>the normal registers.
>>
>>That's a waste of time.
>
>Note that you only have to do this _once_.  Then you can leave it alone
>since nobody is using the FP hardware in that process.  When you context-
>switch to another application, you have thousands of times as much overhead
>in doing _other_ things (flushing cache, etc) as you do in resetting the FP
>to normal mode.

But for business applications where you DO use the FPU all the time, this
is straight hell to use within the same application.
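
A minimal sketch of the problem, assuming a compiler that ships mmintrin.h (the helper names and_bitboards and weighted_mobility are hypothetical, made up for this example). The MMX registers alias the x87 floating-point stack, so an EMMS (_mm_empty) has to sit between the MMX work and any x87 floating point that follows. On x86-64, double arithmetic normally goes through SSE rather than x87, so the clash mostly bites 32-bit x87 code; without MMX the sketch falls back to plain C.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__MMX__)
#include <mmintrin.h>
#define USE_MMX 1
#else
#define USE_MMX 0
#endif

/* Hypothetical helper: AND two 64-bit bitboards inside an MMX register. */
static uint64_t and_bitboards(uint64_t a, uint64_t b)
{
#if USE_MMX
    /* Move both operands from general registers into MMX registers. */
    __m64 ma = _mm_set_pi32((int)(uint32_t)(a >> 32), (int)(uint32_t)a);
    __m64 mb = _mm_set_pi32((int)(uint32_t)(b >> 32), (int)(uint32_t)b);
    __m64 mr = _mm_and_si64(ma, mb);

    /* Move the result back, 32 bits at a time. */
    uint32_t lo = (uint32_t)_mm_cvtsi64_si32(mr);
    uint32_t hi = (uint32_t)_mm_cvtsi64_si32(_mm_srli_si64(mr, 32));

    /* EMMS: hand the aliased registers back to x87 before any FP code. */
    _mm_empty();
    return ((uint64_t)hi << 32) | lo;
#else
    return a & b; /* portable fallback when MMX is not available */
#endif
}

/* Hypothetical evaluation term that mixes the bitboard AND with FP math. */
static double weighted_mobility(uint64_t attacks, uint64_t empty, double weight)
{
    assert(weight >= 0.0); /* assumed precondition for this sketch */
    uint64_t targets = and_bitboards(attacks, empty);
    int count = 0;
    while (targets) {          /* clear the lowest set bit until none remain */
        targets &= targets - 1;
        ++count;
    }
    return weight * (double)count; /* floating point, safe only after EMMS */
}
```

Every call pays for the moves in, the moves out, and the EMMS, which is the overhead being argued about here.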

>
>
>
>
>>
>>In short it is only helpful if you program in assembly and see the
>>registers as something extra to use.
>
>Or if the compiler suddenly becomes "aware"...

No, it won't any time soon. It will simply create more overhead using MMX.

It only gets useful when they add a bunch of instructions, so many that
you simply don't need to transfer results from the mmx to the e?x registers.

>
>>
>>As soon as you get to the point where you have mixed data issues,
>>such as something in eax which you want to use to combine with mm1
>>then you have major problems as you gotta use an extra instruction to
>>get eax into mm1.
>
>That is ok.  Compare the time to do that to the hundreds of clock cycles
>needed to get it from memory.

Not a chance. If you do not use the MM? registers, it's already in one of
the 44 renaming registers of the K7.
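
To make the mixed-data case concrete, here is a rough sketch that uses SSE2 intrinsics as a stand-in for the MMX case (an assumption; the function names mask_square_scalar and mask_square_simd are hypothetical). The scalar version never leaves the general registers. The SIMD version has to move both the board and the mask computed from the square index into vector registers, and then move the result back.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#define USE_SSE2 1
#else
#define USE_SSE2 0
#endif

/* Clear square sq from the board using general registers only. */
static uint64_t mask_square_scalar(uint64_t board, int sq)
{
    assert(sq >= 0 && sq < 64);
    return board & ~(1ULL << sq);
}

/* Same operation, but routed through SIMD registers. */
static uint64_t mask_square_simd(uint64_t board, int sq)
{
    assert(sq >= 0 && sq < 64);
#if USE_SSE2
    /* Two extra moves: general register -> vector register. */
    __m128i vboard = _mm_cvtsi64_si128((long long)board);
    __m128i vbit   = _mm_cvtsi64_si128((long long)(1ULL << sq));

    /* _mm_andnot_si128(a, b) computes (~a) & b. */
    __m128i vres = _mm_andnot_si128(vbit, vboard);

    /* One more move: vector register -> general register. */
    return (uint64_t)_mm_cvtsi128_si64(vres);
#else
    return board & ~(1ULL << sq); /* portable fallback without SSE2 */
#endif
}
```

Both functions return the same value; the difference is the three register-to-register transfers wrapped around a single logical instruction.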

>
>>
>>Now suppose you manage to get them to run independently. The question
>>then is: how do you time all of this in parallel?
>>
>>Because the mmx instructions can get executed at a different speed
>>than the normal register instructions.
>>
>>you don't want to already get a result from eax before the previous
>>instructions in eax have finished.
>>
>>you don't want to already zobrist hash this piece into mm1 before
>>you know for sure that the current hash has been written to eax:edx
>>
>>Getting this all to work into a C program is *not* trivial.
>>
>>In fact it'll slow down one's program. I fear we must wait for the Hammer
>>to be able to do more useful things.
>>
>>>I was talking w/ Aaron Gordon and he found some interesting stuff in the intel c
>>>compiler guide about ''intrinsics''.
>>>---------------
>>>''The major benefit of using intrinsics is that you now have access to key
>>>features that are not available using conventional coding practices. Intrinsics
>>>enable you to code with the syntax of C function calls and variables instead of
>>>assembly language. Most MMX™ technology, Streaming SIMD Extensions, and
>>>Streaming SIMD Extensions 2 intrinsics have a corresponding C intrinsic that
>>>implements that instruction directly. This frees you from managing registers and
>>>enables the compiler to optimize the instruction scheduling.
>>>
>>>The MMX technology and Streaming SIMD Extension instructions use the following
>>>new features:
>>>
>>>New Registers--Enable packed data of up to 128 bits in length for optimal SIMD
>>>processing.
>>>
>>>New Data Types--Enable packing of up to 16 elements of data in one register.''
>>>---------------
>>>Here are the data types:
>>>---------------
>>>''__m64 Data Type
>>>The __m64 data type is used to represent the contents of an MMX register, which
>>>is the register that is used by the MMX technology intrinsics. The __m64 data
>>>type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one
>>>64-bit value.
>>>
>>>__m128 Data Types
>>>The __m128 data type is used to represent the contents of a Streaming SIMD
>>>Extension register used by the Streaming SIMD Extension intrinsics. The __m128
>>>data type can hold four 32-bit floating values.
>>>
>>>The __m128d data type can hold two 64-bit floating-point values.
>>>
>>>The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two
>>>64-bit integer values.
>>>
>>>The compiler aligns __m128 local and global data to 16-byte boundaries on the
>>>stack. To align integer, float, or double arrays, you can use the declspec
>>>statement.''
>>>---------------
>>>Prototypes for these intrinsics and some related macros and constants are in the
>>>header file xmmintrin.h.
>>>
>>>I think it'd be interesting to see if any speedup can be achieved by using these
>>>data types. Can anyone run some tests to find out? It would seem to me that if
>>>we can hold 64 bit values (using __m64) then we should see a 2x speedup in some
>>>cases. Hope this helps some people.
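
For reference, a small sketch of what the quoted data types look like in use, assuming SSE2 and the header emmintrin.h (where the __m128i intrinsics are declared); the function name and_two_bitboards is hypothetical. One __m128i holds two 64-bit bitboards, so a single AND covers both. Whether that pays off depends on how much work the loads and stores add, which is the point of the disagreement in this thread.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Compute out[i] = a[i] & b[i] for i = 0, 1 (two bitboards per operand). */
static void and_two_bitboards(const uint64_t a[2], const uint64_t b[2],
                              uint64_t out[2])
{
    assert(a && b && out);
#if defined(__SSE2__)
    /* Unaligned loads, so the arrays need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* One 128-bit AND handles both 64-bit bitboards at once. */
    _mm_storeu_si128((__m128i *)out, _mm_and_si128(va, vb));
#else
    out[0] = a[0] & b[0]; /* portable fallback without SSE2 */
    out[1] = a[1] & b[1];
#endif
}
```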


