Computer Chess Club Archives


Subject: Re: AMD64 for chess

Author: Vincent Diepeveen

Date: 14:43:50 09/23/03


On September 23, 2003 at 16:38:49, Gerd Isenberg wrote:

>On September 23, 2003 at 15:57:04, Vincent Diepeveen wrote:
>
>>On September 23, 2003 at 15:24:34, Gerd Isenberg wrote:
>>
>>>On September 23, 2003 at 12:14:15, Vincent Diepeveen wrote:
>>>
>>>>Hello,
>>>>
>>>>Many of you are looking forward to the new, cheap 64-bit era, with AMD64 being
>>>>the first such 64-bit processor to be released. The economy is not booming.
>>>>We can't complain too loudly about the economy in the western world, but
>>>>manufacturers feel a big decline in sales when the economy booms a little less
>>>>than it used to.
>>>>
>>>>Therefore there is a lot of interesting news on the hardware front. Even in 100
>>>>pages i could not describe everything that's new and interesting to me, so
>>>>i'll just focus on what is interesting for computerchess.
>>>>
>>>>Obvious is the movement in the highend. A few years ago there were big expensive
>>>>supercomputer processors which outgunned any cheap PC processor bigtime.
>>>>
>>>>That has changed for what we call 'integer' software. The 'highend' processors
>>>>still do well for floating point, especially some vector processors, but no
>>>>longer for 'integers'. Integers are whole numbers: 1, 5, -4 and so on.
>>>>
>>>>Numbers like 0.01, 5.05, -0.05 or -0.0000000134 are all called floating point.
>>>>
>>>>There are many testsets which measure processor strength. That's not so
>>>>interesting. Not very interesting either is what we call SSE/SSE2. SSE is the
>>>>128 bits stuff.
>>>>
>>>>At www.chip-architect.com you can read a big technical analysis of the
>>>>opteron/AMD64, where you can see that those special instructions with a lot of
>>>>bits are very slow.
>>>>
>>>>A 2Ghz clocked opteron delivers 2 billion 'clocks' or 'cycles' a second.
>>>>
>>>>Trivially the more basic instructions we can execute, the better.
>>>>
>>>>We see a lot of discussions at the CCC regarding using SSE2/SSE for chess.
>>>>
>>>
>>>Hi Vincent,
>>>
>>>i guess you address me.
>>>Yes, but not exclusively of course.
>>>I will give SSE2 a try for some very selected issues.
>>>Funny that you care about it ;-)
>>
>>Not only you; you would be amazed if you knew how many people ask about SSE.
>>
>>Intel has been shouting it from the rooftops for years, just like SMT/HT, which
>>seems improved in the P4 EE; at least diep profits 25% from it nowadays on a
>>single cpu P4 EE 3.4Ghz. Of course that involved a lot of effort from my side
>>too.
>>
>>Yet for SSE that is not the case. Too slow for computerchess.
>>A factor of 10 slower than normal register operations is a fair comparison.
>
>From the manual:
>
>Syntax                Decode type  FPU pipe(s)  Latency
>PAND xmmreg1,xmmreg2  Double       FADD/FMUL    2
>AND  mreg64, reg64    DirectPath   -            1
>
>Do you remember my loop test with MMX Kogge-Stone queen attacks?
>With rather independent instruction chains it reached two MMX instructions
>(latency 2 cycles) per cycle! So i expect up to one SSE2 instruction per cycle.
>
>>
>>That you can very occasionally do one 'for free' in between is no excuse.
>>
>>>>However i always have to laugh out loud when i see that. Let's quote Hans de
>>>>Vries regarding the new intel prescott:
>>>>
>>>>"This would bring back the SSE2 latencies for Add and Multiply to 5 and 7
>>>>cycles"
>>>>
>>>>Note that on the current P4 it is 25% slower than that.
>>>>
>>>>A good chessprogram can however execute up to 3 'integer' instructions a cycle.
>>>>
>>>>So that's like 10 times faster on average than using SSE2, even though you
>>>>can in theory use it 'simultaneously'.
>>>
>>>
>>>I agree that SSE2 integer instructions are relatively slow compared to gp.
>>>A few aspects:
>>
>>>First, SSE2 integer instructions are SIMD - single instruction, multiple data.
>>>That means two 64-bit (u)ints, e.g. bitboards - or four 32-bit ints, eight
>>>16-bit ints or 16 bytes. If you want to add some appropriate arrays (with
>>>saturation!), SSE2 may be a good idea.
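A minimal sketch of the array case Gerd describes above: adding two arrays of 16-bit ints with saturation, eight lanes per SSE2 instruction. The helper name `add_sat_i16` is hypothetical and not from the thread. The scalar loop handles the tail and also serves as the fallback when the compiler does not define `__SSE2__` (an assumption about how the build detects SSE2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = saturate(a[i] + b[i]) for n 16-bit values. */
void add_sat_i16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#if defined(__SSE2__)
    /* Eight 16-bit lanes per 128-bit register; PADDSW saturates each lane. */
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epi16(va, vb));
    }
#endif
    /* Scalar tail (and full fallback without SSE2): clamp to the int16 range. */
    for (; i < n; i++) {
        int32_t s = (int32_t)a[i] + (int32_t)b[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        dst[i] = (int16_t)s;
    }
}
```

Whether this beats a plain integer loop in a chess program is exactly what the thread disputes; the sketch only shows what "add arrays with saturation" looks like in code.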
>>
>>In chess you don't know in advance whether you are going to add that stuff. If
>>you knew it, you could already have programmed it incrementally, which is
>>faster.
>>
>>A good example is Crafty's primitive mobility. Why not do that incrementally?
>>
>>It goes *at least* 2 times faster.
>>
>>>Second, most SSE2 instructions are currently double direct path instructions,
>>>which will take two macro-ops, instead of one (e.g. for MMX), but see below -
>>>there is some hope that they become faster in future hammers.
>>
>>It also means you can't continuously execute SSE2 instructions.
>>
>>A penalty of, say, 7 cycles for 1 such instruction is like a death
>>penalty.
>>
>>>Third, there are even three Floating-Point Execution Units for X87|MMX|3DNow
>>>and SSE/SSE2: FADD, FMUL and FSTORE. In addition to the three gp instructions
>>>per cycle, the processor may partially execute some independent SSE2
>>>instructions out of order - especially stores (FSTORE) and loads bypassing the
>>>1st-level cache - don't forget PREFETCHNTA.
>>
>>I do understand why you touch the FPU subject; however, chess and floating point
>>do not have much to do with each other. In fact it is possible to write any
>>chessprogram without a single floating point instruction.
>>
>>You know that and i know that.
>>
>>For some math software it's interesting though.
>>
>>>The logical/arithmetical SSE2 instructions i intend to use for Kogge-Stone are
>>>pand, por, pxor, padd, psub and some shifts, which may be executed either by
>>>the FADD or by the
>>
>>You should still run my move generator on the A64. Please do that.
>
>It will take some time until i get one - i have had some other expenses recently.
>
>>
>>You really should approach it from the other side than you do now.
>>
>>Count how many nanoseconds per move get spent in my publicly posted move
>>generator/activity functions.
>>
>>From that you can calculate how few SSE2 instructions you are allowed to
>>execute.
>>
>>Then you'll fully understand that it is completely useless to go for SSE2.
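The budget argument above is plain arithmetic, sketched here under stated assumptions. The function names are hypothetical, and the 50 ns window in the usage note is a made-up figure, not a measurement of any move generator. The latencies used are the ones mentioned in this thread (2 cycles for PAND from the manual excerpt, "say 7 cycles" as a penalty).

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available in a time window: ns * GHz, written as ns * MHz / 1000
   so everything stays in integers. */
uint64_t cycles_in_window(uint64_t nanoseconds, uint64_t clock_mhz)
{
    return nanoseconds * clock_mhz / 1000u;
}

/* Upper bound on a chain of *dependent* instructions that fits in the
   budget when each one waits latency_cycles for its input. Independent
   instructions can overlap, so this is the pessimistic case. */
uint64_t max_dependent_ops(uint64_t cycle_budget, uint64_t latency_cycles)
{
    assert(latency_cycles > 0);
    return cycle_budget / latency_cycles;
}
```

For example, a hypothetical 50 ns per move at 2000 MHz gives `cycles_in_window(50, 2000)` = 100 cycles, which leaves room for at most 50 dependent 2-cycle instructions, or 14 at 7 cycles each. One full second at 2000 MHz gives 2,000,000,000 cycles.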
>>
>>>FMUL unit. I do have some practical experience with KoggeStone and MMX (Athlon)
>>>and have some serious expectations of SSE2 mixed with general purpose
>>>registers.
>>>Of course i may have some difficulties with MSC for AMD64 mixing C statements
>>>with SSE2 intrinsics - but i will try and see (maybe a tiny KoggeStone source
>>>code generator?). I will do a pure C version too. I'm no dogmatist but a
>>>pragmatist. If SSE2 doesn't pay off, a bummer - a nice try and fun.
>>>But i don't believe so - dreaming of one fill iteration in up to eight
>>>directions for two disjoint pieces in about 2 cycles (gp/SSE2) huuuh...
>>
>>>That's me ;-)
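For readers who have not seen a Kogge-Stone fill, here is a minimal pure C sketch for one direction (north), in the commonly published parallel-prefix form. It is not Gerd's code. It assumes a bitboard layout where a1 is bit 0 and each rank is 8 bits higher, and the function names are hypothetical. The other seven directions follow the same pattern, with file-wrap masks for shifts that cross the a/h files.

```c
#include <assert.h>
#include <stdint.h>

/* Fill north from the generator squares through empty squares.
   gen: sliding pieces, pro: propagator (empty squares).
   Three steps cover distances 1, 2 and 4, so up to 7 squares in total. */
uint64_t north_occluded_fill(uint64_t gen, uint64_t pro)
{
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* North attacks: one more step past the fill, so the first blocker
   (own or enemy piece) is included in the attack set. */
uint64_t north_attacks(uint64_t sliders, uint64_t empty)
{
    assert((sliders & empty) == 0);  /* a slider's square is not empty */
    return north_occluded_fill(sliders, empty) << 8;
}
```

The fill works on all sliders in the bitboard at once and needs no lookup tables, which is what makes it a candidate for wider registers: the same shift/and/or sequence maps onto MMX or SSE2 instructions.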
>>
>>Yeah, you get confused by the seeming possibilities there are. But only when
>>you understand the window that you have, namely how few SSE2 instructions you
>>are allowed to use before you are over that number of nanoseconds, only then
>>will you realize that it's not worth wasting time on SSE2 for computerchess.
>>
>>>Another try to profit from these resources may be to ignore the whole SSE2
>>>integer stuff, which requires intrinsics or assembler, and to do some floating
>>>point in your eval, which is SSE/SSE2 - in pure C only.
>>
>>I do have a few divisions in DIEP's EVAL at certain collection points. I plan to
>>replace those after the world championship with shifts by a power of 2. That
>>requires rewriting the bonuses too.
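A minimal sketch of what replacing a division by a shift can look like, assuming the divisor has been redesigned to be a power of two. The function name is hypothetical and nothing here is taken from DIEP. The sketch shifts the magnitude and restores the sign, so the result matches C's `/` (truncation toward zero) without relying on how the compiler shifts negative values.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: score / 2^shift, truncating toward zero like '/'.
   Works on the magnitude so no negative value is ever shifted. */
int scale_down_pow2(int score, unsigned shift)
{
    unsigned magnitude;
    int quotient;
    assert(shift < 31);
    assert(score > INT_MIN);  /* -INT_MIN would overflow */
    magnitude = score < 0 ? (unsigned)(-score) : (unsigned)score;
    quotient = (int)(magnitude >> shift);
    return score < 0 ? -quotient : quotient;
}
```

For example, `scale_down_pow2(100, 3)` gives the same result as `100 / 8`. Compilers typically perform this transformation themselves when the divisor is a compile-time constant power of two, so the gain is mainly in cases where the divisor is only known at run time.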
>
>Or use double on opteron. I guess such a collection may profit from SIMD
>SSE/SSE2-instructions.
>
>>
>>In the future that will save me a couple of hundred cycles on opteron.
>
>For sure.
>Those 39/71 (DIV) to 42/74 (IDIV) cycles latency are even exclusive due to
>vector path.
>
>>
>>Out of 50000 or so that 1 evaluation costs :)
>>
>>Forget SSE2. It's just too slow.
>>
>
>Not so sure...
>
>
>>They will clock the a64 higher and higher and move to 0.09 as soon as possible.
>>It won't get faster. Only the clockspeed will get faster a lot.
>>
>>Opteron will profit way more from 0.09 than P4 will.
>>
>>P4 scales perhaps from 3.4Ghz to 3.8Ghz
>>
>>Opteron will go from 2.2Ghz to 3Ghz *directly*.
>>
>>This while they are making incredible efforts now to reach the 2.4, 2.6 and
>>perhaps even 2.8Ghz borders in 0.13.
>>
>>So moving to 0.09 is a zillion times more important than getting SSE2
>>instructions faster. They are already faster than the P4 ones.
>>
>>It's just that the P4 is clocked so incredibly much higher. Like 50% or so.
>
>Ok, SSE2 will profit too from higher clock speeds.
>A quadratic improvement?
>
>
><snip>

You still don't get it. You're betting on the wrong horse!

SSE2 simply will *never* execute more than 1 instruction a cycle.

Meanwhile it is obvious that in the next generation of processors, after
opteron, the IPC for integer instructions will go up and up.

So you always lose relative to integer performance!

Best regards,
Vincent



Last modified: Thu, 15 Apr 21 08:11:13 -0700
