Computer Chess Club Archives



Subject: Re: AMD64 for chess

Author: Vincent Diepeveen

Date: 12:57:04 09/23/03


On September 23, 2003 at 15:24:34, Gerd Isenberg wrote:

>On September 23, 2003 at 12:14:15, Vincent Diepeveen wrote:
>
>>Hello,
>>
>>Many of you are looking forward to the new cheap 64 bits era with AMD64 being
>>the first 64 bits processor to get released. The economy is not booming bigtime.
>>We can't complain too loud about economy in the western world, but manufacturers
>>feel a big decline in sales when economy booms a little less than it used to do.
>>
>>Therefore there is a lot of interesting news on the hardware front. Even in 100
>>pages i could not even describe everything that's interesting to me and new, so
>>i'll just focus upon what is interesting for computerchess.
>>
>>Obvious is the movement in the highend. A few years ago there were big expensive
>>supercomputer processors which outgunned any cheap PC processor bigtime.
>>
>>That has changed for what we call 'integer' software. The 'highend'
>>processors still do well for floating point, especially some vector processors.
>>'Integers' are whole numbers: 1, 5, -4 etc. are all integers.
>>
>>But 0.01, 5.05, -0.05 and -0.0000000134 are all called floating point.
>>
>>There are many testsets which measure processor strengths. That's all not so
>>interesting. Not very interesting either is what we call SSE2/SSE. SSE is 128
>>bits stuff.
>>
>>As you can read at www.chip-architect.com, in a big technical analysis of the
>>opteron/AMD64, those special instructions with a lot of bits
>>are very slow.
>>
>>A 2Ghz clocked opteron is delivering 2 billion 'clocks' or 'cycles' a second.
>>
>>Trivially the more basic instructions we can execute, the better.
>>
>>We see a lot of discussions at the CCC regarding using SSE2/SSE for chess.
>>
>
>Hi Vincent,
>
>i guess you address me.
>Yes, but not exclusively of course.
>I will give SSE2 a try for some very selected issues.
>Funny that you care about it ;-)

Not only you. You would be amazed how many people ask about SSE.

Intel has been shouting it from the rooftops for years, just as it did with
SMT/HT. HT does seem improved in the P4 EE: DIEP nowadays gains 25% from it on
a single-CPU P4 EE 3.4GHz. Of course that took a lot of effort on my side
too.

For SSE that is not the case. It is too slow for computer chess.
A factor of 10 slower than normal register operations is a fair comparison.

That you can very occasionally slip one in 'for free' in between is no excuse.

>>However i must always laugh loud when i see that. Let's quote Hans de Vries
>>regarding the new intel prescott:
>>
>>"This would bring back the SSE2 latencies for Add and Multiply to 5 and 7
>>cycles"
>>
>>Note that at the current P4 it is 25% slower than that.
>>
>>A good chessprogram can however execute up to 3 'integer' instructions a cycle.
>>
>>So that's like 10 times faster on average than using SSE2, even despite that you
>>can use it in theory 'simultaneously'.
>
>
>I agree, that SSE2 integer instructions are relative slow compared to gp.
>A few aspects:

>First, SSE2 integer instructions are SIMD - single instructions, multiple data.
>That means two 64-bit (u)ints, e.g. bitboards - or four 32-bit ints, eight
>16-bit ints or 16 bytes. If you want to add some appropriate arrays (with
>saturation!), SSE2 may be a good idea.

In chess you don't know in advance whether you are going to add that stuff. If
you did know, you could already have programmed it incrementally, which is
faster.

A good example is Crafty's primitive mobility. Why not do that incrementally?

That goes *at least* 2 times faster.
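
A minimal sketch in C of what "incremental" means here, assuming a made-up per-square bonus rather than a real mobility count (real mobility also depends on which squares are blocked, so its incremental update is more involved). The names square_bonus, score_from_scratch and make_move are hypothetical and do not come from Crafty or DIEP.

```c
#include <assert.h>

/* Hypothetical per-square bonus for one piece type: squares nearer the
   centre score higher. Not taken from Crafty or DIEP. */
static int square_bonus(int sq)
{
    int file = sq & 7, rank = sq >> 3;
    int fd = file < 4 ? file : 7 - file;   /* distance from edge, 0..3 */
    int rd = rank < 4 ? rank : 7 - rank;
    return fd + rd;
}

/* From scratch: walk all 64 squares every time the term is needed. */
static int score_from_scratch(const unsigned char occupied[64])
{
    int sq, sum = 0;
    for (sq = 0; sq < 64; sq++)
        if (occupied[sq])
            sum += square_bonus(sq);
    return sum;
}

/* Incremental: a move touches only two squares, so adjust the running
   score by the difference instead of recomputing it. */
static int make_move(unsigned char occupied[64], int score, int from, int to)
{
    assert(occupied[from] && !occupied[to]);
    occupied[from] = 0;
    occupied[to] = 1;
    return score - square_bonus(from) + square_bonus(to);
}
```

The caller keeps the score returned by make_move between moves. The from-scratch routine is then only needed at setup, or as a debugging check that both agree.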

>Second, most SSE2 instructions are currently double direct path instructions,
>which will take two macro-ops, instead of one (e.g. for MMX), but see below -
>there is some hope that they become faster in future hammers.

It also means you can't continuously execute SSE2 instructions.

A penalty of, say, 7 cycles for one such instruction is like a death
penalty.
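
To put a number on that, a rough sketch in C, assuming the 3 integer instructions per cycle quoted earlier and treating the whole latency as lost time. That is a worst case, since independent work can overlap with it. The function name is hypothetical.

```c
/* Integer instruction issue slots that pass while one instruction with
   the given latency completes, assuming the core sustains `ipc` integer
   instructions per cycle and nothing else overlaps (worst case). */
static int issue_slots_during(int latency_cycles, int ipc)
{
    return latency_cycles * ipc;
}
```

With a latency of 7 cycles and 3 instructions per cycle this gives 21 slots. With 5 cycles it gives 15.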

>Third, there are even three Floating-Point Execution Units, for X87|MMX|3DNow
>and SSE/SSE2: FADD, FMUL and FSTORE. In addition to the three gp-instructions
>per cycle the processor may partially execute some independent SSE2 instructions
>out of order - especially stores (FSTORE) and loads bypassing the L1 cache -
>don't forget PREFETCHNTA.

I understand why you bring up the FPU, but chess and floating point
have little to do with each other. In fact it is possible to write any
chess program without a single floating point instruction.

You know that and I know that.

For some math software it's interesting though.

>The logical/arithmetical SSE2 instructions i intend to use for Kogge-Stone are
>pand,por,pxor,padd,psub and some shifts which may be executed by FADD or either by

You still should run my move generator on the A64. Please do that.

You really should approach it from the other side.

Count how many nanoseconds per move get spent in my publicly posted move
generator/activity functions.

From that you can calculate how few SSE2 instructions you are allowed to
execute.

Then you'll fully understand that it is completely useless to go for SSE2.
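
The calculation can be sketched in C like this. The figures in the usage note (50 ns per move, 2000 MHz) are placeholders picked for illustration, not measurements of DIEP, and both function names are hypothetical.

```c
/* Cycles available in a time window. 1000 MHz is one cycle per
   nanosecond, so cycles = ns * MHz / 1000. Integer arithmetic only. */
static long cycles_in_window(long nanoseconds, long clock_mhz)
{
    return nanoseconds * clock_mhz / 1000;
}

/* How many back-to-back dependent instructions of a given latency fit
   in that window. Integer division drops the remainder. */
static long max_dependent_ops(long nanoseconds, long clock_mhz,
                              int latency_cycles)
{
    return cycles_in_window(nanoseconds, clock_mhz) / latency_cycles;
}
```

For the placeholder 50 ns at 2000 MHz the window is 100 cycles, which leaves room for 14 dependent instructions of 7 cycles each. Replace the 50 ns with a measured time per move to get the real window.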

>FMUL unit. I do have some practical experience with KoggeStone and MMX (Athlon)
>and have some severe expectations with SSE2 mixing with general purpose
>registers.
>Of course i may have some difficulties with MSC for AMD64 to mix C-statements
>with SSE2-intrinsics - but i will try and see (Maybe a tiny KoggeStone source
>code generator?). I will do a pure C-version too. I'm no dogmatist but
>pragmatist. If SSE2 doesn't pay off, a bummer - a nice try and fun.
>But i don't believe - dreaming from one fill iteration in up to eight directions
>for two disjoint pieces in about 2 cycles (gp/SSE2) huuuh...

>That's me ;-)

Yes, you get confused by the seeming possibilities. But once you understand
the window you have, namely how few SSE2 instructions you are allowed to use
before you exceed that number of nanoseconds, you'll realize that it's not
worth wasting time on SSE2 for computer chess.
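
For reference, the Kogge-Stone fill under discussion can be written in plain C with 64-bit general purpose operations only. A minimal sketch for one direction, assuming a bitboard layout with bit 0 = a1 and bit 8 = a2. The function names are illustrative and not from either program. Going north needs no file-wrap masks, while the east, west and diagonal directions do.

```c
#include <stdint.h>

/* Kogge-Stone occluded fill towards the north (higher ranks).
   `gen` holds the sliding pieces, `pro` the empty squares.
   Returns the sliders plus every empty square they reach going north.
   Three doubling steps (8, 16, 32) cover up to seven ranks. */
static uint64_t north_occluded_fill(uint64_t gen, uint64_t pro)
{
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* Squares attacked going north: shift the fill one more rank, so the
   first blocker is included and the slider's own square is not. */
static uint64_t north_attacks(uint64_t sliders, uint64_t empty)
{
    return north_occluded_fill(sliders, empty) << 8;
}
```

A rook on a1 with a blocker on a4 fills a1, a2, a3 and attacks a2, a3, a4.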

>Another try to profit from these ressources may be to ignore the whole SSE2
>integer stuff which requires intrinsics or assembler and to do some floating
>point in your eval, which is SSE/SSE2 - only pure c.

I do have a few divisions in DIEP's eval at certain collection points. I plan to
replace those after the world championship with shifts by a power of 2. That
requires rewriting the bonuses too.
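
A sketch in C of that kind of rewrite, with a made-up divisor of 10 that becomes 16. None of the names or values are from DIEP. It also shows why the bonuses have to change: a different divisor needs a rescaled input to give about the same result.

```c
#include <assert.h>

/* Original style: a collected bonus scaled by an arbitrary divisor
   (10 is a hypothetical value). */
static int scale_by_division(int raw)
{
    return raw / 10;
}

/* Rewritten style: the divisor becomes a power of two (16 = 1 << 4), so
   the division becomes a shift. As written this is only for non-negative
   input: '/' truncates toward zero, while right-shifting a negative
   signed value is implementation-defined in C. */
static int scale_by_shift(int raw)
{
    assert(raw >= 0);
    return raw >> 4;
}

/* The divisor changed from 10 to 16, so the bonus feeding it is
   rescaled by 16/10 to keep roughly the same outcome. */
static int rescale_bonus(int bonus)
{
    return bonus * 16 / 10;
}
```

A bonus of 200 gives 20 through the division. Rescaled to 320 it gives 20 again through the shift.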

On the Opteron that will save me a couple of hundred cycles.

Out of the 50000 or so that 1 evaluation costs :)

Forget SSE2. It's just too slow.

They will clock the A64 higher and higher and move to 0.09 micron as soon as
possible. It won't get faster. Only the clock speed will go up a lot.

The Opteron will profit far more from 0.09 than the P4 will.

The P4 scales perhaps from 3.4GHz to 3.8GHz.

The Opteron will go from 2.2GHz to 3GHz *directly*.

This while they are making incredible efforts right now to reach the 2.4, 2.6
and perhaps even 2.8GHz marks at 0.13.

So moving to 0.09 is a zillion times more important than getting SSE2
instructions faster. They are already faster than the P4 ones.

It's just that the P4 is clocked so incredibly much higher. Like 50% or so.

>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
>25112 Rev. 3.02 July 2003
>---------------------------------------------------------------------------
>Chapter 9 Optimizing with SIMD Instructions (Pg. 195)
>
>The 64-bit and 128-bit SIMD instructions—SSE and SSE2 instructions—should be
>used to encode floating-point and integer operation.
>
>...
>
>• Future processors with more or wider multipliers and adders will achieve
>better throughput using SSE and SSE2 instructions. (Today’s processors implement
>a 128-bit-wide SSE or SSE2 operation as two 64-bit operations that are
>internally pipelined.)
>---------------------------------------------------------------------------
>
>
>>
>>So for chess the only interesting thing is the integer speed. Integers can be 8
>>bits, 16 bits, 32 bits and nowadays at opteron also 64 bits.
>>
>>A lot of 'commercial' chess programs as well as what officially is called
>>'strong amateur programs', doesn't necessary say a word on which of the 2 is
>>stronger, mix 8 bits code with 32 bits code a lot.
>>
>>I found out at the opteron (but i could have been misguided of course) that this
>>is NOT a good idea to do.
>>
>>At K7 8 bits code is very fast, at P4 it's already not so interesting to use,
>>but at opteron it's seemingly a lot slower.
>>
>>DIEP doesn't have that problem. DIEP is 32 bits *all the way*.
>>
>
>
>That's indeed a very good thing, that Opteron fully supports 32-bit ints without

More important is that most commercial software tested on the Opteron so far
still has the crippling 8 bits code and all kinds of code mixing.

Of course whether it gets compiled in 64 bits mode with 16 registers or in 32
bits mode with 8 registers matters a lot for speed (8 extra registers
would be really cool).

Yet the most important thing, trivially, is to not suffer from huge penalties,
and that is exactly what it doesn't suffer from.

>further penalties. Directly via mov EAX...EDI, without changing the upper
>32-bits of the correspondent RAX...RDI 64-bit registers.
>And there is still movzx/movsx for R08...R15.
>
>>With exception of my nodescount, you won't find much 64 bits code YET in DIEP.
>>
>>Therefore it is ideal to run on hardware like the opteron/amd64.
>>
>>The speed of the AMD64 is very convincing. I need to add one big note and that's
>>that the latest P4s do a lot better than older P4s. I do not know yet what they
>>modified at the cores, but it's doing a lot better than it used to do.
>>
>>At aceshardware.com you can see the results. Note that for the P4 SMP version
>>was used, not a NUMA version. Also different versions were used to get faster on
>>the P4 and dual P4 Xeon.
>>
>>I managed to improve diep a lot to run faster on the dual P4 Xeon 3.06Ghz and it
>>manages with 4 'threads' a speed of 227k nps.
>>
>>If we consider that a single cpu AMD64 2.4Ghz already gets 149k nps with the
>>same executable, then i don't need to comment much more.
>>
>>If we compare that with 'old' P4 3.06Ghz which gets 89k nps with this version
>>and the athlon 2.127Ghz MP2600 which gets single cpu 95k nps, then it is
>>needless to say that the AMD64 is a big winner.
>>
>>It's 50% faster than my K7, which is the highest clocked MP version (MP2800
>>isn't clocked higher).
>>
>>For more details just look at aceshardware.com, my own impression of what was
>>improved at the AMD64 is especially the branch prediction. As if it hardly
>>suffers from branchmispredictions. That's really amazing.
>>
>>Real new it isn't, but they got it to work great at the AMD64. This in
>>combination with a larger branch prediction table and all kind of other
>>advantages is real great.
>
>Yes, but hopefully avoiding branches still pays off ;-)

All avoidable branches in DIEP are gone. See my move generator's data structure.
It's superior to anything that you can come up with in ANSI C.
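
As a generic illustration of removing a branch with a data structure, here is a small C sketch. It is not DIEP's move generator data structure. The names are hypothetical and the side encoding (0 = white, 1 = black) is an assumption.

```c
/* Branchy version: the direction a pawn advances depends on the side. */
static int pawn_push_branchy(int side)          /* 0 = white, 1 = black */
{
    if (side == 0)
        return 8;
    return -8;
}

/* Table-driven version: the side indexes a small constant table, so the
   source contains no conditional at all. */
static const int pawn_push_table[2] = { 8, -8 };

static int pawn_push_lookup(int side)
{
    return pawn_push_table[side];
}
```

Whether the first version really compiles to a branch depends on the compiler and its settings. The table version takes that decision away from it.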

>Cheers,
>Gerd
>>Next posting: GCC at the quad opteron



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.