Computer Chess Club Archives



Subject: Re: AMD64 for chess

Author: Robert Hyatt

Date: 13:11:24 09/23/03



On September 23, 2003 at 15:57:04, Vincent Diepeveen wrote:

>On September 23, 2003 at 15:24:34, Gerd Isenberg wrote:
>
>>On September 23, 2003 at 12:14:15, Vincent Diepeveen wrote:
>>
>>>Hello,
>>>
>>>Many of you are looking forward to the new cheap 64 bits era, with AMD64 being
>>>the first 64 bits processor to get released. The economy is not booming bigtime.
>>>We can't complain too loudly about the economy in the western world, but
>>>manufacturers feel a big decline in sales when the economy booms a little less
>>>than it used to.
>>>
>>>Therefore there is a lot of interesting news on the hardware front. Even in 100
>>>pages i could not describe everything that's new and interesting to me, so i'll
>>>just focus on what is interesting for computerchess.
>>>
>>>The movement at the high end is obvious. A few years ago there were big
>>>expensive supercomputer processors which outgunned any cheap PC processor
>>>bigtime.
>>>
>>>That has changed for what we call 'integer' software. The 'highend' processors
>>>still do well for floating point, especially some vector processors. Integers
>>>are the whole numbers: 1, 5, -4 and so on.
>>>
>>>Numbers like 0.01, 5.05, -0.05 or -0.0000000134 are what we call floating point.
>>>
>>>There are many testsets which measure processor strengths. That's all not so
>>>interesting. Not very interesting either is what we call SSE2/SSE. SSE is 128
>>>bits stuff.
>>>
>>>As you can read in the big technical analysis of the opteron/AMD64 at
>>>www.chip-architect.com, those special instructions with a lot of bits are very
>>>slow.
>>>
>>>A 2Ghz clocked opteron is delivering 2 billion 'clocks' or 'cycles' a second.
>>>
>>>Trivially the more basic instructions we can execute, the better.
>>>
>>>We see a lot of discussions at the CCC regarding using SSE2/SSE for chess.
>>>
>>
>>Hi Vincent,
>>
>>i guess you address me.
>>Yes, but not exclusively of course.
>>I will give SSE2 a try for some very selected issues.
>>Funny that you care about it ;-)
>
>Not only you; you would be amazed if you knew how many people ask about SSE.
>
>Intel has been shouting it from the rooftops for years, just like that SMT/HT,
>which seems improved in the P4 EE; at least diep profits 25% from it nowadays on
>a single cpu P4 EE 3.4Ghz. Of course that involved a lot of effort from my side
>too.
>
>Yet for SSE that is not the case. It is too slow for computerchess.
>A factor 10 slower than normal register operations is a fair comparison.
>
>That you can very occasionally do one 'for free' in between is no excuse.
>
>>>However i must always laugh out loud when i see that. Let's quote Hans de Vries
>>>regarding the new intel prescott:
>>>
>>>"This would bring back the SSE2 latencies for Add and Multiply to 5 and 7
>>>cycles"
>>>
>>>Note that at the current P4 it is 25% slower than that.
>>>
>>>A good chessprogram can however execute up to 3 'integer' instructions a cycle.
>>>
>>>So that's like 10 times faster on average than using SSE2, even though in
>>>theory you can use it 'simultaneously'.
>>
>>
>>I agree that SSE2 integer instructions are relatively slow compared to gp.
>>A few aspects:
>
>>First, SSE2 integer instructions are SIMD - single instruction, multiple data.
>>That means two 64-bit (u)ints, e.g. bitboards - or four 32-bit ints, eight
>>16-bit ints or 16 bytes. If you want to add some appropriate arrays (with
>>saturation!), SSE2 may be a good idea.
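
(As a minimal illustration of the SIMD idea Gerd describes - the function and
array names below are made up, not taken from any engine - one SSE2 instruction
adds 16 byte counters at once, with unsigned saturation:)

  #include <emmintrin.h>   /* SSE2 intrinsics */

  /* Add 16 unsigned byte counters in one shot, saturating at 255. */
  void add_counters(unsigned char *out,
                    const unsigned char *a,
                    const unsigned char *b)
  {
      __m128i va = _mm_loadu_si128((const __m128i *)a);
      __m128i vb = _mm_loadu_si128((const __m128i *)b);
      _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
  }
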
>
>In chess you don't know in advance whether you are going to add that stuff. If
>you knew it, you could already have programmed it incrementally, which is
>faster.
>
>A good example is Crafty's primitive mobility. Why not do that incrementally?
>
>It goes *at least* 2 times faster.

Horsehocky.  Exactly how long does it take to read two bytes of memory to get
the mobility for two diagonals?  Also don't forget lazy eval.  I don't do the
mobility stuff in 90% of the positions scored.
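
(A rough sketch of the two things Bob refers to - the table and function names
below are hypothetical, not Crafty's actual code. With rotated bitboards the
occupancy of one diagonal fits in a byte, so mobility can be precomputed per
(square, occupancy) pair and fetched with a byte load; and lazy eval skips such
terms when a cheap bound already lies far outside the window:)

  /* filled once at startup: mobility for every square/occupancy pair */
  extern unsigned char diag_mobility[64][256];

  int bishop_mobility(int sq, unsigned char occ_a1h8, unsigned char occ_h1a8)
  {
      /* two byte loads, no loop at evaluation time */
      return diag_mobility[sq][occ_a1h8] + diag_mobility[sq][occ_h1a8];
  }

  int evaluate_lazy(int alpha, int beta, int fast_score, int margin)
  {
      /* if even a generous margin cannot bring the score back inside the
         alpha/beta window, skip mobility and the other slow terms */
      if (fast_score + margin <= alpha || fast_score - margin >= beta)
          return fast_score;
      /* ... otherwise compute mobility etc. and fold it into the score ... */
      return fast_score;
  }
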

>
>>Second, most SSE2 instructions are currently double direct path instructions,
>>which take two macro-ops instead of one (as e.g. for MMX), but see below - there
>>is some hope that they become faster in future Hammers.
>
>It also means you can't continuously execute SSE2 instructions.
>
>A penalty of say 7 cycles for 1 such instruction is like a death penalty.
>
>>Third, there are even three Floating-Point Execution Units, for X87|MMX|3DNow
>>and SSE/SSE2: FADD, FMUL and FSTORE. In addition to the three gp-instructions
>>per cycle, the processor may partially execute some independent SSE2
>>instructions out of order - especially stores (FSTORE) and loads bypassing the
>>level-1 cache - don't forget PREFETCHNTA.
>
>I do understand why you touch the FPU subject, however chess and floating point
>do not have much to do with each other. In fact it is possible to write any
>chessprogram without a single floating point instruction.
>
>You know that and i know that.
>
>For some math software it's interesting though.
>
>>The logical/arithmetical SSE2 instructions i intend to use for Kogge-Stone are
>>pand, por, pxor, padd, psub and some shifts, which may be executed by either the
>>FADD or the
>
>You still should run my move generator at the A64. Please do that.
>
>You really should approach it from the other side than you do now.
>
>Then count how many nanoseconds get spent per move in my publicly posted move
>generator/activity functions.
>
>From that you can calculate how few SSE2 instructions you are allowed to
>execute.
>
>Then you'll fully understand that it is completely useless to go for SSE2.
>
>>FMUL unit. I do have some practical experience with KoggeStone and MMX (Athlon)
>>and have serious expectations for mixing SSE2 with general purpose registers.
>>Of course i may have some difficulties with MSC for AMD64 mixing C-statements
>>with SSE2-intrinsics - but i will try and see (maybe a tiny KoggeStone source
>>code generator?). I will do a pure C-version too. I'm no dogmatist but a
>>pragmatist. If SSE2 doesn't pay off, a bummer - a nice try and fun.
>>But i don't really believe it - dreaming of one fill iteration in up to eight
>>directions for two disjoint pieces in about 2 cycles (gp/SSE2), huuuh...
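
(For reference, a minimal sketch of the fill being discussed: the plain 64-bit C
version of a Kogge-Stone occluded fill in one direction (south). The SSE2 idea
is to run two such 64-bit fills side by side in one 128-bit register with
pand/por and the packed shifts; this sketch shows only the scalar form:)

  typedef unsigned long long U64;

  /* Kogge-Stone occluded fill towards the south: three steps, each one
     doubling the ray length (1, 2, 4 squares). Sliders stop at occupied
     squares because 'empty' is thinned out in parallel. */
  U64 south_occluded_fill(U64 sliders, U64 empty)
  {
      sliders |= empty & (sliders >>  8);
      empty   &=          empty   >>  8;
      sliders |= empty & (sliders >> 16);
      empty   &=          empty   >> 16;
      sliders |= empty & (sliders >> 32);
      return sliders;
  }
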
>
>>That's me ;-)
>
>Yeah, you get confused by the seeming possibilities there are. Only when you
>understand the window that you have, namely how few SSE2 instructions you are
>allowed to use before you are over that number of nanoseconds, will you realize
>that it's not worth wasting time on SSE2 for computerchess.
>
>>Another way to profit from these resources may be to ignore the whole SSE2
>>integer stuff, which requires intrinsics or assembler, and to do some floating
>>point in your eval, which is SSE/SSE2 - pure C only.
>
>I do have a few divisions in DIEP's EVAL at certain collection points. I plan
>to rewrite those after the world championship as shifts by powers of 2. That
>requires rewriting the bonuses too.
>
>That will save me a couple of hundred cycles on the opteron in the future.
>
>Out of the 50000 or so that 1 evaluation costs :)
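
(An illustration of the rewrite Vincent describes, with made-up numbers - the
divisor 12 and the function names below are not from DIEP:)

  /* Before: an accumulated positional term is scaled down by a division. */
  int term_old(int accumulated)
  {
      return accumulated / 12;           /* integer division is relatively slow */
  }

  /* After: rescale the bonuses feeding 'accumulated' by 16/12, so the divisor
     becomes a power of two and the division turns into a shift.
     Note: '>>' rounds toward minus infinity for negative scores, unlike '/'. */
  int term_new(int accumulated_rescaled)
  {
      return accumulated_rescaled >> 4;  /* divide by 16 */
  }
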
>
>Forget SSE2. It's just too slow.
>
>They will clock the a64 higher and higher and move to 0.09 micron as soon as
>possible. The core itself won't get faster; only the clock speed will go up a
>lot.
>
>Opteron will profit way more from 0.09 than P4 will.
>
>P4 scales perhaps from 3.4Ghz to 3.8Ghz
>
>Opteron will go from 2.2Ghz to 3Ghz *directly*.
>
>This while right now they are making incredible efforts to reach the 2.4, 2.6
>and perhaps even 2.8Ghz borders in 0.13 micron.
>
>So moving to 0.09 is a zillion times more important than getting the SSE2
>instructions faster. They are already faster than the P4 ones.
>
>It's just that the P4 is clocked so incredibly much higher. Like 50% or so.
>
>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
>>25112 Rev. 3.02 July 2003
>>---------------------------------------------------------------------------
>>Chapter 9 Optimizing with SIMD Instructions (Pg. 195)
>>
>>The 64-bit and 128-bit SIMD instructions—SSE and SSE2 instructions—should be
>>used to encode floating-point and integer operation.
>>
>>...
>>
>>• Future processors with more or wider multipliers and adders will achieve
>>better throughput using SSE and SSE2 instructions. (Today’s processors implement
>>a 128-bit-wide SSE or SSE2 operation as two 64-bit operations that are
>>internally pipelined.)
>>---------------------------------------------------------------------------
>>
>>
>>>
>>>So for chess the only interesting thing is the integer speed. Integers can be 8
>>>bits, 16 bits, 32 bits and nowadays, on the opteron, also 64 bits.
>>>
>>>A lot of 'commercial' chess programs, as well as what officially are called
>>>'strong amateur programs' (which doesn't necessarily say a word about which of
>>>the 2 is stronger), mix 8 bits code with 32 bits code a lot.
>>>
>>>I found out on the opteron (but i could have been misguided of course) that
>>>this is NOT a good idea.
>>>
>>>On the K7 8 bits code is very fast, on the P4 it's already not so interesting
>>>to use, but on the opteron it's seemingly a lot slower.
>>>
>>>DIEP doesn't have that problem. DIEP is 32 bits *all the way*.
>>>
>>
>>
>>That's indeed a very good thing, that Opteron fully supports 32-bit ints without
>
>More important is that most commercial software tested on the opteron now still
>has the crippling 8 bits code and all kinds of code mixing.
>
>Of course whether it gets compiled in 64 bits mode with 16 registers or in 32
>bits mode with 8 registers matters a lot for speed (8 extra registers would be
>really cool).
>
>Yet the most important thing, trivially, is not to suffer from huge penalties,
>which is exactly what it doesn't suffer from.
>
>>further penalties. Directly via mov EAX...EDI, which zero-extends into the
>>corresponding RAX...RDI 64-bit registers.
>>And there is still movzx/movsx for R8...R15.
>>
>>>With exception of my nodescount, you won't find much 64 bits code YET in DIEP.
>>>
>>>Therefore it is ideal to run on hardware like the opteron/amd64.
>>>
>>>The speed of the AMD64 is very convincing. I need to add one big note, and that
>>>is that the latest P4s do a lot better than older P4s. I do not know yet what
>>>they modified in the cores, but they do a lot better than they used to.
>>>
>>>At aceshardware.com you can see the results. Note that for the P4 the SMP
>>>version was used, not a NUMA version. Also different versions were used to get
>>>faster on the P4 and dual P4 Xeon.
>>>
>>>I managed to improve diep a lot to run faster on the dual P4 Xeon 3.06Ghz and it
>>>manages with 4 'threads' a speed of 227k nps.
>>>
>>>If we consider that a single cpu AMD64 2.4Ghz already gets 149k nps with the
>>>same executable, then i don't need to comment much more.
>>>
>>>If we compare that with 'old' P4 3.06Ghz which gets 89k nps with this version
>>>and the athlon 2.127Ghz MP2600 which gets single cpu 95k nps, then it is
>>>needless to say that the AMD64 is a big winner.
>>>
>>>It's 50% faster than my K7, which is the highest clocked MP version (MP2800
>>>isn't clocked higher).
>>>
>>>For more details just look at aceshardware.com. My own impression is that what
>>>was improved in the AMD64 is especially the branch prediction. It hardly seems
>>>to suffer from branch mispredictions. That's really amazing.
>>>
>>>Really new it isn't, but they got it to work great in the AMD64. This, in
>>>combination with a larger branch prediction table and all kinds of other
>>>advantages, is really great.
>>
>>Yes, but hopefully avoiding branches still pays off ;-)
>
>All avoidable branches in diep are gone. See my move generator's data structure.
>It's superior to anything that you can come up with in ansi-C.
>
>>Cheers,
>>Gerd
>>>Next posting: GCC at the quad opteron


