Author: Robert Hyatt
Date: 13:11:24 09/23/03
On September 23, 2003 at 15:57:04, Vincent Diepeveen wrote:

>On September 23, 2003 at 15:24:34, Gerd Isenberg wrote:
>
>>On September 23, 2003 at 12:14:15, Vincent Diepeveen wrote:
>>
>>>Hello,
>>>
>>>Many of you are looking forward to the new cheap 64 bits era, with AMD64 being the first 64 bits processor to get released. The economy is not booming big time. We can't complain too loudly about the economy in the western world, but manufacturers feel a big decline in sales when the economy booms a little less than it used to.
>>>
>>>Therefore there is a lot of interesting news on the hardware front. Even in 100 pages I could not describe everything that's new and interesting to me, so I'll just focus on what is interesting for computer chess.
>>>
>>>The obvious movement is at the high end. A few years ago there were big expensive supercomputer processors which outgunned any cheap PC processor big time.
>>>
>>>That has changed for what we call 'integer' software. The 'high end' processors still do well for floating point, especially some vector processors, but not for 'integers', which are whole numbers: 1, 5, -4 and so on are all integers.
>>>
>>>Numbers like 0.01, 5.05, -0.05 or -0.0000000134, on the other hand, are all called floating point.
>>>
>>>There are many test sets which measure processor strength. That's all not so interesting. Not very interesting either is what we call SSE2/SSE. SSE is 128 bits stuff.
>>>
>>>As you can read at www.chip-architect.com, in a big technical analysis of the Opteron/AMD64, those special instructions with a lot of bits are very slow.
>>>
>>>A 2GHz clocked Opteron delivers 2 billion 'clocks' or 'cycles' a second.
>>>
>>>Trivially, the more basic instructions we can execute, the better.
>>>
>>>We see a lot of discussions at the CCC regarding using SSE2/SSE for chess.
>>>
>>
>>Hi Vincent,
>>
>>I guess you are addressing me.
>>Yes, but not exclusively of course.
>>I will give SSE2 a try for some very selected issues.
>>Funny that you care about it ;-)
>
>Not only you; you would be amazed if you knew how many people ask about SSE.
>
>Intel has for years been shouting it from the rooftops, just like that SMT/HT, which seems improved in the P4 EE; at least DIEP profits 25% from it nowadays on a single-cpu P4 EE 3.4GHz. Of course that involved a lot of effort from my side too.
>
>Yet for SSE that is not the case. Too slow for computer chess. A factor 10 slower than normal register operations is a fair comparison.
>
>That you can very occasionally do one 'for free' in between is no excuse.
>
>>>However I must always laugh out loud when I see that. Let's quote Hans de Vries regarding the new Intel Prescott:
>>>
>>>"This would bring back the SSE2 latencies for Add and Multiply to 5 and 7 cycles"
>>>
>>>Note that on the current P4 it is 25% slower than that.
>>>
>>>A good chess program can, however, execute up to 3 'integer' instructions per cycle.
>>>
>>>So that's like 10 times faster on average than using SSE2, even though in theory you can use it 'simultaneously'.
>>
>>
>>I agree that SSE2 integer instructions are relatively slow compared to general purpose (gp) instructions.
>>A few aspects:
>
>>First, SSE2 integer instructions are SIMD - single instruction, multiple data. That means two 64-bit (u)ints, e.g. bitboards - or four 32-bit ints, eight 16-bit ints, or 16 bytes. If you want to add some appropriate arrays (with saturation!), SSE2 may be a good idea.
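A minimal sketch of the saturated array addition Gerd has in mind (illustrative only - the function and parameter names are invented, and it assumes n is a multiple of eight and both arrays are 16-byte aligned):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Saturated add of two arrays of 16-bit scores, eight at a time (paddsw).
   'dst', 'src' and 'n' are made-up names for the sketch. */
void add_scores_saturated(short *dst, const short *src, int n)
{
    int i;
    for (i = 0; i < n; i += 8) {
        __m128i a = _mm_load_si128((const __m128i *)(dst + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_adds_epi16(a, b));
    }
}

This is the kind of bulk, regular work where the SIMD width actually pays for the instruction latency; the argument in the rest of the thread is about whether chess evaluation ever looks like that.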
>In chess you don't know in advance whether you are going to add that stuff. If you knew it, then you could already have programmed it incrementally, which goes faster.
>
>A good example is Crafty's primitive mobility. Why not do that incrementally?
>
>It goes *at least* 2 times faster.

Horsehocky. Exactly how long does it take to read two bytes of memory to get the mobility for two diagonals? Also don't forget lazy eval. I don't do the mobility stuff in 90% of the positions scored.

>>Second, most SSE2 instructions are currently double direct path instructions, which take two macro-ops instead of one (e.g. for MMX), but see below - there is some hope that they become faster in future Hammers.
>
>It also means you can't continuously execute SSE2 instructions.
>
>A penalty of say 7 cycles or something for 1 such instruction is like a death penalty.
>
>>Third, there are even three floating-point execution units, for x87|MMX|3DNow! and SSE/SSE2: FADD, FMUL and FSTORE. In addition to the three gp instructions per cycle, the processor may partially execute some independent SSE2 instructions out of order - especially stores (FSTORE) and loads bypassing the level-1 cache - don't forget PREFETCHNTA.
>
>I do understand why you touch on the FPU subject; however, chess and floating point do not have much to do with each other. In fact it is possible to write any chess program without a single floating point instruction.
>
>You know that and I know that.
>
>For some math software it's interesting though.
>
>>The logical/arithmetical SSE2 instructions I intend to use for Kogge-Stone are pand, por, pxor, padd, psub and some shifts, which may be executed either by FADD or by the
>
>You still should run my move generator on the A64. Please do that.
>
>You really should approach it from the other side than you do now.
>
>Then count how many nanoseconds a move gets spent in my publicly posted move generator/activity functions.
>
>From that you can calculate how few SSE2 instructions you are allowed to execute.
>
>Then you'll fully understand that it is completely useless to go for SSE2.
>
>>FMUL unit. I do have some practical experience with Kogge-Stone and MMX (Athlon) and have high expectations for SSE2 mixed with general purpose registers. Of course I may have some difficulties with MSC for AMD64 mixing C statements with SSE2 intrinsics - but I will try and see (maybe a tiny Kogge-Stone source code generator?). I will do a pure C version too. I'm no dogmatist but a pragmatist. If SSE2 doesn't pay off, a bummer - a nice try and fun. But I don't believe that - dreaming of one fill iteration in up to eight directions for two disjoint pieces in about 2 cycles (gp/SSE2), huuuh...
>
>>That's me ;-)
>
>Yeah, you get confused by the seeming possibilities. Only when you understand the window that you have, namely how few SSE2 instructions you are allowed to use before you are over that number of nanoseconds, only then will you realize that it's not worth wasting time on SSE2 for computer chess.
>
>>Another way to profit from these resources may be to ignore the whole SSE2 integer stuff, which requires intrinsics or assembler, and to do some floating point in your eval, which is SSE/SSE2 - only pure C.
>
>I do have a few divisions in DIEP's EVAL at certain collection points. I plan to rewrite those after the world championship as shifts by powers of 2. That requires rewriting the bonuses too.
>
>That will save me a couple of hundred cycles in the future on the Opteron.
>
>Out of the 50000 or so that 1 evaluation costs :)
>
>Forget SSE2. It's just too slow.
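For reference, a minimal plain-C sketch of the kind of Kogge-Stone fill Gerd mentions above, here for the north direction only on a single 64-bit bitboard, using ordinary general purpose registers (the function and variable names are illustrative, not taken from any program in this thread):

#include <stdint.h>

/* Kogge-Stone occluded fill to the north: three shift/and/or rounds
   propagate the sliders upward through empty squares. */
uint64_t north_occluded_fill(uint64_t sliders, uint64_t empty)
{
    sliders |= empty & (sliders << 8);
    empty   &= empty << 8;
    sliders |= empty & (sliders << 16);
    empty   &= empty << 16;
    sliders |= empty & (sliders << 32);
    return sliders;
}

/* The squares attacked to the north are then one more shift away. */
uint64_t north_attacks(uint64_t sliders, uint64_t empty)
{
    return north_occluded_fill(sliders, empty) << 8;
}

The SSE2 idea Gerd describes would run two such 64-bit fills side by side in the halves of one 128-bit register (pand/por/psllq); Vincent's objection is that the per-instruction latency of those packed operations eats the whole time budget a move generator has.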
>They will clock the A64 higher and higher and move to 0.09 as soon as possible. It won't get faster per clock; only the clock speed will increase a lot.
>
>The Opteron will profit way more from 0.09 than the P4 will.
>
>The P4 scales perhaps from 3.4GHz to 3.8GHz.
>
>The Opteron will go from 2.2GHz to 3GHz *directly*.
>
>This while they are now making incredible efforts to reach the 2.4, 2.6 and perhaps even 2.8GHz borders in 0.13.
>
>So moving to 0.09 is a factor zillion more important than making the SSE2 instructions faster. They are already faster than the P4 ones.
>
>It's just that the P4 is so incredibly much higher clocked. Like 50% or so.
>
>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
>>25112 Rev. 3.02 July 2003
>>---------------------------------------------------------------------------
>>Chapter 9 Optimizing with SIMD Instructions (Pg. 195)
>>
>>The 64-bit and 128-bit SIMD instructions—SSE and SSE2 instructions—should be used to encode floating-point and integer operations.
>>
>>...
>>
>>• Future processors with more or wider multipliers and adders will achieve better throughput using SSE and SSE2 instructions. (Today's processors implement a 128-bit-wide SSE or SSE2 operation as two 64-bit operations that are internally pipelined.)
>>---------------------------------------------------------------------------
>>
>>
>>>
>>>So for chess the only interesting thing is the integer speed. Integers can be 8 bits, 16 bits, 32 bits and nowadays, on the Opteron, also 64 bits.
>>>
>>>A lot of 'commercial' chess programs, as well as what are officially called 'strong amateur programs' (which doesn't necessarily say a word about which of the two is stronger), mix 8 bits code with 32 bits code a lot.
>>>
>>>I found out on the Opteron (but I could have been misguided of course) that this is NOT a good idea.
>>>
>>>On the K7 8 bits code is very fast, on the P4 it's already not so interesting to use, but on the Opteron it's seemingly a lot slower.
>>>
>>>DIEP doesn't have that problem. DIEP is 32 bits *all the way*.
>>>
>>
>>
>>That's indeed a very good thing, that the Opteron fully supports 32-bit ints without
>
>More important is that most commercial software tested on the Opteron now still has the crippling 8 bits code and all kinds of code mixing.
>
>Of course whether it gets compiled in 64 bits mode with 16 registers or in 32 bits mode with 8 registers matters a lot for speed (8 extra registers would be really, really cool).
>
>Yet the most important thing, trivially, is not to suffer from huge penalties, which is exactly what it doesn't suffer from.
>
>>further penalties. Directly via mov EAX...EDI; a 32-bit write simply zero-extends into the corresponding RAX...RDI 64-bit register, so there is no merge penalty. And there is still movzx/movsx for R8...R15.
>>
>>>With the exception of my nodes count, you won't find much 64 bits code YET in DIEP.
>>>
>>>Therefore it is ideal to run on hardware like the Opteron/AMD64.
>>>
>>>The speed of the AMD64 is very convincing. I need to add one big note, and that's that the latest P4s do a lot better than older P4s. I do not know yet what they modified in the cores, but it's doing a lot better than it used to.
>>>
>>>At aceshardware.com you can see the results. Note that for the P4 an SMP version was used, not a NUMA version. Also different versions were used to get faster on the P4 and dual P4 Xeon.
>>>
>>>I managed to improve DIEP a lot to run faster on the dual P4 Xeon 3.06GHz, and with 4 'threads' it manages a speed of 227k nps.
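To make the 32-bit point above concrete, a hypothetical fragment (names invented, not taken from DIEP) of a program that is 32 bits "all the way" plus a single 64-bit node counter. On AMD64 the 32-bit arithmetic compiles to plain 32-bit register operations whose results are zero-extended for free, so mixing in the one 64-bit value costs nothing, unlike the 8-bit code mixing described above.

#include <stdint.h>

static uint64_t nodes;            /* the lone 64-bit value, like a node count */

int32_t bound_window(int32_t alpha, int32_t beta)
{
    nodes++;                                  /* genuine 64-bit increment */
    int32_t margin = (beta - alpha) / 2;      /* ordinary 32-bit arithmetic */
    return alpha + margin;
}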
>>>If we consider that a single cpu AMD64 2.4GHz already gets 149k nps with the same executable, then I don't need to comment much more.
>>>
>>>If we compare that with the 'old' P4 3.06GHz, which gets 89k nps with this version, and the Athlon MP2600 at 2.127GHz, which gets 95k nps single cpu, then it is needless to say that the AMD64 is a big winner.
>>>
>>>It's 50% faster than my K7, which is the highest clocked MP version (the MP2800 isn't clocked higher).
>>>
>>>For more details just look at aceshardware.com. My own impression of what was improved in the AMD64 is especially the branch prediction. It's as if it hardly suffers from branch mispredictions. That's really amazing.
>>>
>>>Really new it isn't, but they got it to work great in the AMD64. This in combination with a larger branch prediction table and all kinds of other advantages is really great.
>>
>>Yes, but hopefully avoiding branches still pays off ;-)
>
>All avoidable branches in diep are gone. See my move generator's data structure. It's superior to anything that you can come up with in ANSI C.
>
>>Cheers,
>>Gerd
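As a generic illustration of the branch avoidance Gerd and Vincent are trading remarks about (a sketch, not code from DIEP or Crafty): a compare-and-jump can often be replaced by a little arithmetic, leaving nothing for the branch predictor to mispredict. This variant assumes scores fit comfortably in 32 bits so the subtraction cannot overflow, and relies on the near-universal arithmetic right shift of negative ints.

#include <stdint.h>

/* Branchless maximum of two evaluation scores: if diff is negative,
   (diff >> 31) is all ones, so we subtract diff back out and get b;
   otherwise we subtract 0 and keep a. */
static inline int32_t max_score(int32_t a, int32_t b)
{
    int32_t diff = a - b;                 /* assumed not to overflow */
    return a - (diff & (diff >> 31));
}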