Author: Gerd Isenberg
Date: 12:24:34 09/23/03
On September 23, 2003 at 12:14:15, Vincent Diepeveen wrote:

>Hello,
>
>Many of you are looking forward to the new cheap 64 bits area with AMD64 being
>the first 64 bits processor to get released. The economy is not booming bigtime.
>We can't complain too loud about economy in the western world, but manufacturers
>feel a big decline in sales when economy booms a little less than it used to do.
>
>Therefore there is a lot of interesting news on the hardware front. Even in 100
>pages i could not even describe everything that's interesting to me and new, so
>i'll just focus upon what is interesting for computerchess.
>
>Obvious is the movement in the highend. A few years ago there were big expensive
>supercomputer processors which outgunned any cheap PC processor bigtime.
>
>That has changed for what we call 'integer' software. Still the 'highend'
>processors do well for floating point, especially some vector processors, but
>for 'integers' which are non-broken numbers; 1 5 -4 etc it's all integer.
>
>But 0.01 or 5.05 -0.05 -0.0000000134 that's all called floating point.
>
>There are many testsets which measure processor strengths. That's all not so
>interesting. Not very interesting either is what we call SSE2/SSE. SSE is 128
>bits stuff.
>
>Like you can read at www.chip-architect.com a big technical analysis of the
>opteron/AMD64 you can see that those special instructions with a lot of bits,
>are very slow.
>
>A 2Ghz clocked opteron is delivering 2 million 'clocks' or 'cycles' a second.
>
>Trivially the more basic instructions we can execute, the better.
>
>We see a lot of discussions at the CCC regarding using SSE2/SSE for chess.
>

Hi Vincent,

i guess you address me. Yes, but not exclusively of course. I will give SSE2 a try for some very selected issues. Funny that you care about it ;-)

>However i must always laugh loud when i see that.
>Let's quote Hans de Vries
>regarding the new intel prescott:
>
>"This would bring back the SSE2 latencies for Add and Multiply to 5 and 7
>cycles"
>
>Note that at the current P4 it is 25% slower than that.
>
>A good chessprogram can however execute up to 3 'integer' instructions a cycle.
>
>So that's like 10 times faster on average than using SSE2, even despite that you
>can use it in theory 'simultaneously'.

I agree that SSE2 integer instructions are relatively slow compared to general purpose (gp) instructions. A few aspects:

First, SSE2 integer instructions are SIMD - single instruction, multiple data. That means one instruction operates on two 64-bit (u)ints, e.g. bitboards - or four 32-bit ints, eight 16-bit ints or 16 bytes. If you want to add some appropriate arrays (with saturation!), SSE2 may be a good idea.

Second, most SSE2 instructions are currently double direct path instructions, which take two macro-ops instead of one (as e.g. for MMX), but see below - there is some hope that they become faster in future hammers.

Third, there are even three Floating-Point Execution Units for X87|MMX|3DNow and SSE/SSE2: FADD, FMUL and FSTORE. In addition to the three gp instructions per cycle, the processor may partially execute some independent SSE2 instructions out of order - especially stores (FSTORE) and loads bypassing the first-level cache - don't forget PREFETCHNTA.

The logical/arithmetical SSE2 instructions i intend to use for Kogge-Stone are pand, por, pxor, padd, psub and some shifts, which may be executed by either the FADD or the FMUL unit.

I do have some practical experience with KoggeStone and MMX (Athlon) and have some serious expectations for mixing SSE2 with general purpose registers. Of course i may have some difficulties with MSC for AMD64 when mixing C statements with SSE2 intrinsics - but i will try and see (maybe a tiny KoggeStone source code generator?). I will do a pure C version too. I'm no dogmatist but a pragmatist. If SSE2 doesn't pay off, a bummer - a nice try and fun.
But i don't believe that - i'm dreaming of one fill iteration in up to eight directions for two disjoint pieces in about 2 cycles (gp/SSE2), huuuh... That's me ;-)

Another try to profit from these resources may be to ignore the whole SSE2 integer stuff, which requires intrinsics or assembler, and to do some floating point in your eval, which is SSE/SSE2 - pure C only.

Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
25112 Rev. 3.02 July 2003
---------------------------------------------------------------------------
Chapter 9 Optimizing with SIMD Instructions (Pg. 195)

The 64-bit and 128-bit SIMD instructions—SSE and SSE2 instructions—should be
used to encode floating-point and integer operation.
...
• Future processors with more or wider multipliers and adders will achieve
better throughput using SSE and SSE2 instructions. (Today’s processors implement
a 128-bit-wide SSE or SSE2 operation as two 64-bit operations that are
internally pipelined.)
---------------------------------------------------------------------------

>
>So for chess the only interesting thign is the integer speed. Integers can be 8
>bits, 16 bits, 32 bits and nowadays at opteron also 64 bits.
>
>A lot of 'commercial' chess programs as well as what officially is called
>'strong amateur programs', doesn't necessary say a word on which of the 2 is
>stronger, mix 8 bits code with 32 bits code a lot.
>
>I found out at the opteron (but i could have been misguided of course) that this
>is NOT a good idea to do.
>
>At K7 8 bits code is very fast, at P4 it's already not so interesting to use,
>but at opteron it's seemingly a lot slower.
>
>DIEP doesn't have that problem. DIEP is 32 bits *all the way*.
>

That's indeed a very good thing, that Opteron fully supports 32-bit ints without further penalties. Directly via mov to EAX...EDI, which zero-extends the result into the upper 32 bits of the corresponding RAX...RDI 64-bit registers instead of preserving them. And there is still movzx/movsx for R8...R15.
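
As a rough illustration of the Kogge-Stone fill mentioned above, here is a minimal pure-C sketch using general purpose registers only (no SSE2). The names Bitboard, fill_north and fill_east, as well as the square mapping (bit 0 = a1, bit 7 = h1, bit 63 = h8), are assumptions made for this example and are not taken from any program discussed in this thread.

```c
#include <stdint.h>

typedef uint64_t Bitboard;  /* assumed mapping: bit 0 = a1, bit 63 = h8 */

/* All squares except the a-file; stops east shifts from wrapping h -> a. */
#define NOT_A_FILE 0xfefefefefefefefeULL

/* Kogge-Stone occluded fill to the north.
 * gen: the sliding pieces (generators)
 * pro: the empty squares (propagators)
 * Returns the sliders plus every empty square they reach going north.
 * Three parallel-prefix steps (shift 8, 16, 32) cover all seven ranks. */
Bitboard fill_north(Bitboard gen, Bitboard pro)
{
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* Same idea to the east (shift 1, 2, 4), masking out the a-file first
 * so that nothing propagates from the h-file into the next rank. */
Bitboard fill_east(Bitboard gen, Bitboard pro)
{
    pro &= NOT_A_FILE;
    gen |= pro & (gen << 1);
    pro &= pro << 1;
    gen |= pro & (gen << 2);
    pro &= pro << 2;
    gen |= pro & (gen << 4);
    return gen;
}
```

To get attack sets from a fill, shift the result one more step in the same direction, e.g. `fill_north(rooks, empty) << 8`, which then also includes the first blocker. The SSE2 idea from the text would be to run the same steps on two bitboards packed into one 128-bit register; that variant is not shown here.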
>With exception of my nodescount, you won't find much 64 bits code YET in DIEP.
>
>Therefore it is ideal to run on hardware like the opteron/amd64.
>
>The speed of the AMD64 is very convincing. I need to add one big note and that's
>that the latest P4s do a lot better than older P4s. I do not know yet what they
>modified at the cores, but it's doing a lot better than it used to do.
>
>At aceshardware.com you can see the results. Note that for the P4 SMP version
>was used, not a NUMA version. Also different versions were used to get faster on
>the P4 and dual P4 Xeon.
>
>I managed to improve diep a lot to run faster on the dual P4 Xeon 3.06Ghz and it
>manages with 4 'threads' a speed of 227k nps.
>
>If we consider that a single cpu AMD64 2.4Ghz already gets 149k nps with the
>same executable, then i don't need to comment much more.
>
>If we compare that with 'old' P4 3.06Ghz which gets 89k nps with this version
>and the athlon 2.127Ghz MP2600 which gets single cpu 95k nps, then it is
>needless to say that the AMD64 is a big winner.
>
>It's 50% faster than my K7, which is the highest clocked MP version (MP2800
>isn't clocked higher).
>
>For more details just look at aceshardware.com, my own impression of what was
>improved at the AMD64 is especially the branch prediction. As if it hardly
>suffers from branchmispredictions. That's really amazing.
>
>Real new it isn't, but they got it to work great at the AMD64. This in
>combination with a larger branch prediction table and all kind of other
>advantages is real great.

Yes, but hopefully avoiding branches still pays off ;-)

Cheers,
Gerd

>
>Next posting: GCC at the quad opteron
Last modified: Thu, 15 Apr 21 08:11:13 -0700