Author: Vincent Diepeveen
Date: 15:52:52 06/17/04
On June 17, 2004 at 15:17:23, Anthony Cozzie wrote:

>On June 17, 2004 at 13:34:33, Eugene Nalimov wrote:
>
>>On June 17, 2004 at 13:29:02, Anthony Cozzie wrote:
>>
>>>On June 17, 2004 at 13:20:40, Eugene Nalimov wrote:
>>>
>>>>On June 17, 2004 at 06:55:18, Vincent Diepeveen wrote:
>>>>
>>>>>[...]
>>>>>
>>>>>Please list the processors in order of L2 cache speed and you'll realize that
>>>>>speed still is of overwhelming importance. List them at random access speed for
>>>>>L2 cache (some processors are faster in streaming than random access in their
>>>>>caches like P4).
>>>>>
>>>>>Basically opteron has fastest L2 cache which can deliver each 13 cycles data (4
>>>>>reads simultaneously even if i understand well). No other processor can deliver
>>>>>data from L2 cache that fast.
>>>>
>>>>Intel Itanium 2 Processor Reference Manual For Software Development and
>>>>Optimization, Table 6-4 "Cache Summary":
>>>>
>>>>Itanium2 cache latency:
>>>>  L1: 1 cycle, 4 loads/cycle
>>>>  L2: 5 cycles (integer loads), 4 loads/cycle
>>>>  L3: 12/14 cycles, depending on cache size (integer loads), 1 load/cycle
>>>>
>>>>Thanks,
>>>>Eugene
>>>>
>>>
>>>Correct me if I am wrong, but aren't Itanium's caches off by 1? In other words,
>>>the 6MB cache on the Itanium is L3, and the L1 cache is like 1KB?
>>
>>L1D: 16KB
>>L1I: 16KB
>>L2: 256KB
>>L3: 1.5/3/6MB
>
>That's not as bad as I thought. But it makes me wonder even more why Itanium
>isn't clocked higher. With a VLIW core, they should save like 40% of their die
>due to not having an issue queue & parallel logic etc. When combined with a
>small L1, they should really be clocked at pentium 4 speeds, yet they are much
>slower than Opteron.
>
>anthony

It is very bad. Itanium only looks good on paper. How can the L3 ever keep the L1 cache filled? It is 17+ cycles for random access (see www.sara.nl, presentation by Jason Priestly, strategic marketing manager at Intel, given on 1 July 2003 in Amsterdam).
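The random-access figure is the kind of number you can check yourself with a dependent-load ("pointer chasing") loop, where every load address comes from the previous load so nothing can be overlapped or prefetched. Below is a minimal sketch in C, not a calibrated benchmark. The function names (build_random_cycle, chase, ns_per_load), the fixed seed and the use of clock() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with one random cycle over 0..n-1 (Sattolo's algorithm),
   so that following next[] visits every slot exactly once per lap.
   rand() is a weak generator, which is good enough for a sketch. */
void build_random_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1], never i itself */
        size_t t = next[i];
        next[i] = next[j];
        next[j] = t;
    }
}

/* Follow the chain for `steps` loads. Each load depends on the result
   of the previous one, so the loop runs at load-to-use latency. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average nanoseconds per dependent load over a working set of n slots.
   Returns a negative value if the allocation fails. */
double ns_per_load(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_random_cycle(next, n, 12345u);

    clock_t t0 = clock();
    volatile size_t sink = chase(next, 0, steps);  /* keep the loop alive */
    clock_t t1 = clock();
    (void)sink;

    free(next);
    return 1e9 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC / (double)steps;
}
```

Call ns_per_load with n chosen so that n * sizeof(size_t) fits in the L1, then the L2, then the L3, and multiply the result by the clock speed in GHz to get cycles per load. Note that this walks individual slots, not whole cache lines, so neighbouring slots still share a line; spacing the slots a cache line apart would give a stricter worst case.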
Intel never quotes worst-case performance in its documentation, which is what you get when you access their cache levels randomly. They always quote just the best case. Chess is not streaming software, so we must take the worst case into consideration.

Please also take into account that instructions on IPF are way longer and that you need way more of them, because the instruction set is, to put it politely, 'limited' compared to x86. So you sometimes need 6-10 instructions to do what 1 instruction on x86-64 does. For division it is of course a lot more than that. For BSF you need something like 6 instructions (I forgot the exact number, I'm sure Nalimov remembers it). For ROR I do not know how many you need. I guess a lot more.

A major problem resulting from all that is that many bundles are just filled with NOPs.

The difference between using 24 hours of PGO and not using it is *major*. After 24 hours of PGO, the fastest compiler on the Itanium platform performs very well for DIEP compared to gcc on Itanium 2. However, even with the best possible compiler and PGO, the Itanium 2 is just a K7 at 2 GHz for DIEP. Reduce that by 5% even nowadays, as it is unlikely that icc8 is faster than icc7.1.

So when that happens you go back to the specifications and you start finding the holes in the processor. They are *huge*. As a DSP processor, IBM's Power series is a lot cheaper nowadays.
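To make the BSF and ROR remark concrete: on a target without those as single instructions, they have to be composed from shifts, masks and a population count. The sketch below shows one way to do that in portable C. It is an illustration only, not what any particular Itanium compiler emits, and the function names (ror64, popcount64, bsf64) are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right built from two shifts and an OR. The count is reduced
   mod 64, and the second mask avoids an undefined shift by 64. */
uint64_t ror64(uint64_t x, unsigned r)
{
    r &= 63u;
    return (x >> r) | (x << ((64u - r) & 63u));
}

/* Population count by the classic parallel-add method. */
unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Bit scan forward: index of the lowest set bit of a nonzero x.
   ~x & (x - 1) is a mask of the zero bits below the lowest set bit,
   and counting those bits gives the index. */
unsigned bsf64(uint64_t x)
{
    assert(x != 0);
    return popcount64(~x & (x - 1));
}
```

In a bitboard engine bsf64 would be called in the inner loop of move generation (take the lowest set bit, then clear it with x &= x - 1), which is why the instruction count for this one operation matters so much.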
Last modified: Thu, 15 Apr 21 08:11:13 -0700