Author: Matt Taylor
Date: 20:50:10 12/31/02
On December 31, 2002 at 21:52:46, Vincent Diepeveen wrote:

>On December 31, 2002 at 17:22:42, Rick Terry wrote:
>
>>A friend of mine who works at USA Computers seems convinced that The Pentium 4
>>512 FSB outperforms the Athlon in Every Bench Mark. I didn't know enough to
>>argue with him, repeating only what I heard here about AMD Processors being
>>Superior to the Pentium in Running Chess Programs, But Why?, Why would the
>>Pentium Perform better then the Athlon in all other Benchmarks but Chess?
>
>A number of reasons but the most important comes down to next:
>
>Chessprograms are made by very good programmers and they have optimized them
>so well that even the most complex commercial chessprograms basically are
>depending upon processor speed whereas those applications of your friend
>have to do more with Level2 cache.
>
>The good chessprogrammers managed to optimize them so far that they
>are less relying upon L2 cache speed.
>
>It is of course very bad that some applications depend upon L2 cache speed.
>They should have used better programmers for it!

Many applications don't have a choice. Most benchmarks are also less memory-dependent. The key is that the Pentium 4 and Athlon can both run 3 ipc under ideal circumstances, and when memory is taken out of the picture, the Pentium 4 runs more cycles per second and therefore can execute more code in the same amount of time.

>The P4 is having very excellent L2 cache whereas the processor in itself
>is a very bad piece of work compared to the much older K7 processor.
>
>If you look simply what a very complex program can execute a clock cycle
>on a P4 versus a K7 then the K7 is having a huge number of resources
>and it is more than amazing that a newer generation processor (the P4)
>is not even being capable of outperforming it at all when talking
>about CPU itself.

You mean the P4 has a lower ipc, but I thought we discussed this already. It's designed to ramp to higher speeds than Athlon will ever dream of.
As for the nonsense about the L2 cache, the two are nearly identical.

>That SMT/HT when it gets made for the AMD processor will of course let the
>K7 processor profit *way* more than the P4 profits from it (after they
>add some registers of course).

Possibly. Intel and AMD processors both already have about 40 internal GPRs which the standard 8 x86 GPRs are mapped onto. In HT, you would double the need to 16 GPRs without needing to increase the total number of internal registers. I think x86-64 will be much more HT-friendly, mostly because of the additional registers available to the programmer. The additional registers reduce register contention and allow for register-passing schemes.

Also, the fact that AMD has better decoders than the Pentium 4 will weigh in their favor, too. (Read: Pentium 4 can issue 3 u-ops/cycle and decode 1 u-op/cycle. Athlon can issue 6 u-ops/cycle and decode 3 u-ops/cycle.)

>Intel has however a lot more experience with multiprocessing than AMD,
>so we will see how this develops in the future.
>
>Short look at opteron
>  - huge L1 cache (sorry that i do not know from head how big exactly,
>    but k7 is already like 128KB L1 cache versus the P4 has 8KB)
>  - 16384 BTB/BPT entries versus the P3 had just 512 (K7=2048). I didn't
>    even lookup how many P4 has because it obviously is very needed for
>    P4 to have a lot more, but i bet it doesn't have 16384 entries.
>  - locking the processor seems a lot easier (for my own software)
>    and costs if i continues lock at 2 processors at the same cache line
>    like 0.3 seconds at 5 minutes. I get impression that current DIEP
>    versions perform a lot better at the P4 after i deliberately did
>    a lot of effort to lock less in DIEP. In fact it locks in such a way
>    that other processors can run on without *ever* getting hurted. It
>    can split and search and unlock without other processors seeing it
>    even.
>    Yet the average parallel software is not so well written.
>    Then K7 wins bigtime if it is about cpu speed and locking that cpu.
>
>  On the other hand because of its good level caches (no complaints here
>  about P4) the effective bandwidth should be higher on P4. Yet P4 is
>  an entire new design and K7 is already pretty old by now. So if we
>  take a look to the new design of AMD where the memory controller is on
>  the cpu then that will outgun of course the P4 by a large margin.
>
>The positive news is that lately the P4s are improving a lot, yet the first
>few benchmarks of the new AMD processors are so very impressive that intel
>needs an entire new CPU to compete with that with regard to non chess
>programs. If you consider then also that the current K7 is clocked
>to 2.x Ghz already and that the opteron is 12 stages or something,
>then i am sure the new line of AMD processors can easily get
>clocked to 3.0Ghz too.

The sad truth is that the only thing that has changed about the P4 since its original release is the cache, more specifically its size. The only things that have changed about Athlon since its original release are cache-related (and the addition of SSE instructions).

>In general Intel is a lot faster in releasing its processors than AMD.
>
>The performance for each clockcycle is a lot higher than the P4 ever
>will get, so intel has a big job to release a new processor which combines
>the positive things of the P4 with a big processor speed.

The difference isn't "a lot higher." The P4 runs lower ipc, but it also runs at a clock speed almost 50% faster.

>The sad thing here is that most benchmarks care shit for the actual
>processor execution speed for complex software but are basically needing
>bigger bandwidth and lower latencies for memory accesses.

Actually, if memory latency was the issue, Athlon and Pentium 4 would be neck & neck and chipsets/ram timings would make a difference. While these things do usually affect the results, they do not dramatically change the results of a good CPU benchmark.
The Pentium 4 will run lower ipc but higher frequencies. If it's clocked 50% faster than Athlon, Intel can get away with 2/3 the ipc. It can afford to run at about 1.33 ipc if Athlon only gets 2 ipc.

>That is real sad for computerchess, because you measure more
>how well the caches are than.

More like main memory.

>AMD definitely is not a hair better here than intel. If we look to the new
>opteron then we see it still can execute only 3 instructions a clock.
>
>A small look back in history learns that the pentiumpro already could do that
>and that the P4 and new AMD processors still will do 3 instructions a clock.
>
>This is real sad.

And what do you propose be done to fix it? The fact is that I write IA-32 assembly on a daily basis, and there are few scenarios where I can avoid data dependency by more than 3 cycles. A lot of code won't even get 3 ipc due to data dependency. Many instructions are time-consuming and block the instruction stream for a period of time, too. More registers will help alleviate some of these problems, but I can't see average ipc ever beating ~2.3 or so.

>Computerchess will basically profit most from if they go from 3 to like
>6 instructions a clock.
>
>In that respect the new supercomputer CPU's which will get released in
>the coming few years by different manufacturers (most notably intel)
>will kick major butt in this respect.
>
>Where a single supercomputer CPU now is clocked at around 1 to 1.3Ghz
>versus K7 at 2.x and P4 already at 2.8-3.06ghz,
>soon that huge difference will get smaller and smaller and the delay
>to release them will be a lot closer to the x86 release.

Only after Gateway and Dell start shipping Itaniums in consumer PCs. I doubt that will ever happen. Torvalds said with regard to the x86-64 arch. that it is rare for a server-class chip to come to the desktop -- in other words, he doesn't think that Itanium will be put in consumer PCs.
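
To put rough numbers on the clock-versus-ipc trade-off from earlier in this reply, here is a minimal sketch. It assumes throughput is simply ipc times clock rate, with memory effects ignored; the function names and the 2.0 GHz / 3.0 GHz figures are made up for illustration, not measurements.

```python
def throughput(ipc, clock_ghz):
    """Billions of instructions per second, assuming throughput = ipc * clock."""
    return ipc * clock_ghz


def break_even_ipc(rival_ipc, clock_ratio):
    """ipc a chip needs to match a rival whose clock is clock_ratio times slower."""
    return rival_ipc / clock_ratio


athlon_ipc = 2.0     # assumed sustained ipc for Athlon (the figure used in the post)
clock_ratio = 1.5    # assumed: Pentium 4 clocked 50% faster than Athlon

p4_needed = break_even_ipc(athlon_ipc, clock_ratio)
print(round(p4_needed, 2))   # ipc at which the two chips tie

# Illustrative clocks only: 2.0 GHz vs 3.0 GHz is exactly a 1.5x ratio.
print(throughput(athlon_ipc, 2.0), throughput(p4_needed, 3.0))
```

Under these assumptions the faster-clocked chip ties at two thirds of the rival's ipc, and anything above that is a win for it.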
>Then again the supercomputer cpu's will blow away single cpu any x86
>cpu by a large margin, simply because they have a huge L2 or L3 cache
>and can execute a huge amount of instructions a clock effectively.
>
>Way more than any x86 processor can and i fear will do in the near future.

Way more than any x86 processor will ever do. Transmeta Crusoe executes 4 ops/cycle, and Astro will do 8 ops/cycle, but that's only after translation of x86 to a different instruction set.

The one thing that I think would help a lot would be a three-operand form of the instruction set. I find myself often writing sequences like the following:

    // do something with eax
    add eax, 12345
    mov ecx, eax
    shl ecx, 8
    add eax, ecx

I guess processors could special-case the mov instruction, but I don't think they currently do. As a result, this sequence takes an additional cycle. When you're talking about sequences that take 10-20 cycles, squeezing out another cycle is a 5-10% performance gain. Longer sequences stand to gain a cycle in more than one place.

>In that respect the development of x86 cpu's is not very positive for
>computer chess.
>
>Best regards,
>Vincent
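
For anyone who wants to see where the extra cycle in the add/mov/shl/add sequence comes from, here is a toy model, not a simulation of any real core. It assumes every instruction has a 1-cycle latency and issues as soon as its source registers are ready. The three-operand forms are hypothetical (x86 has no such encoding for these instructions), and the helper names are invented for this sketch.

```python
def critical_path(instrs):
    """Length in cycles of the longest register dependency chain.

    Each instruction is (dest, sources). Assumes 1-cycle latency and
    unlimited issue width, so only data dependencies matter.
    """
    ready = {}      # register -> cycle at which its value becomes available
    longest = 0
    for dest, sources in instrs:
        start = max((ready.get(reg, 0) for reg in sources), default=0)
        ready[dest] = start + 1
        longest = max(longest, ready[dest])
    return longest


def sequence_result(eax):
    """Value left in eax by the four-instruction sequence (32-bit wraparound)."""
    mask = 0xFFFFFFFF
    eax = (eax + 12345) & mask   # add eax, 12345
    ecx = eax                    # mov ecx, eax
    ecx = (ecx << 8) & mask      # shl ecx, 8
    return (eax + ecx) & mask    # add eax, ecx


two_operand = [
    ("eax", ["eax"]),          # add eax, 12345
    ("ecx", ["eax"]),          # mov ecx, eax
    ("ecx", ["ecx"]),          # shl ecx, 8
    ("eax", ["eax", "ecx"]),   # add eax, ecx
]

three_operand = [
    ("eax", ["eax"]),          # add eax, eax, 12345   (hypothetical form)
    ("ecx", ["eax"]),          # shl ecx, eax, 8       (hypothetical form)
    ("eax", ["eax", "ecx"]),   # add eax, eax, ecx     (hypothetical form)
]

print(critical_path(two_operand), critical_path(three_operand))
```

In this model the mov sits on the dependency chain, so folding it into the shift shortens the chain by one cycle while the result in eax stays the same.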
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.