Author: Vincent Diepeveen
Date: 22:18:03 12/31/02
On December 31, 2002 at 23:50:10, Matt Taylor wrote:

>On December 31, 2002 at 21:52:46, Vincent Diepeveen wrote:
>
>>On December 31, 2002 at 17:22:42, Rick Terry wrote:
>>
>>>A friend of mine who works at USA Computers seems convinced that The Pentium 4
>>>512 FSB outperforms the Athlon in Every Bench Mark. I didn't know enough to
>>>argue with him, repeating only what I heard here about AMD Processors being
>>>Superior to the Pentium in Running Chess Programs, But Why?, Why would the
>>>Pentium Perform better then the Athlon in all other Benchmarks but Chess?
>>
>>A number of reasons but the most important comes down to next:
>>
>>Chessprograms are made by very good programmers and they have optimized them
>>so well that even the most complex commercial chessprograms basically are
>>depending upon processor speed whereas those applications of your friend
>>have to do more with Level2 cache.
>>
>>The good chessprogrammers managed to optimize them so far that they
>>are less relying upon L2 cache speed.
>>
>>It is of course very bad that some applications depend upon L2 cache speed.
>>They should have used better programmers for it!
>
>Many applications don't have a choice. Most benchmarks are also less
>memory-dependent.
>
>The key is that the Pentium 4 and Athlon can both run 3 ipc under ideal
>circumstances, and when memory is taken out of the picture, the Pentium 4 runs
>more cycles and therefore can execute more code in the same amount of time.
>
>>The P4 is having very excellent L2 cache whereas the processor in itself
>>is a very bad piece of work compared to the much older K7 processor.
>>
>>If you look simply what a very complex program can execute a clock cycle
>>on a P4 versus a K7 then the K7 is having a huge number of resources
>>and it is more than amazing that a newer generation processor (the P4)
>>is not even being capable of outperforming it at all when talking
>>about CPU itself.
>
>You mean the P4 has a lower ipc, but I thought we discussed this already. It's
>designed to ramp to higher speeds than Athlon will ever dream of. As for the
>nonsense about the L2 cache, they are nearly identical.
>
>>That SMT/HT when it gets made for the AMD processor will of course let the
>>K7 processor profit *way* more than the P4 profits from it (after they
>>add some registers of course).
>
>Possibly. Intel and AMD processors both already have about 40 internal GPRs
>which the standard 8 x86 GPRs are mapped onto. In HT, you would double the needs
>to 16 GPRs without needed to increase the total number of internal registers.
>
>I think the x86-64 will be much more HT-friendly more because of the additional
>registers available to the programmer. The additional registers reduce register
>contention and allow for register-passing schemes. Also, that AMD has better
>decoders than the Pentium 4 will weigh in their favor, too. (Read: Pentium 4 can
>issue 3 u-ops/cycle and decode 1 u-op/cycle. Athlon can issue 6 u-ops/cycle and
>decode 3 u-ops/cycle.)
>
>>Intel has however a lot more experience with multiprocessing than AMD,
>>so we will see how this develops in the future.
>>
>>Short look at opteron
>>  - huge L1 cache (sorry that i do not know from head how big exactly,
>>    but k7 is already like 128KB L1 cache versus the P4 has 8KB)
>>  - 16384 BTB/BPT entries versus the P3 had just 512 (K7=2048). I didn't
>>    even lookup how many P4 has because it obviously is very needed for
>>    P4 to have a lot more, but i bet it doesn't have 16384 entries.
>>  - locking the processor seems a lot easier (for my own software)
>>    and costs if i continues lock at 2 processors at the same cache line
>>    like 0.3 seconds at 5 minutes. I get impression that current DIEP
>>    versions perform a lot better at the P4 after i deliberately did
>>    a lot of effort to lock less in DIEP. In fact it locks in such a way
>>    that other processors can run on without *ever* getting hurted.
>>    It can split and search and unlock without other processors seeing it
>>    even.
>>    Yet the average parallel software is not so well written. Then K7
>>    wins bigtime if it is about cpu speed and locking that cpu.
>>
>>    On the other hand because of its good level caches (no complaints here
>>    about P4) the effective bandwidth should be higher on P4. Yet P4 is
>>    an entire new design and K7 is already pretty old by now. So if we
>>    take a look to the new design of AMD where the memory controller is on
>>    the cpu then that will outgun of course the P4 by a large margin.
>>
>>The positive news is that lately the P4s are improving a lot, yet the first
>>few benchmarks of the new AMD processors are so very impressive that intel
>>needs an entire new CPU to compete with that with regard to non chess
>>programs. If you consider then also that the current K7 is clocked
>>to 2.x Ghz already and that the opteron is 12 stages or something,
>>then i am sure the new line of AMD processors can easily get
>>clocked to 3.0Ghz too.
>
>The sad truth is that the only thing that has changed about the P4 since its
>original release is the cache, more specifically its size. The only things that
>have changed about Athlon since its original release are cache-related (and the
>addition of SSE instructions).
>
>>In general Intel is a lot faster in releasing its processors than AMD.
>>
>>The performance for each clockcycle is a lot higher than the P4 ever
>>will get, so intel has a big job to release a new processor which combines
>>the positive things of the P4 with a big processor speed.
>
>The difference isn't "a lot higher." The P4 runs lower ipc, but it also runs at
>a clock speed almost 50% faster.
>
>>The sad thing here is that most benchmarks care shit for the actual
>>processor execution speed for complex software but are basically needing
>>bigger bandwidth and lower latencies for memory accesses.
>
>Actually, if memory latency was the issue, Athlon and Pentium 4 would be neck &
>neck and chipsets/ram timings would make a difference. While these things do
>usually affect the results, they do not dramatically change the results of a
>good CPU benchmark.
>
>The Pentium 4 will run lower ipc but higher frequencies. If it's 50% faster than
>Athlon, Intel can get away with 2/3 the ipc. It can afford to run at nearly 1
>ipc if Athlon only gets 2 ipc.
>
>>That is real sad for computerchess, because you measure more
>>how well the caches are than.
>
>More like main memory.
>
>>AMD definitely is not a hair better here than intel. If we look to the new
>>opteron then we see it still can execute only 3 instructions a clock.
>>
>>A small look back in history learns that the pentiumpro already could do that
>>and that the P4 and new AMD processors still will do 3 instructions a clock.
>>
>>This is real sad.
>
>And what do you propose be done to fix it? The fact is that I write IA-32
>assembly on a daily basis, and there are few scenarios where I can avoid data
>dependency by more than 3 cycles. A lot of code won't even get 3 ipc due to data
>dependency. Many instructions are time-consuming and block the instruction
>stream for a period of time, too. More registers will help alleviate some of
>these problems, but I can't see average ipc ever beating ~2.3 or so.
>
>>Computerchess will basically profit most from if they go from 3 to like
>>6 instructions a clock.
>>
>>In that respect the new supercomputer CPU's which will get released in
>>the coming few years by different manufacturers (most notably intel)
>>will kick major butt in this respect.
>>
>>Where a single supercomputer CPU now is clocked at around 1 to 1.3Ghz
>>versus K7 at 2.x and P4 already at 2.8-3.06ghz,
>>soon that huge difference will get smaller and smaller and the delay
>>to release them will be a lot closer to the x86 release.
>
>Only after Gateway and Dell start shipping Itaniums in consumer PCs.
>I doubt that will ever happen. Torvalds said with regard to the x86-64 arch.
>that it is rare for a server-class chip to come to the desktop -- in other
>words, he doesn't think that Itanium will be put in consumer PCs.
>
>>Then again the supercomputer cpu's will blow away single cpu any x86
>>cpu by a large margin, simply because they have a huge L2 or L3 cache
>>and can execute a huge amount of instructions a clock effectively.
>>
>>Way more than any x86 processor can and i fear will do in the near future.
>
>Way more than any x86 processor will ever do. Transmeta Crusoe executes 4
>ops/cycle, and Astro will do 8 ops/cycle, but that's only after translation of
>x86 to a different instruction set.
>
>The one thing that I think would help a lot would be a three-operand form of the
>instruction set. I find myself often writing sequences like the following:
>
>// do something with eax
>add eax, 12345
>mov ecx, eax
>shl ecx, 8
>add eax, ecx
>
>I guess processors could special-case the mov instruction, but I don't think
>they currently do. As a result, this sequence takes an additional cycle.

All the big software I have written would profit a lot from this.

In fact my prayer for new processors is that they can make a copy of a register
'on the fly' without losing a cycle. A special case like that, not eating an
additional cycle, would make programming and building compilers a lot easier,
I bet. The sequential way in which we program is a lot more natural than what
I currently try to do in DIEP, which is to calculate, in between those
instructions, other things that I have to calculate anyway.

So for example, in C I have been writing code like this for quite some years
now:

  a += 123;
  x += 15;
  b = array[a];
  z = x+x; // x*2 or x<<1, possibly slower

Though I do such optimizations everywhere now so as not to lose too many
cycles, it is of course pretty sick to have to optimize like this.

The mov as you propose it would definitely have a major impact on the average
program.
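A minimal C sketch of the two fragments above, for anyone who wants to experiment with them. All names here (`scale_two_operand`, `scale_three_operand`, `State`, `update_sequential`, `update_interleaved`, `check_equivalence`) are made up for illustration and appear in neither post. The first pair mirrors the quoted add/mov/shl/add sequence next to a hypothetical three-operand version of it; the second pair contrasts a dependent statement order with the interleaved order of the C fragment. Both members of each pair compute the same values. An optimizing compiler or an out-of-order CPU may well reorder either form on its own, so this only shows the dependency structure and says nothing about measured speed.

```c
#include <assert.h>
#include <stdint.h>

/* Two-operand style: follows the quoted x86 sequence step by step. */
uint32_t scale_two_operand(uint32_t eax)
{
    uint32_t ecx;
    eax += 12345u;  /* add eax, 12345 */
    ecx = eax;      /* mov ecx, eax   (the extra copy) */
    ecx <<= 8;      /* shl ecx, 8 */
    eax += ecx;     /* add eax, ecx */
    return eax;
}

/* Hypothetical three-operand style: the shift writes to a fresh
   destination, so no separate copy instruction is needed. */
uint32_t scale_three_operand(uint32_t eax)
{
    uint32_t t = eax + 12345u;  /* add t, eax, 12345 */
    uint32_t s = t << 8;        /* shl s, t, 8 */
    return t + s;               /* add result, t, s */
}

typedef struct { int a, x, b, z; } State;

/* Dependent order: each statement needs the result of the one before it. */
State update_sequential(State s, const int *array)
{
    s.a += 123;
    s.b = array[s.a];
    s.x += 15;
    s.z = s.x + s.x;
    return s;
}

/* Interleaved order from the post: the a/b chain and the x/z chain
   alternate, so no statement depends on the one directly before it. */
State update_interleaved(State s, const int *array)
{
    s.a += 123;
    s.x += 15;
    s.b = array[s.a];
    s.z = s.x + s.x;
    return s;
}

/* Sanity check: both forms of each pair agree on a sample input. */
void check_equivalence(void)
{
    int table[256];
    for (int i = 0; i < 256; i++)
        table[i] = 2 * i;

    State s0 = {1, 2, 0, 0};
    State p = update_sequential(s0, table);
    State q = update_interleaved(s0, table);
    assert(p.a == q.a && p.b == q.b && p.x == q.x && p.z == q.z);
    assert(scale_two_operand(7u) == scale_three_operand(7u));
}
```

Calling `check_equivalence()` from a test program confirms that the reordering and the three-operand rewrite change only the order of work and not the results.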
I am sure such a mov would have less of an impact on DIEP than on, for example,
Crafty, though this is not 100% certain, because there is one place where I can
hardly avoid those penalties, and that is around branches. As soon as I have
the info, I put the branch question to the CPU. That loses an extra cycle for
sure, because the branch compare uses the info that was just calculated.

Now I have way more branches than Crafty, so I do not know the total impact at
all, but I guess the average speedup of programs would be *huge* if this dream
scenario happened.

There is definitely improvement possible here for CPUs!

>When you're talking about sequences that take 10-20 cycles, squeezing out
>another cycle is a 5-10% performance gain. Longer sequences stand to benefit
>a cycle from multiple places.

Way, way more than 5% for most chess programs. 10% is definitely the estimate
I would *start* at.

>>In that respect the development of x86 cpu's is not very positive for
>>computer chess.
>>Best regards,
>>Vincent
Last modified: Thu, 15 Apr 21 08:11:13 -0700