Author: Matt Taylor
Date: 22:39:35 12/31/02
On January 01, 2003 at 01:18:03, Vincent Diepeveen wrote:

>On December 31, 2002 at 23:50:10, Matt Taylor wrote:
>
>>On December 31, 2002 at 21:52:46, Vincent Diepeveen wrote:
>>
>>>On December 31, 2002 at 17:22:42, Rick Terry wrote:
>>>
>>>>A friend of mine who works at USA Computers seems convinced that The Pentium 4
>>>>512 FSB outperforms the Athlon in Every Bench Mark. I didn't know enough to
>>>>argue with him, repeating only what I heard here about AMD Processors being
>>>>Superior to the Pentium in Running Chess Programs, But Why?, Why would the
>>>>Pentium Perform better then the Athlon in all other Benchmarks but Chess?
>>>
>>>A number of reasons but the most important comes down to next:
>>>
>>>Chessprograms are made by very good programmers and they have optimized them
>>>so well that even the most complex commercial chessprograms basically are
>>>depending upon processor speed whereas those applications of your friend
>>>have to do more with Level2 cache.
>>>
>>>The good chessprogrammers managed to optimize them so far that they
>>>are less relying upon L2 cache speed.
>>>
>>>It is of course very bad that some applications depend upon L2 cache speed.
>>>They should have used better programmers for it!
>>
>>Many applications don't have a choice. Most benchmarks are also less
>>memory-dependent.
>>
>>The key is that the Pentium 4 and Athlon can both run 3 ipc under ideal
>>circumstances, and when memory is taken out of the picture, the Pentium 4 runs
>>more cycles and therefore can execute more code in the same amount of time.
>>
>>>The P4 is having very excellent L2 cache whereas the processor in itself
>>>is a very bad piece of work compared to the much older K7 processor.
>>>
>>>If you look simply what a very complex program can execute a clock cycle
>>>on a P4 versus a K7 then the K7 is having a huge number of resources
>>>and it is more than amazing that a newer generation processor (the P4)
>>>is not even being capable of outperforming it at all when talking
>>>about CPU itself.
>>
>>You mean the P4 has a lower ipc, but I thought we discussed this already. It's
>>designed to ramp to higher speeds than Athlon will ever dream of. As for the
>>nonsense about the L2 cache, they are nearly identical.
>>
>>>That SMT/HT when it gets made for the AMD processor will of course let the
>>>K7 processor profit *way* more than the P4 profits from it (after they
>>>add some registers of course).
>>
>>Possibly. Intel and AMD processors both already have about 40 internal GPRs
>>which the standard 8 x86 GPRs are mapped onto. In HT, you would double the needs
>>to 16 GPRs without needed to increase the total number of internal registers.
>>
>>I think the x86-64 will be much more HT-friendly more because of the additional
>>registers available to the programmer. The additional registers reduce register
>>contention and allow for register-passing schemes. Also, that AMD has better
>>decoders than the Pentium 4 will weigh in their favor, too. (Read: Pentium 4 can
>>issue 3 u-ops/cycle and decode 1 u-op/cycle. Athlon can issue 6 u-ops/cycle and
>>decode 3 u-ops/cycle.)
>>
>>>Intel has however a lot more experience with multiprocessing than AMD,
>>>so we will see how this develops in the future.
>>>
>>>Short look at opteron
>>> - huge L1 cache (sorry that i do not know from head how big exactly,
>>>   but k7 is already like 128KB L1 cache versus the P4 has 8KB)
>>> - 16384 BTB/BPT entries versus the P3 had just 512 (K7=2048). I didn't
>>>   even lookup how many P4 has because it obviously is very needed for
>>>   P4 to have a lot more, but i bet it doesn't have 16384 entries.
>>> - locking the processor seems a lot easier (for my own software)
>>>   and costs if i continues lock at 2 processors at the same cache line
>>>   like 0.3 seconds at 5 minutes. I get impression that current DIEP
>>>   versions perform a lot better at the P4 after i deliberately did
>>>   a lot of effort to lock less in DIEP. In fact it locks in such a way
>>>   that other processors can run on without *ever* getting hurted. It
>>>   can split and search and unlock without other processors seeing it
>>>   even.
>>>   Yet the average parallel software is not so well written. Then K7
>>>   wins bigtime if it is about cpu speed and locking that cpu.
>>>
>>>   On the other hand because of its good level caches (no complaints here
>>>   about P4) the effective bandwidth should be higher on P4. Yet P4 is
>>>   an entire new design and K7 is already pretty old by now. So if we
>>>   take a look to the new design of AMD where the memory controller is on
>>>   the cpu then that will outgun of course the P4 by a large margin.
>>>
>>>The positive news is that lately the P4s are improving a lot, yet the first
>>>few benchmarks of the new AMD processors are so very impressive that intel
>>>needs an entire new CPU to compete with that with regard to non chess
>>>programs. If you consider then also that the current K7 is clocked
>>>to 2.x Ghz already and that the opteron is 12 stages or something,
>>>then i am sure the new line of AMD processors can easily get
>>>clocked to 3.0Ghz too.
>>
>>The sad truth is that the only thing that has changed about the P4 since its
>>original release is the cache, more specifically its size. The only things that
>>have changed about Athlon since its original release are cache-related (and the
>>addition of SSE instructions).
>>
>>>In general Intel is a lot faster in releasing its processors than AMD.
>>>
>>>The performance for each clockcycle is a lot higher than the P4 ever
>>>will get, so intel has a big job to release a new processor which combines
>>>the positive things of the P4 with a big processor speed.
>>
>>The difference isn't "a lot higher." The P4 runs lower ipc, but it also runs at
>>a clock speed almost 50% faster.
>>
>>>The sad thing here is that most benchmarks care shit for the actual
>>>processor execution speed for complex software but are basically needing
>>>bigger bandwidth and lower latencies for memory accesses.
>>
>>Actually, if memory latency was the issue, Athlon and Pentium 4 would be neck &
>>neck and chipsets/ram timings would make a difference. While these things do
>>usually affect the results, they do not dramatically change the results of a
>>good CPU benchmark.
>>
>>The Pentium 4 will run lower ipc but higher frequencies. If it's 50% faster than
>>Athlon, Intel can get away with 2/3 the ipc. It can afford to run at nearly 1
>>ipc if Athlon only gets 2 ipc.
>>
>>>That is real sad for computerchess, because you measure more
>>>how well the caches are than.
>>
>>More like main memory.
>>
>>>AMD definitely is not a hair better here than intel. If we look to the new
>>>opteron then we see it still can execute only 3 instructions a clock.
>>>
>>>A small look back in history learns that the pentiumpro already could do that
>>>and that the P4 and new AMD processors still will do 3 instructions a clock.
>>>
>>>This is real sad.
>>
>>And what do you propose be done to fix it? The fact is that I write IA-32
>>assembly on a daily basis, and there are few scenarios where I can avoid data
>>dependency by more than 3 cycles. A lot of code won't even get 3 ipc due to data
>>dependency. Many instructions are time-consuming and block the instruction
>>stream for a period of time, too. More registers will help alleviate some of
>>these problems, but I can't see average ipc ever beating ~2.3 or so.
>>
>>>Computerchess will basically profit most from if they go from 3 to like
>>>6 instructions a clock.
>>>
>>>In that respect the new supercomputer CPU's which will get released in
>>>the coming few years by different manufacturers (most notably intel)
>>>will kick major butt in this respect.
>>>
>>>Where a single supercomputer CPU now is clocked at around 1 to 1.3Ghz
>>>versus K7 at 2.x and P4 already at 2.8-3.06ghz,
>>>soon that huge difference will get smaller and smaller and the delay
>>>to release them will be a lot closer to the x86 release.
>>
>>Only after Gateway and Dell start shipping Itaniums in consumer PCs. I doubt
>>that will ever happen. Torvalds said with regard to the x86-64 arch. that it is
>>rare for a server-class chip to come to the desktop -- in other words, he
>>doesn't think that Itanium will be put in consumer PCs.
>>
>>>Then again the supercomputer cpu's will blow away single cpu any x86
>>>cpu by a large margin, simply because they have a huge L2 or L3 cache
>>>and can execute a huge amount of instructions a clock effectively.
>>>
>>>Way more than any x86 processor can and i fear will do in the near future.
>>
>>Way more than any x86 processor will ever do. Transmeta Crusoe executes 4
>>ops/cycle, and Astro will do 8 ops/cycle, but that's only after translation of
>>x86 to a different instruction set.
>>
>>The one thing that I think would help a lot would be a three-operand form of the
>>instruction set. I find myself often writing sequences like the following:
>>
>>// do something with eax
>>add eax, 12345
>>mov ecx, eax
>>shl ecx, 8
>>add eax, ecx
>
>>I guess processors could special-case the mov instruction, but I don't think
>>they currently do. As a result, this sequence takes an additional cycle.
>
>All big software i have written would profit a lot from this. In fact
>the prayer for new processors is to make a copy of a register
>'on the fly' without loss of a cycle.

Something RISC machines have been doing for a long time, too.
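For readers who don't read assembly, here is a rough C rendering of the quoted four-instruction sequence. It is only a sketch: the function names are made up for illustration, and the "three-operand" version merely models what a hypothetical non-destructive shift would compute. No real x86 encoding is implied.

```c
#include <assert.h>
#include <stdint.h>

/* Two-operand x86 style: shl overwrites its source operand, so eax
   has to be copied into ecx first. That copy is the extra mov. */
uint32_t two_operand_style(uint32_t eax)
{
    uint32_t ecx;
    eax += 12345u;   /* add eax, 12345 */
    ecx = eax;       /* mov ecx, eax   (the copy we would like for free) */
    ecx <<= 8;       /* shl ecx, 8     */
    eax += ecx;      /* add eax, ecx   */
    return eax;
}

/* Hypothetical three-operand style: the shift names a separate
   destination, so the copy disappears. */
uint32_t three_operand_style(uint32_t eax)
{
    uint32_t ecx;
    eax += 12345u;   /* add eax, eax, 12345 (hypothetical form) */
    ecx = eax << 8;  /* shl ecx, eax, 8     (hypothetical form) */
    eax += ecx;      /* add eax, eax, ecx   (hypothetical form) */
    return eax;
}

/* Both versions compute (x + 12345) * 257 modulo 2^32. */
void check_same_result(uint32_t x)
{
    assert(two_operand_style(x) == three_operand_style(x));
}
```

The two functions return the same value for every input; the only difference is one fewer operation in the dependency chain of the second version.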
>Such a special case not eating an additional cycle would make programming
>and building compilers a lot easier i bet because the sequential way in
>which we program is a lot more native than what i currently try to do in
>DIEP and that is trying to already calculate things i already have to
>calculate in between those instructions.
>
>So for example in C i'm already busy for quite some years with writing
>code such as
>
>  a += 123;
>  x += 15;
>  b = array[a];
>  z = x+x; // x*2 or x<<2 slower possibly

Unnecessary in C -- the compiler will reorder them for you if there is no
dependency. At the assembly level, the machine will do SOME reordering, but the
processor will still suffer if things aren't ordered properly.

>though i do such optimizations everywhere now to not lose too much
>cycles, it is of course pretty sick optimizing in fact like this. the mov
>as you propose it would definitely have a major impact onto the
>average program. I am sure less of an impact onto DIEP than it would
>have for example to crafty, though this is not 100% sure because
>there is 1 place where i hardly can avoid those penalties and that's
>around branches.

It's more at the machine level, and I've noticed recently that a lot of code has
to copy the result to another register to tinker with it some. It wouldn't be a
huge gain, but most software would benefit.

Considering what AMD did with their 64-bit code, it would be possible. They
could steal the push/pop 1-byte range to use as prefixes to specify the
destination register. This has the advantage of keeping the decoder (relatively)
simple and uniform between 32-bit and 64-bit code, plus you can keep the old
opcode form if you prefer.

>As soon as i get the info i do the branch question to the cpu. that loses
>for sure an extra cycle because the branch compare is using the info just
>calculated.

If you compile without optimization perhaps, but then you lose a lot more than
just 1 cycle.
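As a small illustration of the reordering point above: the two functions below perform the quoted statements in the "natural" order and in the hand-interleaved order. This is a sketch under assumptions, not DIEP code; the `struct result` type, the function names, and the explicit length parameter `n` are invented here. Because the `a`/`b` pair and the `x`/`z` pair do not depend on each other, both orders give identical results, which is what leaves an optimizing compiler free to schedule them however it likes.

```c
#include <assert.h>

/* Hypothetical container so both variants can return all four values. */
struct result { int a, x, b, z; };

/* "Natural" order: each dependent pair is kept together. */
struct result natural_order(const int *array, int n, int a, int x)
{
    struct result r;
    a += 123;
    assert(a >= 0 && a < n);   /* keep the table lookup in bounds */
    r.a = a;
    r.b = array[a];            /* depends only on a */
    x += 15;
    r.x = x;
    r.z = x + x;               /* depends only on x */
    return r;
}

/* Hand-interleaved order, as in the quoted snippet. */
struct result interleaved_order(const int *array, int n, int a, int x)
{
    struct result r;
    a += 123;
    x += 15;
    assert(a >= 0 && a < n);   /* keep the table lookup in bounds */
    r.a = a;
    r.x = x;
    r.b = array[a];            /* depends only on a */
    r.z = x + x;               /* depends only on x */
    return r;
}
```

Whether a particular compiler emits the same instructions for both is something to check in its assembly output rather than assume.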
>Now i have way more branches than crafty so i do not know the total impact
>of it at all, but i guess the average speedup of the programs would be
>*huge* when this dream scenario would happen.
>
>there is definitely improvement possible here for cpu's!
>
>>When
>>you're talking about sequences that take 10-20 cycles, squeezing out another
>>cycle is a 5-10% performance gain. Longer sequences stand to benefit a cycle
>>from multiple places.
>
>way way more than 5% for most chessprograms. 10% is definitely the
>estimation i would *start* at.
>
>>>In that respect the development of x86 cpu's is not very positive for
>>>computer chess.
>
>>>Best regards,
>>>Vincent
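Two bits of arithmetic from the thread can be checked with a few lines of C. This is a back-of-the-envelope sketch, not a performance model; the function names are invented for illustration, and "throughput" here simply means ipc multiplied by clock speed, with memory effects ignored.

```c
#include <assert.h>

/* Fraction of run time saved when saved_cycles are removed from a
   sequence that originally took total_cycles. */
double fraction_saved(int total_cycles, int saved_cycles)
{
    assert(total_cycles > 0);
    assert(saved_cycles >= 0 && saved_cycles <= total_cycles);
    return (double)saved_cycles / (double)total_cycles;
}

/* Throughput of chip A relative to chip B, given A's ipc and clock
   speed expressed as ratios of B's (1.0 means equal). */
double relative_throughput(double ipc_ratio, double clock_ratio)
{
    assert(ipc_ratio > 0.0 && clock_ratio > 0.0);
    return ipc_ratio * clock_ratio;
}
```

With these definitions, `fraction_saved(10, 1)` is 0.10 and `fraction_saved(20, 1)` is 0.05, which is the 5-10% range quoted above. `relative_throughput(2.0 / 3.0, 1.5)` comes out at about 1.0, the break-even case of two-thirds the ipc at a 50% higher clock.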
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.