Author: Vincent Diepeveen
Date: 11:37:37 07/04/03
On July 03, 2003 at 16:48:07, Tom Kerrigan wrote:

>On July 03, 2003 at 16:23:10, Russell Reagan wrote:
>
>>On July 03, 2003 at 15:02:55, Joachim Rang wrote:
>>
>>>The main reason is, that Athlon and P3 have 9 instructions per cycle and P4 has
>>>only 6.
>>
>>Also the length of the pipeline on the P3 is 10, which means that a mispredicted
>>branch costs 10 cycles. On the P4 the length of the pipeline is 20, which means
>>it costs 20 cycles for a mispredicted branch. I may be wrong about the actual
>>numbers (10 and 20, but I think they are close). I'm not sure what the length is
>>on the Athlon. Anyone know?
>
>Pentium 3: 12 cycles

9+ cycles. Usually it's more like 15 when you measure, though.

>Pentium 4: 20 cycles

20+ cycles. Usually it's more like 30+, though.

>Athlon: 10 cycles

There is no official data on this, and I won't ever sign an NDA with either Intel or AMD. AMD answered my public question a few years ago: "Mr Diepeveen, it is more than the P3, but the exact amount is secret."

This is my only big criticism of AMD. If they released more specs about their processors, everyone could tune better for them. In this respect I understand very well why most assembly-optimized software runs so well on Intel hardware. What runs fast on the P4 is usually described publicly by Intel. It is also always easier to get Intel hardware from sponsors. Combine those two reasons, add that it is a small program fitting in the trace cache, and you know why products like Fritz are not doing so badly on the P4.

A good example is that 2 vector instructions in a row seem to be *very* slow on the K7. Now *how* could I have known this except from Gerd posting it here?

>Opteron/Athlon 64: 12 cycles

>In addition to unpredictable branches and parallelism, the P4 also has 8k of L1
>cache vs. the Athlon's 64k. The P4's cache is faster, but that may not make up
>for the difference in size with typical chess programs.
Also, the width is bigger on the P4, which means the processor is simply bandwidth-optimized, whereas for chess it is latency that counts, and that favors the AMD hardware. Opteron seems to combine both.

But the main reason Intel started the P4 project, IMHO, is that they correctly figured out that a higher-clocked CPU will sell better than a lower-clocked CPU. If you dominate the market for that long, it will always be easy to get some professors to put together a test that measures bandwidth rather than IPC. Those guys are busy with bandwidth anyway, rather than with how fast you can get when code is well optimized. So Intel was sure in advance that they would win that battle, especially since they have their own compiler.

That Intel then managed to get such a high-clocked CPU is really something they must be given credit for. On SPECint they score slightly higher than the K7 AND they have the higher-clocked CPU. From a marketing viewpoint, really a brilliant achievement by Intel.

However, in computer chess we are confronted with the reality that chess programs are well optimized but, with some exceptions such as TSCP and Fritz, do not fit in the trace cache.

I forgot what size the BTB (branch target buffer = branch prediction table) of the P4 is, but the branch prediction table on the K7 has 2048 entries. On the P6/P2/P3 it is 512. On the Opteron, 16384. This is what matters a lot for chess programs (well, most of them; Chessmaster/Fritz/TSCP probably fit within 2048 entries, because the positional knowledge in these programs is limited and the first 2 have trivially optimized away a lot of branches). For software that doesn't fit within the BTB size, the P4 is a big horror of course.

>-Tom
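
To make the point about BTB size concrete, here is a rough toy sketch in Python. It is not a model of any real CPU: it assumes a direct-mapped table indexed by branch number modulo the table size, and a program that runs through all of its branches in the same order on every pass. Real BTBs are indexed by address bits and are typically set-associative, and real programs do not touch every branch equally. The function name btb_hit_rate and the branch counts 1500, 3000 and 10000 are made up for illustration; only the table sizes 512, 2048 and 16384 come from the post.

```python
def btb_hit_rate(num_branches, entries, passes=3):
    """Fraction of lookups that find their branch in a toy direct-mapped table.

    Assumption (not real hardware): branch i maps to slot i % entries, and
    the program executes branches 0..num_branches-1 in order on every pass.
    """
    table = [None] * entries

    # Warm-up pass so that cold misses are not counted.
    for branch in range(num_branches):
        table[branch % entries] = branch

    hits = 0
    for _ in range(passes):
        for branch in range(num_branches):
            slot = branch % entries
            if table[slot] == branch:
                hits += 1
            else:
                # Another branch owned this slot: evict it and take over.
                table[slot] = branch
    return hits / (passes * num_branches)


if __name__ == "__main__":
    # Table sizes quoted in the post; the P4 size is not given there.
    tables = {"P6/P2/P3": 512, "K7": 2048, "Opteron": 16384}
    # Hypothetical counts of hot branches in a program.
    for num_branches in (1500, 3000, 10000):
        for name, entries in tables.items():
            rate = btb_hit_rate(num_branches, entries)
            print(f"{num_branches:6d} branches, {name:9s} ({entries:5d} entries): "
                  f"hit rate {rate:.2f}")
```

In this toy model the hit rate stays at 1 as long as the branches fit in the table, and collapses once two or more branches keep evicting each other from the same slot. How closely real hardware follows this depends on associativity and replacement policy, which the sketch ignores.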
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.