Computer Chess Club Archives


Subject: Re: Why is P4 less efficient than Athlon (or P3) for chess programs ?

Author: Vincent Diepeveen

Date: 11:37:37 07/04/03


On July 03, 2003 at 16:48:07, Tom Kerrigan wrote:

>On July 03, 2003 at 16:23:10, Russell Reagan wrote:
>
>>On July 03, 2003 at 15:02:55, Joachim Rang wrote:
>>
>>>The main reason is, that Athlon and P3 have 9 instructions per cycle and P4 has
>>>only 6.
>>
>>Also the length of the pipeline on the P3 is 10, which means that a mispredicted
>>branch costs 10 cycles. On the P4 the length of the pipeline is 20, which means
>>it costs 20 cycles for a mispredicted branch. I may be wrong about the actual
>>numbers (10 and 20), but I think they are close. I'm not sure what the length is
>>on the Athlon. Anyone know?
>
>Pentium 3: 12 cycles

9+ cycles. Usually it's more like 15 though when you measure.

>Pentium 4: 20 cycles

20+ cycles. Usually it's more like 30+ though.

>Athlon: 10 cycles

There is no official data on this, and I will never sign an NDA with either
Intel or AMD. AMD answered my public question a few years ago:

"Mr Diepeveen, it is more than the P3, but the exact amount is secret"

This is my only big criticism of AMD. If they released more specs about their
processors, everyone could tune better for them. In this respect I understand
very well why most assembly-optimized software runs so well on Intel hardware:
what runs fast on the P4 is usually described publicly by Intel.

Also, it is always easier to get Intel hardware from sponsors.

Combine those two reasons, add that it is a small program fitting in the trace
cache, and you know why products like Fritz are not doing so badly on the P4.

A good example is that two vector instructions in a row seem to be *very* slow
on the K7. Now *how* could I have known this if Gerd had not posted it here?

>Opteron/Athlon 64: 12 cycles
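
To put the penalties quoted above in perspective, here is a rough back-of-the-envelope sketch in Python. The penalty values are the ones given in this thread (the "measured" 15 and 30 cycles for P3 and P4, Tom's 10 and 12 for Athlon and Opteron). The branch frequency (one branch per 6 instructions) and the 10% misprediction rate are pure assumptions for illustration, not measurements, and the function name `mispredict_overhead` is made up.

```python
# Rough model: cycles lost to branch mispredictions per block of instructions.
# ASSUMPTIONS (not vendor data, not measured): 1 branch per 6 instructions,
# 10% of those branches mispredicted.

def mispredict_overhead(penalty_cycles, instructions=1000,
                        branch_fraction=1 / 6, mispredict_rate=0.10):
    """Return the cycles lost to mispredictions over `instructions` instructions."""
    branches = instructions * branch_fraction
    return branches * mispredict_rate * penalty_cycles

# Misprediction penalties in cycles, as quoted in this thread.
penalties = {
    "P3 (measured)": 15,
    "P4 (measured)": 30,
    "Athlon (Tom's figure)": 10,
    "Opteron (Tom's figure)": 12,
}

for cpu, penalty in penalties.items():
    lost = mispredict_overhead(penalty)
    print(f"{cpu}: ~{lost:.0f} cycles lost per 1000 instructions")
```

Whatever rates you plug in, the loss in this model scales linearly with the penalty, so a chip with twice the penalty pays twice as much per mispredicted branch.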

>In addition to unpredictable branches and parallelism, the P4 also has 8k of L1
>cache vs. the Athlon's 64k. The P4's cache is faster, but that may not make up
>for the difference in size with typical chess programs.

Also, the width is bigger on the P4, which means the processor is simply
bandwidth optimized, whereas for chess it is latency that counts, and that
favors the AMD hardware. Opteron seems to combine both.

But the main reason Intel started the P4 project, IMHO, is that they correctly
figured out that a higher-clocked CPU will sell better than a lower-clocked one.

If you dominate the market for that long, it will always be easy to get some
professors to put together a test that measures bandwidth rather than IPC. Those
guys are busy with bandwidth anyway, rather than with how fast you can get when
code is well optimized. So Intel was sure in advance that they would win that
battle, especially given that they have their own compiler.

That Intel then managed to build such a high-clocked CPU is really something
they must be given credit for.

On SPECint they score slightly higher than the K7 AND they have the
higher-clocked CPU.

From a marketing viewpoint, really a brilliant achievement by Intel.

However, in computer chess we are confronted with the reality that chess
programs are well optimized but, with some exceptions such as TSCP and Fritz, do
not fit in the trace cache.

I forgot what size the BTB (branch target buffer = branch prediction table) of
the P4 is, but the branch prediction table on the K7 has 2048 entries. On the
P6/P2/P3 it has 512, and on the Opteron 16384.

It is this which is very important for chess programs (well, most of them;
Chessmaster, Fritz and TSCP probably fit within 2048 entries, because the
positional knowledge in these programs is limited and the first two have
trivially optimized away a lot of branches).

For software that doesn't fit within the BTB size, the P4 is a big horror of
course.
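
To illustrate why the table size matters, here is a toy simulation in Python. It is only a sketch under loud assumptions: the table is modeled as direct-mapped (slot = address modulo table size), branches are hit uniformly at random, and the working-set sizes of 1500 and 10000 branches are invented for illustration. Real BTBs are organized differently and the details are not public, so only the trend is meaningful, not the numbers. The table sizes are the ones mentioned above; `btb_miss_rate` is a made-up name.

```python
import random

def btb_miss_rate(table_entries, num_branches, lookups=100_000, seed=1):
    """Toy direct-mapped branch table (ASSUMED organization, not real hardware).

    Returns the fraction of lookups where the slot did not already hold
    the branch being looked up (cold miss or eviction by another branch).
    """
    rng = random.Random(seed)
    # Hypothetical branch addresses scattered over a 1 MB code segment.
    addresses = rng.sample(range(1 << 20), num_branches)
    table = [None] * table_entries
    misses = 0
    for _ in range(lookups):
        addr = rng.choice(addresses)   # ASSUMPTION: uniform random access
        slot = addr % table_entries
        if table[slot] != addr:
            misses += 1
            table[slot] = addr         # newest branch takes over the slot
    return misses / lookups

# Table sizes from the post: P6/P2/P3 = 512, K7 = 2048, Opteron = 16384.
for entries in (512, 2048, 16384):
    for working_set in (1500, 10000):  # invented program sizes
        rate = btb_miss_rate(entries, working_set)
        print(f"{entries:5d} entries, {working_set:5d} branches: "
              f"miss rate {rate:.1%}")
```

In this model a program whose branches fit in the table sees almost no misses once the table is warm, while a program with more branches than entries keeps evicting its own entries.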

>-Tom




Last modified: Thu, 15 Apr 21 08:11:13 -0700
