Computer Chess Club Archives


Subject: Re: Why Does AMD out perform The Pentium Processor in Chess Only?

Author: Vincent Diepeveen

Date: 22:18:03 12/31/02


On December 31, 2002 at 23:50:10, Matt Taylor wrote:

>On December 31, 2002 at 21:52:46, Vincent Diepeveen wrote:
>
>>On December 31, 2002 at 17:22:42, Rick Terry wrote:
>>
>>>A friend of mine who works at USA Computers seems convinced that The Pentium 4
>>>512 FSB outperforms the Athlon in Every Bench Mark. I didn't know enough to
>>>argue with him, repeating only what I heard here about AMD Processors being
>>>Superior to the Pentium in Running Chess Programs, But Why? Why would the
>>>Pentium Perform better than the Athlon in all other Benchmarks but Chess?
>>
>>A number of reasons, but the most important comes down to the following:
>>
>>Chess programs are made by very good programmers, and they have optimized them
>>so well that even the most complex commercial chess programs basically
>>depend on processor speed, whereas those applications of your friend
>>have more to do with Level 2 cache.
>>
>>The good chess programmers managed to optimize them so far that they
>>rely less on L2 cache speed.
>>
>>It is of course very bad that some applications depend upon L2 cache speed.
>>They should have used better programmers for it!
>
>Many applications don't have a choice. Most benchmarks are also less
>memory-dependent.
>
>The key is that the Pentium 4 and Athlon can both run 3 ipc under ideal
>circumstances, and when memory is taken out of the picture, the Pentium 4 runs
>more cycles and therefore can execute more code in the same amount of time.
>
>>The P4 is having very excellent L2 cache whereas the processor in itself
>>is a very bad piece of work compared to the much older K7 processor.
>>
>>If you simply look at what a very complex program can execute per clock cycle
>>on a P4 versus a K7, then the K7 has a huge number of resources,
>>and it is more than amazing that a newer generation processor (the P4)
>>is not even capable of outperforming it at all when talking
>>about the CPU itself.
>
>You mean the P4 has a lower ipc, but I thought we discussed this already. It's
>designed to ramp to higher speeds than Athlon will ever dream of. As for the
>nonsense about the L2 cache, they are nearly identical.
>
>>That SMT/HT when it gets made for the AMD processor will of course let the
>>K7 processor profit *way* more than the P4 profits from it (after they
>>add some registers of course).
>
>Possibly. Intel and AMD processors both already have about 40 internal GPRs
>which the standard 8 x86 GPRs are mapped onto. In HT, you would double the needs
>to 16 GPRs without needing to increase the total number of internal registers.
>
>I think the x86-64 will be much more HT-friendly, mostly because of the additional
>registers available to the programmer. The additional registers reduce register
>contention and allow for register-passing schemes. Also, that AMD has better
>decoders than the Pentium 4 will weigh in their favor, too. (Read: Pentium 4 can
>issue 3 u-ops/cycle and decode 1 u-op/cycle. Athlon can issue 6 u-ops/cycle and
>decode 3 u-ops/cycle.)
>
>>Intel has however a lot more experience with multiprocessing than AMD,
>>so we will see how this develops in the future.
>>
>>Short look at opteron
>>  - huge L1 cache (sorry that i do not know from head how big exactly,
>>    but k7 is already like 128KB L1 cache versus the P4 has 8KB)
>>  - 16384 BTB/BPT entries versus the P3 had just 512 (K7=2048). I didn't
>>    even lookup how many P4 has because it obviously is very needed for
>>    P4 to have a lot more, but i bet it doesn't have 16384 entries.
>>  - locking the processor seems a lot easier (for my own software)
>>    and costs, if i continuously lock the same cache line from 2
>>    processors, like 0.3 seconds in 5 minutes. I get the impression that
>>    current DIEP versions perform a lot better on the P4 after i deliberately
>>    put a lot of effort into locking less in DIEP. In fact it locks in such
>>    a way that other processors can run on without *ever* getting hurt. It
>>    can split and search and unlock without other processors even
>>    seeing it.
>>    Yet the average parallel software is not so well written. Then K7
>>    wins bigtime if it is about cpu speed and locking that cpu.
>>
>>    On the other hand because of its good level caches (no complaints here
>>    about P4) the effective bandwidth should be higher on P4. Yet P4 is
>>    an entire new design and K7 is already pretty old by now. So if we
>>    take a look to the new design of AMD where the memory controller is on
>>    the cpu then that will outgun of course the P4 by a large margin.
>>
>>The positive news is that lately the P4s are improving a lot, yet the first
>>few benchmarks of the new AMD processors are so very impressive that intel
>>needs an entire new CPU to compete with that with regard to non chess
>>programs. If you consider then also that the current K7 is clocked
>>to 2.x Ghz already and that the opteron is 12 stages or something,
>>then i am sure the new line of AMD processors can easily get
>>clocked to 3.0Ghz too.
>
>The sad truth is that the only thing that has changed about the P4 since its
>original release is the cache, more specifically its size. The only things that
>have changed about Athlon since its original release are cache-related (and the
>addition of SSE instructions).
>
>>In general Intel is a lot faster in releasing its processors than AMD.
>>
>>The performance for each clockcycle is a lot higher than the P4 ever
>>will get, so intel has a big job to release a new processor which combines
>>the positive things of the P4 with a big processor speed.
>
>The difference isn't "a lot higher." The P4 runs lower ipc, but it also runs at
>a clock speed almost 50% faster.
>
>>The sad thing here is that most benchmarks care shit for the actual
>>processor execution speed for complex software but are basically needing
>>bigger bandwidth and lower latencies for memory accesses.
>
>Actually, if memory latency was the issue, Athlon and Pentium 4 would be neck &
>neck and chipsets/ram timings would make a difference. While these things do
>usually affect the results, they do not dramatically change the results of a
>good CPU benchmark.
>
>The Pentium 4 will run lower ipc but higher frequencies. If it's 50% faster than
>Athlon, Intel can get away with 2/3 the ipc. It can afford to run at nearly 1
>ipc if Athlon only gets 2 ipc.
>
>>That is real sad for computerchess, because you measure more
>>how good the caches are then.
>
>More like main memory.
>
>>AMD definitely is not a hair better here than intel. If we look to the new
>>opteron then we see it still can execute only 3 instructions a clock.
>>
>>A small look back in history shows that the pentiumpro already could do that
>>and that the P4 and new AMD processors still will do 3 instructions a clock.
>>
>>This is real sad.
>
>And what do you propose be done to fix it? The fact is that I write IA-32
>assembly on a daily basis, and there are few scenarios where I can avoid data
>dependency by more than 3 cycles. A lot of code won't even get 3 ipc due to data
>dependency. Many instructions are time-consuming and block the instruction
>stream for a period of time, too. More registers will help alleviate some of
>these problems, but I can't see average ipc ever beating ~2.3 or so.
>
>>Computerchess will basically profit most from if they go from 3 to like
>>6 instructions a clock.
>>
>>In that respect the new supercomputer CPU's which will get released in
>>the coming few years by different manufacturers (most notably intel)
>>will kick major butt in this respect.
>>
>>Where a single supercomputer CPU now is clocked at around 1 to 1.3Ghz
>>versus K7 at 2.x and P4 already at 2.8-3.06ghz,
>>soon that huge difference will get smaller and smaller and the delay
>>to release them will be a lot closer to the x86 release.
>
>Only after Gateway and Dell start shipping Itaniums in consumer PCs. I doubt
>that will ever happen. Torvalds said with regard to the x86-64 arch. that it is
>rare for a server-class chip to come to the desktop -- in other words, he
>doesn't think that Itanium will be put in consumer PCs.
>
>>Then again the supercomputer cpu's will blow away single cpu any x86
>>cpu by a large margin, simply because they have a huge L2 or L3 cache
>>and can execute a huge amount of instructions a clock effectively.
>>
>>Way more than any x86 processor can and i fear will do in the near future.
>
>Way more than any x86 processor will ever do. Transmeta Crusoe executes 4
>ops/cycle, and Astro will do 8 ops/cycle, but that's only after translation of
>x86 to a different instruction set.
>
>The one thing that I think would help a lot would be a three-operand form of the
>instruction set. I find myself often writing sequences like the following:
>
>// do something with eax
>add eax, 12345
>mov ecx, eax
>shl ecx, 8
>add eax, ecx

>I guess processors could special-case the mov instruction, but I don't think
>they currently do. As a result, this sequence takes an additional cycle.
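
For readers who do not read x86 assembly, the quoted sequence can be sketched in C as below. This is only an illustration: the function names are made up here, and the "three-operand" mnemonics in the comments are hypothetical, not real x86 instructions. Both functions compute the same value; the difference is only whether the copy into a temporary is an explicit step.

```c
#include <stdint.h>

/* Two-operand style, mirroring the quoted sequence line by line.
   The copy into ecx is an explicit step, like `mov ecx, eax`. */
static uint32_t scale_two_operand(uint32_t eax)
{
    uint32_t ecx;
    eax += 12345u;   /* add eax, 12345 */
    ecx = eax;       /* mov ecx, eax   (the extra copy) */
    ecx <<= 8;       /* shl ecx, 8     */
    eax += ecx;      /* add eax, ecx   */
    return eax;
}

/* Hypothetical three-operand style: each operation writes a fresh
   destination, so no separate copy instruction is needed. */
static uint32_t scale_three_operand(uint32_t eax)
{
    uint32_t t = eax + 12345u;   /* add t, eax, 12345  (hypothetical) */
    uint32_t s = t << 8;         /* shl s, t, 8        (hypothetical) */
    return t + s;                /* add result, t, s   (hypothetical) */
}
```

Whether the explicit copy really costs a cycle depends on the processor; the sketch only shows where the copy sits in the dependency chain.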

All the big software i have written would profit a lot from this. In fact
my prayer for new processors is that they can make a copy of a register
'on the fly' without losing a cycle.

A special-cased mov that doesn't eat an additional cycle would, i bet, make
programming and building compilers a lot easier. The sequential way in
which we normally program is a lot more natural than what i currently do in
DIEP, which is to slot other calculations that i need anyway in between
those dependent instructions.

So for example in C i have for quite some years now been writing
code such as

   a += 123;
   x += 15;
   b =  array[a];
   z = x+x; // x*2 or x<<1 possibly slower

Though i do such optimizations everywhere now so as not to lose too many
cycles, it is of course pretty sick to have to optimize like this. The mov
as you propose it would definitely have a major impact on the
average program. I expect less of an impact on DIEP than it would
have on for example crafty, though i am not 100% sure, because
there is one place where i can hardly avoid those penalties, and that's
around branches.
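
A minimal sketch of the interleaving idea from the snippet above, written as two C functions that return the same results. The struct and function names are hypothetical and not taken from DIEP. Note that an optimizing compiler or an out-of-order CPU may reorder either version, so this shows the intent of the source ordering, not a guaranteed speedup.

```c
/* Hypothetical result holder; field names follow the snippet in the post. */
struct result { int b; int z; };

/* Dependent statements back to back: each use directly follows
   the instruction that produces its input. */
static struct result grouped(const int *array, int a, int x)
{
    struct result r;
    a += 123;
    r.b = array[a];   /* needs the new a immediately */
    x += 15;
    r.z = x + x;      /* needs the new x immediately */
    return r;
}

/* Interleaved as in the post: the two chains alternate, so each
   dependent use is separated from its producer by unrelated work. */
static struct result interleaved(const int *array, int a, int x)
{
    struct result r;
    a += 123;
    x += 15;
    r.b = array[a];
    r.z = x + x;
    return r;
}
```

The caller has to make sure `array` holds at least `a + 124` elements; the sketch does no bounds checking.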

As soon as i have the info i put the branch to the cpu. That loses
an extra cycle for sure, because the branch compare uses the info just
calculated.

Now i have way more branches than crafty, so i do not know the total impact
of it at all, but i guess the average speedup of the programs would be
*huge* if this dream scenario happened.

There is definitely improvement possible here for cpu's!

>When you're talking about sequences that take 10-20 cycles, squeezing out
>another cycle is a 5-10% performance gain. Longer sequences stand to benefit a
>cycle from multiple places.

Way, way more than 5% for most chess programs. 10% is definitely the
estimate i would *start* at.
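
The 5-10% figure in the quote is plain arithmetic on cycle counts, and it can be checked with a small sketch. The function names are made up for illustration. The first function gives the share of cycles removed; the second gives the resulting throughput gain, which is slightly larger because the same work now fits in fewer cycles.

```c
/* Percentage of cycles removed when `saved` cycles are cut from a
   sequence that originally took `total` cycles. */
static double cycles_saved_percent(int total, int saved)
{
    return 100.0 * saved / total;
}

/* Corresponding throughput gain: the same work now takes
   (total - saved) cycles. Requires saved < total. */
static double speedup_percent(int total, int saved)
{
    return 100.0 * saved / (total - saved);
}
```

Saving one cycle out of 20 removes 5% of the cycles, and one out of 10 removes 10%, which matches the quoted range. Any estimate above that, as argued in the reply, assumes more than one cycle is saved per sequence.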

>>In that respect the development of x86 cpu's is not very positive for
>>computer chess.

>>Best regards,
>>Vincent



Last modified: Thu, 15 Apr 21 08:11:13 -0700
