Computer Chess Club Archives


Subject: Re: IA-64 vs OOOE (attn Taylor, Hyatt)

Author: Matt Taylor

Date: 20:06:47 02/12/03


On February 12, 2003 at 16:13:29, Tom Kerrigan wrote:

>On February 12, 2003 at 11:47:21, Robert Hyatt wrote:
>
>>But I don't think OOOE is _nearly_ as important for decent architectures as it
>>is for architectures that have significant design problems like X86.
>
>Fine, it's a 30% benefit to Alpha and MIPS, maybe it's 40% for x86...
>
>>>>The Cray T932 was the last 64 bit machine they built that I used.  And it
>>>How many NPS does Crafty get on it?
>>about 7M.
>>And that was Cray Blitz, not Crafty.  I have not tried to run Crafty on a Cray,
>
>7M per processor? How many processors did those Crays come with? 32? So Cray
>Blitz was searching 224M NPS? I stand corrected, the T932 is several times
>faster than any other processor ever made.
>
>>>>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>>>>It used a cmov, and it wasn't slower than the one with a branch.  If the
>>>On a Pentium III?
>>On a pentium IV.
>>Although I did test it on my PIII xeon box, so I guess the answer is "yes" to
>>the III as well...
>
>You "guess"? I never said anything about cmovs being bad for the P4, so why do
>you keep talking about the P4?
>
>-Tom

I tested cmov on a P3 earlier today. Latency and throughput are both 2 cycles
for the reg-reg form. The reg-mem form decodes to an additional micro-op, which
I presume means another cycle. 2-3 cycles is much, much less than the penalty
for a branch mispredict. Even the original Pentium has a 5-stage pipeline, so
the mispredict stall on a Pentium would be at least 4 cycles. Nothing to cry
over, but certainly slower than any cmov implementation I've seen (except
possibly the P4's).
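
For illustration, here is a minimal C sketch of the two shapes being compared: a
select written with a branch, and the same select written without one. The
function names are hypothetical and this is not the FirstOne() code discussed
above. A compiler is free to turn either version into a cmov or a jump, so the
mask form only shows the branch-free idea, not guaranteed machine code.

```c
#include <stdint.h>

/* Branchy select: may compile to a conditional jump, which costs a
 * pipeline flush whenever the predictor guesses wrong. */
static uint32_t select_branch(int cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* Branch-free select: build an all-ones or all-zeros mask from cond,
 * then pick a or b with AND/OR. The cost is the same on every call,
 * regardless of how predictable cond is. */
static uint32_t select_mask(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0); /* 0 or 0xFFFFFFFF */
    return (a & mask) | (b & ~mask);
}
```

Both functions return a when cond is nonzero and b otherwise; whether the
second one wins depends on how often the branch in the first is mispredicted.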

I don't have timings for cmov on the P4 because Intel didn't bother to list
them. However, setcc has 5 clocks latency and 1.5 clocks throughput. The adc and
sbb instructions are 6 clocks latency/2 throughput when used with immediates;
otherwise they are 8 clocks latency/3 throughput in reg-reg form. About the only
thing the P4 does efficiently is execute inefficient code -- convenient I guess
when a major compiler can't do any better than producing 386 code.
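
The setcc and sbb timings matter because those instructions are the usual way
to write branch-free code without cmov. As a rough sketch (the function name is
hypothetical, and the instruction mapping in the comments is an assumption
about typical compiler output, not something measured here), an unsigned
minimum can be built from a 0/1 comparison result and a mask:

```c
#include <stdint.h>

/* Branch-free unsigned minimum.
 * (a < b) produces 0 or 1, which is the kind of value setcc yields;
 * negating it gives 0 or all ones, the kind of mask a cmp/sbb pair
 * yields. Actual code generation is up to the compiler. */
static uint32_t min_u32(uint32_t a, uint32_t b)
{
    uint32_t lt = (uint32_t)(a < b); /* 0 or 1 */
    uint32_t mask = 0u - lt;         /* 0 or 0xFFFFFFFF */
    return b ^ ((a ^ b) & mask);     /* a if a < b, otherwise b */
}
```

With a 5-8 clock latency on each step of that chain, the branch-free version
has a much higher fixed cost on the P4 than the same sequence would elsewhere.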

-Matt



