Author: Matt Taylor
Date: 20:06:47 02/12/03
On February 12, 2003 at 16:13:29, Tom Kerrigan wrote:

>On February 12, 2003 at 11:47:21, Robert Hyatt wrote:
>
>>But I don't
>>think OOOE is _nearly_ as important for decent architectures as it is for
>>architectures
>>that have significant design problems like X86.
>
>Fine, it's a 30% benefit to Alpha and MIPS, maybe it's 40% for x86...
>
>>>>The Cray T932 was the last 64 bit machine they built that I used. And it
>>>How many NPS does Crafty get on it?
>>about 7M.
>>And that was Cray Blitz, not Crafty. I have not tried to run Crafty on a Cray,
>
>7M per processor? How many processors did those Crays come with? 32? So Crazy
>Blitz was searching 224M NPS? I stand corrected, the T932 is several times
>faster than any other processor ever made.
>
>>>>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>>>>It used a cmov, and it wasn't slower than the one with a branch. If the
>>>On a Pentium III?
>>On a pentium IV.
>>Although I did test it on my PIII xeon box, so I guess the answer is "yes" to
>>the III as
>>well...
>
>You "guess"? I never said anything about cmovs being bad for the P4, so why do
>you keep talking about the P4?
>
>-Tom

I tested cmov on P3 earlier today. Latency & throughput are 2 cycles for reg-reg form. The reg-mem decodes to an additional micro-op which I presume means another cycle.

2-3 cycles is much, much less than the penalty that occurs on a branch mispredict. Even the Pentium has a 5-cycle pipeline. The mispredict stall on a Pentium would be at least 4 cycles. Nothing to cry over, but certainly slower than any cmov implementation I've seen (except possibly the P4's).

I don't have timings for cmov on the P4 because Intel didn't bother to list them. However, setcc is 5 clocks latency and 1.5 throughput. The adc and sbb instructions are 6 clocks latency/2 throughput when used with immediates; otherwise they are 8 clocks latency/3 throughput in reg-reg form.
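To put rough numbers on the cmov-versus-mispredict comparison above, here is a back-of-the-envelope sketch in C. The function names and the model itself are illustrative assumptions, not measurements: it treats a branch as a fixed cost when predicted plus a penalty weighted by the mispredict rate, and treats cmov as a fixed latency. It ignores dependency chains and everything else a real pipeline does.

```c
/* Simplified cost model (an assumption, not a measurement).
   Expected cycles for a conditional implemented as a branch:
   hit_cost when predicted correctly, plus penalty per mispredict. */
double branch_expected_cycles(double hit_cost, double miss_rate, double penalty)
{
    return hit_cost + miss_rate * penalty;
}

/* Mispredict rate above which a fixed-latency cmov is cheaper than the
   branch under the model above. */
double cmov_break_even_rate(double cmov_cycles, double hit_cost, double penalty)
{
    return (cmov_cycles - hit_cost) / penalty;
}
```

Plugging in the 2-cycle reg-reg cmov and the 4-cycle lower bound on the mispredict stall from the figures above, plus an assumed 1 cycle for a correctly predicted branch, cmov_break_even_rate(2.0, 1.0, 4.0) comes out at 0.25. Under this model a branch that mispredicts more than a quarter of the time would lose to cmov, and a longer pipeline only pushes that threshold down.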
About the only thing the P4 does efficiently is execute inefficient code -- convenient I guess when a major compiler can't do any better than producing 386 code.

-Matt
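For readers who have not seen the two FirstOne() variants being argued about in the quoted text, a minimal C sketch follows. It is not the asm from the thread and not Crafty's code. The names first_one_branch, first_one_select and msb32 are hypothetical, bit 0 is assumed to be the least significant bit, and the loop in msb32 only stands in for what would be a single bsr instruction in asm.

```c
#include <stdint.h>

/* Index of the highest set bit in a 32-bit word (0 = least significant).
   Stand-in for a single bsr instruction; returns 0 for x == 0. */
static int msb32(uint32_t x)
{
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
}

/* Branching version: which half gets scanned depends on the data, so the
   branch can mispredict when the high half is unpredictably empty. */
int first_one_branch(uint64_t bb)
{
    uint32_t hi = (uint32_t)(bb >> 32);
    if (hi != 0)
        return 32 + msb32(hi);
    return msb32((uint32_t)bb);
}

/* Select version: choose the word and the offset with conditional
   expressions that a compiler may, but is not guaranteed to, lower to cmov. */
int first_one_select(uint64_t bb)
{
    uint32_t hi = (uint32_t)(bb >> 32);
    uint32_t lo = (uint32_t)bb;
    uint32_t word = hi ? hi : lo;   /* candidate for cmov */
    int base = hi ? 32 : 0;         /* candidate for cmov */
    return base + msb32(word);
}
```

Both functions return the same index for any nonzero bitboard. Whether the select version actually becomes cmov depends on the compiler and target flags, so checking the generated asm is the only way to know which variant is really being timed.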
Last modified: Thu, 15 Apr 21 08:11:13 -0700