Computer Chess Club Archives


Subject: Re: cmov isn't necessarily good

Author: Matt Taylor

Date: 15:06:29 07/20/03


On July 19, 2003 at 02:11:34, Tom Kerrigan wrote:

>On July 19, 2003 at 01:11:31, Robert Hyatt wrote:
>
>>On July 18, 2003 at 15:16:27, Tom Kerrigan wrote:
>>
>>>On July 18, 2003 at 04:05:52, Walter Faxon wrote:
>>>
>>>>>; 326  :     if (bbHalf) bb0 = bb1;              // will code as cmov (ideally)
>>>>>
>>>>>	test	ecx, ecx
>>>>>	je	SHORT $L806
>>>>>	mov	eax, DWORD PTR _bb$[esp]
>>>>>$L806:
>>>>>
>>>>
>>>>
>>>>Stupid compiler, not only no cmov
>>>
>>>IIRC, on the P6 (Pentium Pro, Pentium II, Pentium III), the cmov instruction
>>>gets translated into a string of uOps that's equivalent to testing, branching,
>>>and copying.
>>>
>>>In other words, there is no performance benefit (I believe there may actually be
>>>a performance penalty) to using cmov on a P6, and it breaks compatibility with
>>>pre-P6 processors, so it's little wonder the P6-era MS compiler doesn't generate
>>>cmovs.
>>>
>>>-Tom
>>
>>
>>I think the point is that the cmov eliminates any possibility of a branch
>>mis-prediction.  On the long PIV pipeline, that's a significant savings for
>>mis-predicted branches.
>>
>>Since Eugene's example shows that the new MSVC compiler is going to finally
>>emit cmov instructions, I'd assume there is a performance gain for doing
>>so.
>
>Yes, of course, I thought I had made it perfectly clear that I was talking about
>the _P6_ core. I wrote all of them out. Pentium Pro, Pentium II, Pentium III.
>_Not_ Pentium 4.
>
>-Tom

The cmov instruction is 2 u-ops on a P6 core. A jcc + mov pair is also 2 u-ops,
but it is 1 byte longer. In poorly scheduled code the jcc + mov can still beat
cmov because the P6 decoders are crappy. The first decoder can handle up to
4 u-ops/cycle. The other 2 decoders are limited to 1 u-op/cycle each. This means
the 2-u-op cmov has to go through the first decoder. The jcc and mov
instructions will use 2 of the 3 decoders, but they can go through any of the 3.
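
To make the decoder argument concrete, here is a rough sketch in C of the 4-1-1
decode rule described above. It is a simplified model of my own, not a
cycle-accurate P6 simulator: it assumes instructions decode strictly in order,
that the first decoder takes any instruction of up to 4 u-ops, and that the
other two only take single-u-op instructions. Fetch alignment, branch-related
decode limits and everything after decode are ignored, and the function name
decode_cycles is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified 4-1-1 decode model (an assumption, not a cycle-accurate P6
 * simulator). uops[i] is the u-op count of the i-th instruction in program
 * order. Returns the number of decode cycles the sequence needs. */
int decode_cycles(const int *uops, size_t n)
{
    int cycles = 0;
    size_t i = 0;

    while (i < n) {
        /* Decoder 0 takes the next instruction, whatever its size (1..4). */
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;

        /* Decoders 1 and 2 each take one more instruction, but only if it
         * is a single u-op. A multi-u-op instruction stops the group and
         * waits for decoder 0 in the next cycle. */
        for (int d = 1; d <= 2 && i < n && uops[i] == 1; d++)
            i++;

        cycles++;
    }
    return cycles;
}
```

Under this model, three cmovs that each follow a 1-u-op instruction
({1,2,1,2,1,2}) need 4 decode cycles, while the same code written as jcc + mov
(nine 1-u-op instructions) needs 3. Reordering so each cmov comes first in its
group ({2,1,2,1,2,1}) brings the cmov version down to 3 cycles as well.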

I suspect the cmov can also be worse than the jcc and mov on a Pentium 4. Intel
does not list a latency for cmov in the P4 optimization manual. They conveniently
decide it is not a common instruction and thus unimportant. The setcc
instruction is similar to cmovcc and has a 5-cycle latency, so I would assume
cmovcc is up in that range. The jcc and mov are 0.5 clocks each, so 1 cycle
total assuming a BTB hit. If your data is predictable, jcc + mov will beat
cmovcc, provided my assumptions are correct.
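
The trade-off can be put into a back-of-the-envelope formula. The sketch below
is only an illustration: it takes the 1-cycle jcc + mov figure and the guessed
5-cycle cmovcc figure from above, and adds a misprediction penalty as a
parameter. The penalty value is an assumption the caller has to supply (the 20
cycles used in the note below is a placeholder, not a measured number), and the
function names are invented for the example. It also treats the numbers as
directly comparable costs, which is crude.

```c
#include <assert.h>

/* Expected cost in cycles of the jcc + mov form: the well-predicted cost
 * plus the misprediction penalty weighted by how often the branch misses. */
double branch_cost(double base_cycles, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return base_cycles + miss_rate * miss_penalty;
}

/* Misprediction rate at which jcc + mov and cmovcc cost the same.
 * Below this rate the branch is cheaper; above it cmovcc is cheaper. */
double break_even_miss_rate(double base_cycles, double cmov_latency,
                            double miss_penalty)
{
    assert(miss_penalty > 0.0);
    return (cmov_latency - base_cycles) / miss_penalty;
}
```

With base_cycles = 1, cmov_latency = 5 and a placeholder penalty of 20 cycles,
the break-even misprediction rate comes out at (5 - 1) / 20, so under those
assumptions the branch wins whenever it is mispredicted less than about one
time in five.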

-Matt



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.