Author: Robert Hyatt
Date: 07:13:44 07/21/03
Go up one level in this thread
On July 20, 2003 at 18:06:29, Matt Taylor wrote:

>On July 19, 2003 at 02:11:34, Tom Kerrigan wrote:
>
>>On July 19, 2003 at 01:11:31, Robert Hyatt wrote:
>>
>>>On July 18, 2003 at 15:16:27, Tom Kerrigan wrote:
>>>
>>>>On July 18, 2003 at 04:05:52, Walter Faxon wrote:
>>>>
>>>>>>; 326 : if (bbHalf) bb0 = bb1; // will code as cmov (ideally)
>>>>>>
>>>>>>    test  ecx, ecx
>>>>>>    je    SHORT $L806
>>>>>>    mov   eax, DWORD PTR _bb$[esp]
>>>>>>$L806:
>>>>>>
>>>>>
>>>>>Stupid compiler, not only no cmov
>>>>
>>>>IIRC, on the P6 (Pentium Pro, Pentium II, Pentium III), the cmov instruction
>>>>gets translated into a string of uOps that's equivalent to testing, branching,
>>>>and copying.
>>>>
>>>>In other words, there is no performance benefit (I believe there may actually
>>>>be a performance penalty) to using cmov on a P6, and it breaks compatibility
>>>>with pre-P6 processors, so it's little wonder the P6-era MS compiler doesn't
>>>>generate cmovs.
>>>>
>>>>-Tom
>>>
>>>I think the point is that the cmov eliminates any possibility of a branch
>>>misprediction. On the long PIV pipeline, that's a significant savings for
>>>mispredicted branches.
>>>
>>>Since Eugene's example shows that the new MSVC compiler is finally going to
>>>emit cmov instructions, I'd assume there is a performance gain for doing so.
>>
>>Yes, of course. I thought I had made it perfectly clear that I was talking
>>about the _P6_ core. I wrote all of them out: Pentium Pro, Pentium II,
>>Pentium III. _Not_ Pentium 4.
>>
>>-Tom
>
>The cmov instruction is 2 u-ops on a P6 core. A jcc + mov is also 2 u-ops, but
>it is 1 byte longer. It is possible in poorly optimized code to see the jcc +
>mov beat cmov because the P6-core decoders are crappy. The first decoder can
>handle up to 4 u-ops/cycle; the other 2 decoders are limited to 1 u-op/cycle
>each. This means the cmov has to fit in the first decoder, while the jcc and
>mov instructions will use 2 of the 3 decoders but can fit in any of the 3.
>
>I suspect the cmov can also be worse than the jcc and mov on a Pentium 4. Intel
>does not list a latency for cmov in the P4 optimization manual; they
>conveniently decide it is not a common instruction and thus unimportant. The
>setcc instruction is similar to cmovcc and has a 5-cycle latency, so I would
>assume cmovcc is up in that range. The jcc and mov are both 0.5 clocks, so 1
>cycle total assuming a BTB hit. If your dataset is predictable, jcc + mov will
>beat cmovcc, if my assumptions are correct.
>
>-Matt

You overlook the most important difference. If the jump is mispredicted, a lot
of work gets flushed. With cmov, nothing gets mispredicted and nothing gets
dumped. This is a 50-50 type operation; in the case of chess, dynamic prediction
using a BTB might even be worse than static prediction, because the condition
flip-flops at every node in the tree (the PIV can catch this "pattern," but
earlier Intels do not). It is possible that a BTB implementation will mispredict
the jcc every time it hits it, due to this back-and-forth behavior at odd/even
plies. That's why cmov seems like a good idea to me.
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.