Author: Matt Taylor
Date: 15:06:29 07/20/03
On July 19, 2003 at 02:11:34, Tom Kerrigan wrote:

>On July 19, 2003 at 01:11:31, Robert Hyatt wrote:
>
>>On July 18, 2003 at 15:16:27, Tom Kerrigan wrote:
>>
>>>On July 18, 2003 at 04:05:52, Walter Faxon wrote:
>>>
>>>>>; 326 : if (bbHalf) bb0 = bb1; // will code as cmov (ideally)
>>>>>
>>>>> test ecx, ecx
>>>>> je SHORT $L806
>>>>> mov eax, DWORD PTR _bb$[esp]
>>>>>$L806:
>>>>>
>>>>
>>>>
>>>>Stupid compiler, not only no cmov
>>>
>>>IIRC, on the P6 (Pentium Pro, Pentium II, Pentium III), the cmov instruction
>>>gets translated into a string of uOps that's equivalent to testing, branching,
>>>and copying.
>>>
>>>In other words, there is no performance benefit (I believe there may actually be
>>>a performance penalty) to using cmov on a P6, and it breaks compatibility with
>>>pre-P6 processors, so it's little wonder the P6-era MS compiler doesn't generate
>>>cmovs.
>>>
>>>-Tom
>>
>>
>>I think the point is that the cmov eliminates any possibility of a branch
>>mis-prediction. On the long PIV pipeline, that's a significant savings for
>>mis-predicted branches.
>>
>>Since Eugene's example shows that the new MSVC compiler is going to finally
>>emit cmov instructions, I'd assume there is a performance gain for doing
>>so.
>
>Yes, of course, I thought I had made it perfectly clear that I was talking about
>the _P6_ core. I wrote all of them out. Pentium Pro, Pentium II, Pentium III.
>_Not_ Pentium 4.
>
>-Tom

The cmov instruction is 2 u-ops on a P6-core. A jcc + mov is also 2 u-ops, but it is 1 byte longer. It is possible in poorly-optimized code to see the jcc + mov beat cmov because the P6-core decoders are crappy. The first decoder can handle up to 4 u-ops/cycle. The other 2 decoders are limited to 1 u-op/cycle each. This means the cmov, at 2 u-ops, has to go through the first decoder. The jcc and mov instructions will use 2 of the 3 decoders, but they can fit in any of the 3.

I suspect the cmov can also be worse than the jcc and mov on a Pentium 4.
Intel does not list a latency for cmov in the P4 optimization manual. They conveniently decide it is not a common instruction and thus unimportant. The setcc instruction is similar to cmovcc and has a 5-cycle latency, so I would assume cmovcc is up in that range. The jcc and mov are both 0.5 clocks, so 1 cycle total assuming a BTB hit. If your dataset is predictable and my assumptions are correct, jcc & mov will beat cmovcc.

-Matt
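
To put rough numbers on the Pentium 4 argument above, here is a minimal sketch in C of the expected-cost comparison. It is only a back-of-the-envelope model, not a measurement. The function and parameter names are hypothetical. The 5-cycle cmovcc figure is the assumption from the post (borrowed from setcc), the 1-cycle figure is the predicted jcc + mov path, and the misprediction penalty is a value you have to supply yourself, since the post does not give one.

```c
#include <assert.h>

/*
 * Expected cycles for the jcc + mov sequence.
 *   miss_rate    - fraction of branches that mispredict (0.0 to 1.0)
 *   hit_cycles   - cost when the branch is predicted (1.0 in the post)
 *   miss_penalty - extra cycles per misprediction (ASSUMED, not from the post)
 */
double branch_cost(double miss_rate, double hit_cycles, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_cycles + miss_rate * miss_penalty;
}

/*
 * Misprediction rate at which jcc + mov costs the same as cmovcc.
 * Below this rate the branch is cheaper; above it cmovcc is cheaper.
 *   cmov_cycles - assumed cmovcc latency (about 5.0, guessed from setcc)
 */
double break_even_miss_rate(double cmov_cycles, double hit_cycles,
                            double miss_penalty)
{
    assert(miss_penalty > 0.0);
    return (cmov_cycles - hit_cycles) / miss_penalty;
}
```

For example, with 5 cycles for cmovcc, 1 cycle for a predicted jcc + mov, and a purely illustrative 20-cycle penalty, break_even_miss_rate(5.0, 1.0, 20.0) comes to about 0.2. Under those assumptions the branch wins as long as fewer than roughly one branch in five mispredicts, which is the "predictable dataset" case.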
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.