Author: Robert Hyatt
Date: 07:13:44 07/21/03
Go up one level in this thread
On July 20, 2003 at 18:06:29, Matt Taylor wrote:

>On July 19, 2003 at 02:11:34, Tom Kerrigan wrote:
>
>>On July 19, 2003 at 01:11:31, Robert Hyatt wrote:
>>
>>>On July 18, 2003 at 15:16:27, Tom Kerrigan wrote:
>>>
>>>>On July 18, 2003 at 04:05:52, Walter Faxon wrote:
>>>>
>>>>>>; 326 : if (bbHalf) bb0 = bb1; // will code as cmov (ideally)
>>>>>>
>>>>>>    test  ecx, ecx
>>>>>>    je    SHORT $L806
>>>>>>    mov   eax, DWORD PTR _bb$[esp]
>>>>>>$L806:
>>>>>>
>>>>>
>>>>>Stupid compiler, not only no cmov
>>>>
>>>>IIRC, on the P6 (Pentium Pro, Pentium II, Pentium III), the cmov instruction
>>>>gets translated into a string of uOps that's equivalent to testing, branching,
>>>>and copying.
>>>>
>>>>In other words, there is no performance benefit (I believe there may actually
>>>>be a performance penalty) to using cmov on a P6, and it breaks compatibility
>>>>with pre-P6 processors, so it's little wonder the P6-era MS compiler doesn't
>>>>generate cmovs.
>>>>
>>>>-Tom
>>>
>>>I think the point is that the cmov eliminates any possibility of a branch
>>>misprediction. On the long PIV pipeline, that's a significant savings for
>>>mispredicted branches.
>>>
>>>Since Eugene's example shows that the new MSVC compiler is finally going to
>>>emit cmov instructions, I'd assume there is a performance gain for doing so.
>>
>>Yes, of course. I thought I had made it perfectly clear that I was talking
>>about the _P6_ core. I wrote all of them out: Pentium Pro, Pentium II,
>>Pentium III. _Not_ Pentium 4.
>>
>>-Tom
>
>The cmov instruction is 2 u-ops on a P6 core. A jcc + mov is also 2 u-ops, but
>it is 1 byte longer. It is possible in poorly optimized code to see the jcc +
>mov beat cmov because the P6-core decoders are crappy. The first decoder can
>handle up to 4 u-ops/cycle; the other 2 decoders are limited to 1 u-op/cycle
>each. This means the cmov has to fit in the first decoder, while the jcc and
>mov instructions will use 2 of the 3 decoders but can fit in any of the 3.
>
>I suspect the cmov can also be worse than the jcc and mov on a Pentium 4. Intel
>does not list a latency for cmov in the P4 optimization manual; they
>conveniently decide it is not a common instruction and thus unimportant. The
>setcc instruction is similar to cmovcc and has a 5-cycle latency, so I would
>assume cmovcc is up in that range. The jcc and mov are both 0.5 clocks, so 1
>cycle total assuming a BTB hit. If your dataset is predictable, jcc + mov will
>beat cmovcc, if my assumptions are correct.
>
>-Matt

You overlook the most important difference. If the jump is mispredicted, a lot
of work gets flushed. With cmov, nothing gets mispredicted and nothing gets
dumped. This is a 50-50 type operation; in the case of chess, dynamic prediction
using a BTB might even be worse than static prediction, because the condition
flip-flops at every node in the tree (the PIV can catch this "pattern," but
earlier Intels do not). It is possible that a BTB implementation will mispredict
the jcc every time it hits it, due to this back-and-forth behavior at odd/even
plies. That's why cmov seems like a good idea to me.
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.