Author: Vincent Diepeveen
Date: 12:45:49 11/23/05
On November 22, 2005 at 20:13:04, Eugene Nalimov wrote:

>On November 22, 2005 at 19:07:16, Dieter Buerssner wrote:
>
>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>
>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>
>>>>[...]
>>>>I guess, you mean this as a substitution for
>>>>  if (depth < 0)
>>>>    fm = fm1;
>>>>  else
>>>>    fm = fm2;
>>>>
>>>>I am surprised, that compilers are not able to do this themselves. I
>>>
>>>I several times tried to modify Visual C to recognize additional cases where we
>>>should emit conditional moves (last time was probably a year ago for
>>>x64-targeting compiler). Every time I could demonstrate win on a small
>>>artificial test case, but every large real world program either showed no gain
>>>or slowed down.
>>>
>>>I suspect there are several reasons for this:
>>>* branch predictors are good, and majority of branches can be correctly
>>>predicted
>>>* CMOV is long instruction; short branch is shorter, so program with less CMOVs
>>>fits better into cache
>>>* there is no 8-bit form of CMOV
>>>* for invalid address "CMOV reg, memory" will give you access violation even if
>>>condition is false.
>>
>>I don't really understand the last reason.
>>
>>It might be possible, to detect (guess) cases, that are the real inner loops.
>>That should practically get rid of "CMOV is long instruction; short branch is
>>shorter, so program with less CMOVs fits better into cache".
>
>For lot of programs there are no inner loops where program spends majority of
>its time. Examples are operating systems, databases, compilers, office
>applications, etc. For such programs profile is flat -- i.e. you don't have 3
>functions that collectively use (say) 70% of total execution time. Instead you
>have 150 "hot" functions, where hottest takes less than 2% of total time, and
>majority take 0.5-1%. There are no hot loops -- typical number of iterations is
>less than 5.
>That is very compiler-unfriendly situation; you cannot just
>generate locally optimal code ignoring its size. For such cases best approach is
>to generate reasonable (though not locally fastest) code, and carefully try to
>generate smallest possible code. You have to carefully limit optimizations
>increasing code size such as loop unrolling, inlining, etc.
>
>That is exactly situation where Visual C shines -- our main customer is
>Microsoft itself. Windows, IE, MS Office, MS SQL Server, etc. are all such
>applications. I heard one anecdotal evidence about customer who was unhappy with
>code we generates for their server application, so they spend lot of resources
>migrating to the compiler provided by (famous) CPU vendor. Resulting executable
>run much slower than executable produced by Visual C; difference was tens of
>percents.
>
>Generating CMOVs for such applications can grow code size by (say) 1-2%. It
>happens that less branch mispredicts does not compensate for more frequent
>I-cache misses.
>
>(The most pathological case I analyzed was one of the SpecInt2k benchmarks.
>Hottest function in the program is huge recursive function. It contains one
>loop. That loop contains switch statement, and there are lot of recursive calls
>inside that switch. Average number of loop iterations is less than 2. As a
>result, the more code you hoist out of the loop, the slower program becomes --
>function prologue is (almost) hottest code in the function).
>
>>About "there is no 8-bit form of CMOV". If that really hurts, it should be
>>rather easy, to only use them in the other cases?
>
>That is exactly what compilers are doing. I was just pointing that there are
>omissions making CMOVs less useful.
>
>>Thanks for your interesting input,
>>Dieter

Are you sure that the reasons above are the only reasons not to use CMOVs?

Isn't another really important reason the hard fact that CMOV takes just 2 cycles on AMD hardware versus 7 cycles on Prescott?
That means a 3.5 times speed advantage for AMD over Intel.

Intel C++ mercilessly *does* generate CMOVs when you optimize for CPUs other than the P4. M$ hardly generates them, nor, as far as I know, does it have an optimization option that forces their use.

Diep is faster with CMOVs than without. Some prime-searching code I've got here is nearly 2 times faster with CMOVs. GCC generates the fastest code for it of all compilers. Intel C++ comes close when using Pentium-M optimizations and generating code for P3-compatible CPUs. Visual C++ is 50% slower there because it simply doesn't want to generate CMOVs *ever*.

Wintel.
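
For anyone who wants to check what their own compiler does with the quoted if/else, here is a minimal C sketch. The function names (select_branch, select_ternary, select_mask, load_or_default) are made up for illustration and do not come from Diep or from any compiler. Whether the first two forms become a CMOV or a branch depends entirely on the compiler, target and flags, so the only reliable check is to read the generated assembly (for example with gcc -O2 -S). The mask form assumes two's complement integers. The last function illustrates the quoted "CMOV reg, memory" point: the load may only happen when the pointer is valid, so a compiler cannot simply replace the branch with an unconditional load plus CMOV.

```c
#include <stddef.h>

/* Branchy form, exactly as in the quoted snippet. */
int select_branch(int depth, int fm1, int fm2)
{
    int fm;
    if (depth < 0)
        fm = fm1;
    else
        fm = fm2;
    return fm;
}

/* Ternary form. A compiler is free to emit CMOV here, but not obliged to. */
int select_ternary(int depth, int fm1, int fm2)
{
    return depth < 0 ? fm1 : fm2;
}

/* Mask form: no branch in the source at all.
 * mask is -1 (all bits set) when depth < 0, otherwise 0.
 * Assumes two's complement representation of int. */
int select_mask(int depth, int fm1, int fm2)
{
    int mask = -(depth < 0);
    return fm2 ^ ((fm1 ^ fm2) & mask);
}

/* The load from *p is only legal when p is non-null, so this cannot be
 * turned into an unconditional memory read followed by a select. */
int load_or_default(const int *p, int dflt)
{
    return p != NULL ? *p : dflt;
}
```

All three select functions return fm1 for a negative depth and fm2 otherwise, so they can be swapped in a benchmark to see which one a given compiler and CPU prefer.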
Last modified: Thu, 15 Apr 21 08:11:13 -0700