Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Avoiding if ( expr ) and branch misprediction ?

Author: Dann Corbit

Date: 15:09:34 01/19/06

Go up one level in this thread


On January 19, 2006 at 17:11:04, Gerd Isenberg wrote:

>On January 19, 2006 at 15:58:20, Dann Corbit wrote:
>
>>On January 19, 2006 at 15:19:22, Gerd Isenberg wrote:
>>
>>>>>>assert(expression == 0 || expression == 1);
>>>>>>foo[expression]();
>>>>>
>>>>>No - avoid that for simple expressions!
>>>>
>>>>You might be surprised:
>>>>http://chessprogramming.org/cccsearch/ccc.php?art_id=178068
>>>>
>>>>Depends on the compiler, of course.
>>>
>>>Ok, a few years old - it was quite ok on P3 iirc.
>>>To my knowledge on recent processors (amd64) indirect branches with (not
>>>necessarily random) changed target addresses implies a missprediction.
>>>Also i would never replace a simple array access by indrect branch.
>>>
>>> a = x[boolexpression];
>>> a = foo[boolexpression](); // returns either x[false] or x[true]
>>
>>The program:
>>#include <math.h>
>>#include <stdio.h>
>>
>>typedef double  (*f_t) (double);
>>static f_t      f[] =
>>{log, log10, sqrt, cos, cosh, exp, sin, sinh, tan, tanh, 0};
>>
>>static double   accum0 = 0;
>>static double   accum1 = 0;
>>static double   accum2 = 0;
>>
>>
>>void            arr(void)
>>{
>>    int             i;
>>    double          d = 0;
>>    for (i = 0; f[i]; i++) {
>>        d += f[i] (0.5);
>>    }
>>    accum0 += d;
>>}
>>
>>void            poi(void)
>>{
>>    f_t            *flist = f;
>>    double          d = 0;
>>    while (*flist) {
>>        f_t             ff = *flist;
>>        d += ff(0.5);
>>        flist++;
>>    }
>>    accum1 += d;
>>}
>>
>>void            swi(void)
>>{
>>    int             i;
>>    double          d = 0;
>>    for (i = 0; f[i]; i++) {
>>        switch (i) {
>>        case 0:
>>            d += f[0] (0.5);
>>            break;
>>        case 1:
>>            d += f[1] (0.5);
>>            break;
>>        case 2:
>>            d += f[2] (0.5);
>>            break;
>>        case 3:
>>            d += f[3] (0.5);
>>            break;
>>        case 4:
>>            d += f[4] (0.5);
>>            break;
>>        case 5:
>>            d += f[5] (0.5);
>>            break;
>>        case 6:
>>            d += f[6] (0.5);
>>            break;
>>        case 7:
>>            d += f[7] (0.5);
>>            break;
>>        case 8:
>>            d += f[8] (0.5);
>>            break;
>>        case 9:
>>            d += f[9] (0.5);
>>            break;
>>        default:
>>            break;
>>        }
>>    }
>>    accum2 += d;
>>}
>>
>>int             main(void)
>>{
>>    long            i;
>>    for (i = 0; i < 1000000; i++) {
>>        arr();
>>        poi();
>>        swi();
>>    }
>>    printf("%.20g, %.20g, %.20g\n", accum0, accum1, accum2);
>>    return 0;
>>}
>>
>>given compiler flags:
>>/Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D
>>"_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /FD /MT /Zp16 /Gy /Fp".\Release/test.pch"
>>/Fo".\Release/" /Fd".\Release/" /W3 /nologo /c /Zi /Gd /TP /errorReport:prompt
>>
>>for compiler:
>>icrosoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86
>>opyright (C) Microsoft Corporation.  All rights reserved.
>>
>>Profile data:
>>Segment	Offset	Address	Size	Function	Class	Full Name	Module	Source File	Timer
>>Samples	Resolved Module Path	Timer events	Timer %
>>1	7536	11632	596	CItan_pentium4		CItan_pentium4	test.exe		156	c:\tmp\release\test.exe	156000	10.96275474
>>1	3904	8000	103	trandisp1		trandisp1	test.exe		149	c:\tmp\release\test.exe	149000	10.47083626
>>1	7019	11115	406	ctrandisp2		ctrandisp2	test.exe		104	c:\tmp\release\test.exe	104000	7.308503162
>>1	5488	9584	620	CIlog_pentium4		CIlog_pentium4	test.exe		101	c:\tmp\release\test.exe	101000	7.097680956
>>1	5056	9152	428	CIcos_pentium4		CIcos_pentium4	test.exe		100	c:\tmp\release\test.exe	100000	7.027406887
>>1	14384	18480	456	CIsin_pentium4		CIsin_pentium4	test.exe		86	c:\tmp\release\test.exe	86000	6.043569923
>>1	13024	17120	644	CIlog10_pentium4		CIlog10_pentium4	test.exe		85	c:\tmp\release\test.exe	85000	5.973295854
>>1	192	4288	408	swi		swi	test.exe	c:\tmp\test.c	85	c:\tmp\release\test.exe	85000	5.973295854
>>1	7476	11572	60	fload		fload	test.exe		80	c:\tmp\release\test.exe	80000	5.621925509
>>1	1600	5696	63	tan		tan	test.exe		66	c:\tmp\release\test.exe	66000	4.638088545
>>1	13680	17776	701	CIexp_pentium4		CIexp_pentium4	test.exe		65	c:\tmp\release\test.exe	65000	4.567814476
>>1	7425	11521	51	ctrandisp1		ctrandisp1	test.exe		49	c:\tmp\release\test.exe	49000	3.443429375
>>1	2592	6688	59	exp		exp	test.exe		45	c:\tmp\release\test.exe	45000	3.162333099
>>1	1216	5312	63	log		log	test.exe		37	c:\tmp\release\test.exe	37000	2.600140548
>>1	720	4816	186	CIsqrt		CIsqrt	test.exe		35	c:\tmp\release\test.exe	35000	2.45959241
>>1	912	5008	63	cos		cos	test.exe		35	c:\tmp\release\test.exe	35000	2.45959241
>>1	0	4096	96	arr		arr	test.exe	c:\tmp\test.c	29	c:\tmp\release\test.exe	29000	2.037947997
>>1	2256	6352	63	log10		log10	test.exe		28	c:\tmp\release\test.exe	28000	1.967673928
>>1	2736	6832	63	sin		sin	test.exe		28	c:\tmp\release\test.exe	28000	1.967673928
>>1	96	4192	92	poi		poi	test.exe	c:\tmp\test.c	20	c:\tmp\release\test.exe	20000	1.405481377
>>1	4677	8773	67	fload_withFB		fload_withFB	test.exe		11	c:\tmp\release\test.exe	11000	0.773014758
>>1	1569	5665	7	tanh		tanh	test.exe		9	c:\tmp\release\test.exe	9000	0.63246662
>>1	1562	5658	7	cosh		cosh	test.exe		8	c:\tmp\release\test.exe	8000	0.562192551
>>1	1552	5648	10	sinh		sinh	test.exe		8	c:\tmp\release\test.exe	8000	0.562192551
>>1	4779	8875	42	math_exit		math_exit	test.exe		2	c:\tmp\release\test.exe	2000	0.140548138
>>1	608	4704	88	main		main	test.exe	c:\tmp\test.c	2	c:\tmp\release\test.exe	2000	0.140548138
>>
>>Switch is 5.97% of execution time.
>>Pointer is 1.4% of execution time.
>
>
>This has nothing to do with getting preinitialized data from arrays.
>It is switch aka indirect jump verus indirect call.
>You may try an explicite if-else-else cascade ;-)

For sure, the if else cascade will be far, far slower in an even distribution
like this.  On average there will be N/2 mispredicted branches where N is the
nuber of distinct cases.  If one function happens almost all the time, we can
put it first and get a big benefit from that.

The switch has the advantage that the function pointer is hardwired.  Which
ought to favor the switch.

>>
>>4 times faster.
>
>Strange!? Switch is most likely conditional jmp with a leading range check. So i
>don't trust that profiling so much with soo short routines amd probably huge
>"sampling" errors (i guess). Also the return overhead "pairs" with the x87
>instructions, which favors the call a bit.

I trust the profile output because it jibes with what I have seen before.  On
the other hand, the Intel compiler behaves differently here.

In short, I think it is a mistake to favor goofy tricks to try to make something
faster.  We should code in the way that is the most clear.  If there is some
spot that the profiler tells us is very slow and we cannot figure out a good
algorithm improvment, then we can look at cheesy hacks like those mentioned here
to improve it.

>>
>>Of course, as always, YMMV.
>>
>>Hardware was:
>> CPU Identification utility v1.11                 (c) 1997-2005 Jan Steunebrink
>>
>>„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ
>> CPU Vendor and Model: AMD Athlon 64 2800+-3700+
>> Internal CPU speed  : 2199.4 MHz  (using internal Time Stamp Counter)
>> Clock Multiplier    : Available only in Real Mode!
>> CPU-ID Vendor string: AuthenticAMD
>> CPU-ID Name string  : AMD Athlon(tm) 64 Processor 3400+
>> CPU-ID Signature    : 0F4A
>>                       „ „ „ „¤„Ÿ Stepping or sub-model no.
>>                       „ „ „¤„Ÿ Model: Indicates CPU Model and 486 L1 cache mode
>>                       „ „¤„Ÿ Family: 4=486, Am5x86, Cx5x86
>>                       „            5=Pentium, Nx586, Cx6x86, K5/K6, C6, mP6
>>                       „            6=PentiumPro/II/III, CxMII/III, Athlon, C3
>>                       „            F=Pentium4, Athlon64
>>                       „¤„Ÿ Type: 0=Standard, 1=Overdrive, 2=2nd Dual Pentium
>> Current CPU mode    : Protected
>> Internal (L1) cache : Enabled in Write-Back mode



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.