Author: Dann Corbit
Date: 15:09:34 01/19/06
Go up one level in this thread
On January 19, 2006 at 17:11:04, Gerd Isenberg wrote:
>On January 19, 2006 at 15:58:20, Dann Corbit wrote:
>
>>On January 19, 2006 at 15:19:22, Gerd Isenberg wrote:
>>
>>>>>>assert(expression == 0 || expression == 1);
>>>>>>foo[expression]();
>>>>>
>>>>>No - avoid that for simple expressions!
>>>>
>>>>You might be surprised:
>>>>http://chessprogramming.org/cccsearch/ccc.php?art_id=178068
>>>>
>>>>Depends on the compiler, of course.
>>>
>>>Ok, a few years old - it was quite ok on P3 iirc.
>>>To my knowledge on recent processors (amd64) indirect branches with (not
>>>necessarily random) changed target addresses implies a missprediction.
>>>Also i would never replace a simple array access by indrect branch.
>>>
>>> a = x[boolexpression];
>>> a = foo[boolexpression](); // returns either x[false] or x[true]
>>
>>The program:
>>#include <math.h>
>>#include <stdio.h>
>>
>>typedef double (*f_t) (double);
>>static f_t f[] =
>>{log, log10, sqrt, cos, cosh, exp, sin, sinh, tan, tanh, 0};
>>
>>static double accum0 = 0;
>>static double accum1 = 0;
>>static double accum2 = 0;
>>
>>
>>void arr(void)
>>{
>> int i;
>> double d = 0;
>> for (i = 0; f[i]; i++) {
>> d += f[i] (0.5);
>> }
>> accum0 += d;
>>}
>>
>>void poi(void)
>>{
>> f_t *flist = f;
>> double d = 0;
>> while (*flist) {
>> f_t ff = *flist;
>> d += ff(0.5);
>> flist++;
>> }
>> accum1 += d;
>>}
>>
>>void swi(void)
>>{
>> int i;
>> double d = 0;
>> for (i = 0; f[i]; i++) {
>> switch (i) {
>> case 0:
>> d += f[0] (0.5);
>> break;
>> case 1:
>> d += f[1] (0.5);
>> break;
>> case 2:
>> d += f[2] (0.5);
>> break;
>> case 3:
>> d += f[3] (0.5);
>> break;
>> case 4:
>> d += f[4] (0.5);
>> break;
>> case 5:
>> d += f[5] (0.5);
>> break;
>> case 6:
>> d += f[6] (0.5);
>> break;
>> case 7:
>> d += f[7] (0.5);
>> break;
>> case 8:
>> d += f[8] (0.5);
>> break;
>> case 9:
>> d += f[9] (0.5);
>> break;
>> default:
>> break;
>> }
>> }
>> accum2 += d;
>>}
>>
>>int main(void)
>>{
>> long i;
>> for (i = 0; i < 1000000; i++) {
>> arr();
>> poi();
>> swi();
>> }
>> printf("%.20g, %.20g, %.20g\n", accum0, accum1, accum2);
>> return 0;
>>}
>>
>>given compiler flags:
>>/Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D
>>"_VC80_UPGRADE=0x0600" /D "_MBCS" /GF /FD /MT /Zp16 /Gy /Fp".\Release/test.pch"
>>/Fo".\Release/" /Fd".\Release/" /W3 /nologo /c /Zi /Gd /TP /errorReport:prompt
>>
>>for compiler:
>>icrosoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86
>>opyright (C) Microsoft Corporation. All rights reserved.
>>
>>Profile data:
>>Segment Offset Address Size Function Class Full Name Module Source File Timer
>>Samples Resolved Module Path Timer events Timer %
>>1 7536 11632 596 CItan_pentium4 CItan_pentium4 test.exe 156 c:\tmp\release\test.exe 156000 10.96275474
>>1 3904 8000 103 trandisp1 trandisp1 test.exe 149 c:\tmp\release\test.exe 149000 10.47083626
>>1 7019 11115 406 ctrandisp2 ctrandisp2 test.exe 104 c:\tmp\release\test.exe 104000 7.308503162
>>1 5488 9584 620 CIlog_pentium4 CIlog_pentium4 test.exe 101 c:\tmp\release\test.exe 101000 7.097680956
>>1 5056 9152 428 CIcos_pentium4 CIcos_pentium4 test.exe 100 c:\tmp\release\test.exe 100000 7.027406887
>>1 14384 18480 456 CIsin_pentium4 CIsin_pentium4 test.exe 86 c:\tmp\release\test.exe 86000 6.043569923
>>1 13024 17120 644 CIlog10_pentium4 CIlog10_pentium4 test.exe 85 c:\tmp\release\test.exe 85000 5.973295854
>>1 192 4288 408 swi swi test.exe c:\tmp\test.c 85 c:\tmp\release\test.exe 85000 5.973295854
>>1 7476 11572 60 fload fload test.exe 80 c:\tmp\release\test.exe 80000 5.621925509
>>1 1600 5696 63 tan tan test.exe 66 c:\tmp\release\test.exe 66000 4.638088545
>>1 13680 17776 701 CIexp_pentium4 CIexp_pentium4 test.exe 65 c:\tmp\release\test.exe 65000 4.567814476
>>1 7425 11521 51 ctrandisp1 ctrandisp1 test.exe 49 c:\tmp\release\test.exe 49000 3.443429375
>>1 2592 6688 59 exp exp test.exe 45 c:\tmp\release\test.exe 45000 3.162333099
>>1 1216 5312 63 log log test.exe 37 c:\tmp\release\test.exe 37000 2.600140548
>>1 720 4816 186 CIsqrt CIsqrt test.exe 35 c:\tmp\release\test.exe 35000 2.45959241
>>1 912 5008 63 cos cos test.exe 35 c:\tmp\release\test.exe 35000 2.45959241
>>1 0 4096 96 arr arr test.exe c:\tmp\test.c 29 c:\tmp\release\test.exe 29000 2.037947997
>>1 2256 6352 63 log10 log10 test.exe 28 c:\tmp\release\test.exe 28000 1.967673928
>>1 2736 6832 63 sin sin test.exe 28 c:\tmp\release\test.exe 28000 1.967673928
>>1 96 4192 92 poi poi test.exe c:\tmp\test.c 20 c:\tmp\release\test.exe 20000 1.405481377
>>1 4677 8773 67 fload_withFB fload_withFB test.exe 11 c:\tmp\release\test.exe 11000 0.773014758
>>1 1569 5665 7 tanh tanh test.exe 9 c:\tmp\release\test.exe 9000 0.63246662
>>1 1562 5658 7 cosh cosh test.exe 8 c:\tmp\release\test.exe 8000 0.562192551
>>1 1552 5648 10 sinh sinh test.exe 8 c:\tmp\release\test.exe 8000 0.562192551
>>1 4779 8875 42 math_exit math_exit test.exe 2 c:\tmp\release\test.exe 2000 0.140548138
>>1 608 4704 88 main main test.exe c:\tmp\test.c 2 c:\tmp\release\test.exe 2000 0.140548138
>>
>>Switch is 5.97% of execution time.
>>Pointer is 1.4% of execution time.
>
>
>This has nothing to do with getting preinitialized data from arrays.
>It is switch aka indirect jump verus indirect call.
>You may try an explicite if-else-else cascade ;-)
For sure, the if else cascade will be far, far slower in an even distribution
like this. On average there will be N/2 mispredicted branches where N is the
nuber of distinct cases. If one function happens almost all the time, we can
put it first and get a big benefit from that.
The switch has the advantage that the function pointer is hardwired. Which
ought to favor the switch.
>>
>>4 times faster.
>
>Strange!? Switch is most likely conditional jmp with a leading range check. So i
>don't trust that profiling so much with soo short routines amd probably huge
>"sampling" errors (i guess). Also the return overhead "pairs" with the x87
>instructions, which favors the call a bit.
I trust the profile output because it jibes with what I have seen before. On
the other hand, the Intel compiler behaves differently here.
In short, I think it is a mistake to favor goofy tricks to try to make something
faster. We should code in the way that is the most clear. If there is some
spot that the profiler tells us is very slow and we cannot figure out a good
algorithm improvment, then we can look at cheesy hacks like those mentioned here
to improve it.
>>
>>Of course, as always, YMMV.
>>
>>Hardware was:
>> CPU Identification utility v1.11 (c) 1997-2005 Jan Steunebrink
>>
>>„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ„Ÿ
>> CPU Vendor and Model: AMD Athlon 64 2800+-3700+
>> Internal CPU speed : 2199.4 MHz (using internal Time Stamp Counter)
>> Clock Multiplier : Available only in Real Mode!
>> CPU-ID Vendor string: AuthenticAMD
>> CPU-ID Name string : AMD Athlon(tm) 64 Processor 3400+
>> CPU-ID Signature : 0F4A
>> „ „ „ „¤„Ÿ Stepping or sub-model no.
>> „ „ „¤„Ÿ Model: Indicates CPU Model and 486 L1 cache mode
>> „ „¤„Ÿ Family: 4=486, Am5x86, Cx5x86
>> „ 5=Pentium, Nx586, Cx6x86, K5/K6, C6, mP6
>> „ 6=PentiumPro/II/III, CxMII/III, Athlon, C3
>> „ F=Pentium4, Athlon64
>> „¤„Ÿ Type: 0=Standard, 1=Overdrive, 2=2nd Dual Pentium
>> Current CPU mode : Protected
>> Internal (L1) cache : Enabled in Write-Back mode
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.