Author: Gerd Isenberg
Date: 07:05:22 07/07/03
On July 07, 2003 at 09:34:58, Bo Persson wrote:

>On July 07, 2003 at 08:48:39, Gerd Isenberg wrote:
>
>>at least one colleague has the same strange effect as Dieter:
>>
>>Gerd
>>
>>
>>Gerd P4 2.4GHz:
>>       nothing   3951541892   13.390
>>         abs()   1713113360   13.141
>>  simple_abs()   1713113360   19.562
>>    omid_abs()   1713113360   13.672
>>     sbb_abs()   1713113360   17.969
>>     cdq_abs()   1713113360   17.625
>>    fish_abs()   1713113360   21.750
>>     sar_abs()   1713113360   16.984
>>   cmovl_abs()   1713113360   16.782
>>   cmovs_abs()   1713113360   16.781
>>
>
>This isn't as strange as it might seem. We are trying to time an *extremely*
>small piece of code. The instructions selected by the compiler actually execute
>at a different speed on different processors.
>
>I have MSVC 7.1 where do_nothing results in:
>
>; 304  : 	for (i = 0; i < MAX_ITERATIONS; ++i) {
>; 305  :
>; 306  : 		// subtract so we get both positive and negative numbers
>; 307  : 		int a = rand() - 16384;
>
>  00020	e8 00 00 00 00	 call	 _rand
>  00025	4f		 dec	 edi
>
>; 308  :
>; 309  : 		sum += a;
>
>  00026	8d b4 06 00 c0
>	ff ff		 lea	 esi, DWORD PTR [esi+eax-16384]
>  0002d	75 f1		 jne	 SHORT $L10491
>
>; 310  : 	}
>
>Here an LEA is used to compute sum + rand() - 16384 in a single instruction!
>
>while test_abs is just slightly different:
>
>; 25   : 	for (i = 0; i < MAX_ITERATIONS; ++i) {
>; 26   :
>; 27   : 		// subtract so we get both positive and negative numbers
>; 28   : 		int a = rand() - 16384;
>
>  00020	e8 00 00 00 00	 call	 _rand
>  00025	2d 00 40 00 00	 sub	 eax, 16384		; 00004000H
>
>; 29   :
>; 30   : 		sum += abs(a);
>
>  0002a	99		 cdq
>  0002b	33 c2		 xor	 eax, edx
>  0002d	2b c2		 sub	 eax, edx
>  0002f	03 f0		 add	 esi, eax
>  00031	4f		 dec	 edi
>  00032	75 ec		 jne	 SHORT $L10356
>
>
>On a P4 the LEA instruction is broken up into several (but unspecified)
>micro-ops. It is not fast - in fact Intel says that it is no longer an
>optimization to use it! On the PIII, of course, it has dedicated hardware...
>

aha, yes but that's not the point, see below.

>Except for the CDQ, all the other instructions are in the core RISC set, which
>executes at up to 3 instructions per clock on a P4.
>
>
>So doing something fast *can* be quicker than doing nothing slowly. :-)

at least most often ;-)

>
>
>Bo Persson
>bop2@telia.com

Hi Bo,

this omid_abs was strange:

Sebastian's "High Media" P4 2GHz
       nothing   3951541892   18.666
         abs()   1713113360   19.959
  simple_abs()   1713113360   25.487
    omid_abs()   1713113360   11.116 !!!!!!!!
     sbb_abs()   1713113360   24.365
     cdq_abs()   1713113360   24.235
    fish_abs()   1713113360   29.522
     sar_abs()   1713113360   23.083
   cmovl_abs()   1713113360   24.325
   cmovs_abs()   1713113360   24.325

and this from Dieter's P4:

MSVC, Russel's code, -Ox2 -Ob2 -G6 -Gr -GF
       nothing   3951541892   13.309
         abs()   1713113360   14.400
  simple_abs()   1713113360   17.936
    omid_abs()   1713113360    7.932 !!! Yes, reproducible
     sbb_abs()   1713113360   17.144
     cdq_abs()   1713113360   17.555
    fish_abs()   1713113360   20.900
     sar_abs()   1713113360   16.464
   cmovl_abs()   1713113360   17.365
   cmovs_abs()   1713113360   17.345

http://www.talkchess.com/forums/1/message.html?304949
and following.

Do you have an explanation for adding code and doubling the speed?

Regards,
Gerd
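
P.S. For readers who want to see what the cdq / xor / sub sequence in the quoted MSVC listing does, here is a minimal C sketch of the same idea. It is not the test source used in this thread: the names branchless_abs and count_mismatches are hypothetical, and the code assumes that right-shifting a negative int is an arithmetic shift (implementation-defined in C, though that is how the common x86 compilers behave). INT_MIN is excluded, since negating it overflows.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper mirroring: cdq / xor eax,edx / sub eax,edx.
   Assumption: >> on a negative int replicates the sign bit. */
static int branchless_abs(int a)
{
    /* mask is 0 for a >= 0 and -1 for a < 0, like EDX after CDQ */
    int mask = a >> (sizeof(int) * CHAR_BIT - 1);

    /* a >= 0: (a ^ 0) - 0 == a
       a <  0: (a ^ -1) - (-1) == ~a + 1 == -a */
    return (a ^ mask) - mask;
}

/* Hypothetical check: compare against the library abs() over [lo, hi]
   and return the number of values where the two disagree. */
static int count_mismatches(int lo, int hi)
{
    int a;
    int mismatches = 0;

    assert(lo > INT_MIN);        /* -INT_MIN is not representable */
    assert(lo <= hi);
    assert(hi < INT_MAX);        /* keeps ++a from overflowing */

    for (a = lo; a <= hi; ++a) {
        if (branchless_abs(a) != abs(a))
            ++mismatches;
    }
    return mismatches;
}
```

Calling count_mismatches(-16384, 16383) covers every value that rand() - 16384 can take when RAND_MAX is 32767, as it is with MSVC. This only illustrates what the instruction sequence computes; it says nothing about the timing differences discussed above.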
Last modified: Thu, 15 Apr 21 08:11:13 -0700