Author: Gerd Isenberg
Date: 07:05:22 07/07/03
On July 07, 2003 at 09:34:58, Bo Persson wrote:
>On July 07, 2003 at 08:48:39, Gerd Isenberg wrote:
>
>>At least one colleague sees the same strange effect as Dieter:
>>
>>Gerd
>>
>>
>>Gerd P4 2.4GHz:
>> nothing 3951541892 13.390
>> abs() 1713113360 13.141
>> simple_abs() 1713113360 19.562
>> omid_abs() 1713113360 13.672
>> sbb_abs() 1713113360 17.969
>> cdq_abs() 1713113360 17.625
>> fish_abs() 1713113360 21.750
>> sar_abs() 1713113360 16.984
>> cmovl_abs() 1713113360 16.782
>> cmovs_abs() 1713113360 16.781
>>
>
>This isn't as strange as it might seem. We are trying to time an *extremely*
>small piece of code. The instructions the compiler selects actually execute
>at different speeds on different processors.
>
>With MSVC 7.1, do_nothing compiles to:
>
>; 304 : for (i = 0; i < MAX_ITERATIONS; ++i) {
>; 305 :
>; 306 : // subtract so we get both positive and negative numbers
>; 307 : int a = rand() - 16384;
>
> 00020 e8 00 00 00 00 call _rand
> 00025 4f dec edi
>
>; 308 :
>; 309 : sum += a;
>
> 00026 8d b4 06 00 c0
> ff ff lea esi, DWORD PTR [esi+eax-16384]
> 0002d 75 f1 jne SHORT $L10491
>
>; 310 : }
>
>Here an LEA is used to compute sum + a - 16384 in a single instruction!
>
>while test_abs is just slightly different:
>
>; 25 : for (i = 0; i < MAX_ITERATIONS; ++i) {
>; 26 :
>; 27 : // subtract so we get both positive and negative numbers
>; 28 : int a = rand() - 16384;
>
> 00020 e8 00 00 00 00 call _rand
> 00025 2d 00 40 00 00 sub eax, 16384 ; 00004000H
>
>; 29 :
>; 30 : sum += abs(a);
>
> 0002a 99 cdq
> 0002b 33 c2 xor eax, edx
> 0002d 2b c2 sub eax, edx
> 0002f 03 f0 add esi, eax
> 00031 4f dec edi
> 00032 75 ec jne SHORT $L10356
>
>
>On a P4 the LEA instruction is broken up into several (but unspecified)
>micro-ops. It is not fast - in fact Intel says that it is no longer an
>optimization to use it! On the PIII, of course, it has dedicated hardware...
>
Aha, yes, but that's not the point; see below.
>Except for the CDQ, all the other instructions are in the core RISC set, which
>executes at up to 3 instructions per clock on a P4.
>
>
>So doing something fast *can* be quicker than doing nothing slowly. :-)
At least most of the time ;-)
>
>
>Bo Persson
>bop2@telia.com
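The `cdq` / `xor` / `sub` sequence Bo quotes for `abs()` can be read back into C roughly as below. `cdq` fills edx with copies of eax's sign bit, so the mask is 0 for a non-negative value and -1 (all bits set) for a negative one; `(a ^ mask) - mask` is then `a` in the first case and `~a + 1`, i.e. `-a`, in the second. This is only a sketch: `cdq_style_abs` is a hypothetical name, not necessarily the thread's `cdq_abs` or `sar_abs`, and it assumes that right-shifting a negative `int` is an arithmetic shift, which C leaves implementation-defined but MSVC and GCC on x86 both do.

```c
#include <assert.h>
#include <limits.h>

/* Branchless abs mirroring MSVC's cdq / xor / sub output.
   Assumes >> on a negative int is an arithmetic shift (implementation-defined). */
static int cdq_style_abs(int a)
{
    assert(a != INT_MIN);  /* -INT_MIN is not representable as an int */
    /* cdq: mask = 0 if a >= 0, mask = -1 (all bits set) if a < 0 */
    int mask = a >> (int)(sizeof a * CHAR_BIT - 1);
    /* xor/sub: (a ^ 0) - 0 == a, and (a ^ -1) - (-1) == ~a + 1 == -a */
    return (a ^ mask) - mask;
}
```

Whether a compiler builds the mask with `cdq` or with a `sar` depends on the compiler and target; the C source does not pin that down.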
Hi Bo,
the omid_abs result was the strange one:
Sebastian's "High Media" P4 2GHz:
nothing 3951541892 18.666
abs() 1713113360 19.959
simple_abs() 1713113360 25.487
omid_abs() 1713113360 11.116 !!!!!!!!
sbb_abs() 1713113360 24.365
cdq_abs() 1713113360 24.235
fish_abs() 1713113360 29.522
sar_abs() 1713113360 23.083
cmovl_abs() 1713113360 24.325
cmovs_abs() 1713113360 24.325
And this is from Dieter's P4:
MSVC, Russel's code, -Ox2 -Ob2 -G6 -Gr -GF
nothing 3951541892 13.309
abs() 1713113360 14.400
simple_abs() 1713113360 17.936
omid_abs() 1713113360 7.932 !!! Yes, reproducible
sbb_abs() 1713113360 17.144
cdq_abs() 1713113360 17.555
fish_abs() 1713113360 20.900
sar_abs() 1713113360 16.464
cmovl_abs() 1713113360 17.365
cmovs_abs() 1713113360 17.345
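A minimal sketch of the kind of harness behind these tables, modeled on the loop Bo quotes (`int a = rand() - 16384; sum += ...`). The `srand(1)` seed, the `unsigned` checksum, timing with `clock()` and the iteration count passed in are assumptions; `simple_abs` here is just a straightforward C version, not necessarily the thread's code, and `DEFINE_RUN`, the `run_*` functions and `check_checksums` are hypothetical names. With MSVC's `RAND_MAX` of 32767 the inputs span -16384..16383, and since every variant sees the same input sequence, all abs variants must reproduce the same checksum, as the identical 1713113360 column shows.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Straightforward abs; a compiler may turn this into a branch, cmov or cdq. */
static int simple_abs(int a)
{
    return a < 0 ? -a : a;
}

/* Generates one timing loop per variant so that `expr` can be inlined. */
#define DEFINE_RUN(name, expr)                                     \
    static unsigned name(long iterations, double *seconds)         \
    {                                                              \
        unsigned sum = 0;                                          \
        srand(1); /* same input sequence for every variant */      \
        clock_t t0 = clock();                                      \
        for (long i = 0; i < iterations; ++i) {                    \
            int a = rand() - 16384; /* positive and negative */    \
            sum += (unsigned)(expr);                               \
        }                                                          \
        *seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;        \
        return sum;                                                \
    }

DEFINE_RUN(run_nothing, a)
DEFINE_RUN(run_abs, abs(a))
DEFINE_RUN(run_simple_abs, simple_abs(a))

/* All abs variants must agree on the checksum, whatever their speed. */
static void check_checksums(long iterations)
{
    double t;
    unsigned expected = run_abs(iterations, &t);
    assert(run_simple_abs(iterations, &t) == expected);
}
```

Each variant gets its own macro-generated loop rather than being called through a function pointer, so the compiler can inline the abs expression, as it did with `abs()` in the listing Bo posted.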
http://www.talkchess.com/forums/1/message.html?304949
and the following posts. Do you have an explanation for how adding code can double the
speed?
Regards,
Gerd
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.