Computer Chess Club Archives

Subject: Re: more resullts

Author: Bo Persson

Date: 06:34:58 07/07/03

On July 07, 2003 at 08:48:39, Gerd Isenberg wrote:

>at least one colleague has the same strange effect than Dieter:
>
>Gerd
>
>
>Gerd P4 2.4GHz:
>       nothing 3951541892 13.390
>         abs() 1713113360 13.141
>  simple_abs() 1713113360 19.562
>    omid_abs() 1713113360 13.672
>     sbb_abs() 1713113360 17.969
>     cdq_abs() 1713113360 17.625
>    fish_abs() 1713113360 21.750
>     sar_abs() 1713113360 16.984
>   cmovl_abs() 1713113360 16.782
>   cmovs_abs() 1713113360 16.781
>

This isn't as strange as it might seem. We are trying to time an *extremely*
small piece of code. The instructions selected by the compiler actually execute
at different speeds on different processors.

I have MSVC 7.1 where do_nothing results in:

; 304  :     for (i = 0; i < MAX_ITERATIONS; ++i) {
; 305  :
; 306  :         // subtract so we get both positive and negative numbers
; 307  :         int a = rand() - 16384;

  00020	e8 00 00 00 00	 call	 _rand
  00025	4f		 dec	 edi

; 308  :
; 309  :         sum += a;

  00026	8d b4 06 00 c0
	ff ff		 lea	 esi, DWORD PTR [esi+eax-16384]
  0002d	75 f1		 jne	 SHORT $L10491

; 310  :     }

Here an LEA is used to compute sum + rand() - 16384 in a single instruction!
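
To make the folding explicit, here is a rough C sketch of what the compiler did.
The function names are mine, not from the test program. I assume an unsigned
sum (so wraparound is well defined) and mask rand() to 15 bits to mimic the
0..32767 range of MSVC's rand(); both are assumptions for illustration only.

```c
#include <stdlib.h>

/* Hypothetical helper: the loop as written in the source,
   with a separate temporary for "a". */
static unsigned sum_with_temp(unsigned seed, int iterations)
{
    unsigned sum = 0;
    int i;
    srand(seed);
    for (i = 0; i < iterations; ++i) {
        /* assumption: mask to 15 bits to match MSVC's RAND_MAX of 32767 */
        int a = (rand() & 0x7fff) - 16384;
        sum += (unsigned)a;
    }
    return sum;
}

/* Hypothetical helper: the same loop with the subtraction folded into
   the accumulation, which is what the single LEA does
   (esi = esi + eax - 16384). */
static unsigned sum_folded(unsigned seed, int iterations)
{
    unsigned sum = 0;
    int i;
    srand(seed);
    for (i = 0; i < iterations; ++i) {
        sum = sum + (unsigned)(rand() & 0x7fff) - 16384u;
    }
    return sum;
}
```

Both functions return the same value for the same seed; only the instruction
selection differs, and that is exactly what the timing ends up measuring.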

while test_abs is just slightly different:

; 25   :     for (i = 0; i < MAX_ITERATIONS; ++i) {
; 26   :
; 27   :         // subtract so we get both positive and negative numbers
; 28   :         int a = rand() - 16384;

  00020	e8 00 00 00 00	 call	 _rand
  00025	2d 00 40 00 00	 sub	 eax, 16384		; 00004000H

; 29   :
; 30   :         sum += abs(a);

  0002a	99		 cdq
  0002b	33 c2		 xor	 eax, edx
  0002d	2b c2		 sub	 eax, edx
  0002f	03 f0		 add	 esi, eax
  00031	4f		 dec	 edi
  00032	75 ec		 jne	 SHORT $L10356
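
For reference, the cdq / xor / sub sequence above corresponds roughly to the
following C. This is only a sketch: the function name is made up, and it
assumes that right-shifting a negative int is an arithmetic shift, which C
leaves implementation-defined (it holds for MSVC and the usual x86 compilers).

```c
#include <limits.h>

/* Hypothetical name; mirrors the three instructions in the listing. */
static int abs_via_sign_mask(int a)
{
    /* cdq: copy the sign bit of a into every bit of mask,
       giving 0 for a >= 0 and -1 for a < 0
       (assumes an arithmetic right shift) */
    int mask = a >> (sizeof(int) * CHAR_BIT - 1);

    /* xor eax, edx: flips all bits of a when a is negative
       sub eax, edx: then adds 1, completing the two's complement negation */
    return (a ^ mask) - mask;
}
```

Like abs() itself, this is not usable for INT_MIN, since the result does not
fit in an int. The test values here stay within -16384..16383, so that case
never occurs.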

On a P4 the LEA instruction is broken up into several micro-ops (how many is
not specified). It is not fast; in fact Intel says that using it is no longer
an optimization! On the PIII, of course, it has dedicated hardware...

Except for the CDQ, all the other instructions are in the core RISC set, which
executes at up to 3 instructions per clock on a P4.

So doing something fast *can* be quicker than doing nothing slowly. :-)

Bo Persson
bop2@telia.com


Last modified: Thu, 15 Apr 21 08:11:13 -0700
