Computer Chess Club Archives

Subject: Re: Programmer challenge

Author: Dann Corbit
Date: 13:54:05 02/21/03
On February 21, 2003 at 10:56:50, Dezhi Zhao wrote:

>On February 21, 2003 at 00:33:33, Eugene Nalimov wrote:
>
>>I believe you can slightly speed up the loop without unrolling it or
>>changing the algorithm -- actually, code size would be exactly the same:
>>
>>        mov     BL,1               // work variable
>>        mov     EDX,...            // pointer to move list table
>>        mov     ESI,-1             // pointer to highest value
>>        jmp     loop
>>
>>better:
>>        mov     BL,CL              // value=move_value[x]
>>        mov     ESI,EDX            // y=x
>>
>>loop:   mov     CL,move_value[EDX] // CL = move_value[x]
>>        inc     EDX                // x++
>>
>>        cmp     CL,BL              // if (move_value[x] <= value)
>>        jbe     loop
>>
>>        cmp     CL,0FFh            // if (move_value[x]==255)
>>        jne     better
>>
>>Thanks,
>>Eugene
>>
>
>inc edx should be replaced by add edx, 1
>inc is slow on P4.
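
For readers who do not follow x86 assembly, here is a rough C rendering of what Eugene's loop appears to do. It is only a sketch under assumptions: the function name find_best_move and the int index are made up for illustration, the list is assumed to end with a 255 sentinel (as the cmp CL,0FFh suggests), and the sketch returns the index of the best entry itself, whereas the assembly copies EDX into ESI after the increment and so leaves ESI one past it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the thread: scan a list of move values
 * terminated by a 255 sentinel and return the index of the first entry
 * holding the highest value greater than 1, or -1 if there is none. */
int find_best_move(const unsigned char *move_value)
{
    assert(move_value != NULL);

    unsigned char value = 1;   /* BL: best value seen so far (starts at 1) */
    int best = -1;             /* ESI: position of the best entry, -1 = none */

    for (int x = 0; ; x++) {   /* EDX: current position in the list */
        unsigned char v = move_value[x];   /* CL = move_value[x] */

        if (v <= value)        /* cmp CL,BL / jbe loop: not an improvement */
            continue;
        if (v == 0xFF)         /* cmp CL,0FFh: sentinel reached, stop */
            break;

        value = v;             /* better: remember the new best value */
        best = x;              /* and where it was found */
    }
    return best;
}
```

Because the comparison is "less than or equal", the first of several equal best values wins. Note also that entries of 0 or 1 are never selected, since the work variable starts at 1.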


From this document:
Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization Reference Manual
Issued in U.S.A.
Order Number: 248966-007
which provides these definitions:

Definitions
The IA-32 instruction performance data are listed in several tables. The
tables contain the following information:

Instruction Name: The assembly mnemonic of each instruction.

Latency: The number of clock cycles that are required for the execution core
to complete the execution of all of the µops that form a IA-32 instruction.

Throughput: The number of clock cycles required to wait before the issue
ports are free to accept the same instruction again. For many IA-32
instructions, the throughput of an instruction can be significantly less
than its latency.

Execution units: The names of the execution units in the execution core that
are utilized to execute the µops for each instruction. This information is
provided only for IA-32 instructions that are decoded into no more than 4
µops. µops for instructions that decode into more than 4 µops are supplied
by microcode ROM. Note that several execution units may share the same port,
such as FP_ADD, FP_MUL, or MMX_SHFT in the FP_EXECUTE cluster (see Figure
1-4).

Latency and Throughput
This section presents the latency and throughput information for the IA-32
instruction set including the Streaming SIMD Extensions 2, Streaming SIMD
Extensions, MMX technology, and most of the frequently used general-purpose
integer and x87 floating-point instructions.

Due to the complexity of dynamic execution and out-of-order nature of the
execution core, the instruction latency data may not be sufficient to
accurately predict realistic performance of actual code sequences based on
adding instruction latency data.
• The instruction latency data are only meant to provide a relative
  comparison of instruction-level performance of IA-32 instructions based on
  the Intel NetBurst micro-architecture.
• All numeric data in the tables are:
  - approximate and are subject to change in future implementations of the
    Intel NetBurst micro-architecture.
  - not meant to be used as reference numbers for comparisons of
    instruction-level performance benchmarks. Comparison of instruction-level
    performance of microprocessors that are based on different
    micro-architecture is a complex subject that requires additional
    information that is beyond the scope of this manual.

Comparisons of latency and throughput data between the Pentium 4 processor
and the Pentium III processor can be misleading, because one cycle in the
Pentium 4 processor is NOT equal to one cycle in the Pentium III processor.
The Pentium 4 processor is designed to operate at higher clock frequencies
than the Pentium III processor.

Many IA-32 instructions can operate with either registers as their operands
or with a combination of register/memory address as their operands. The
performance of a given instruction between these two types is different.

The section that follows, “Latency and Throughput with Register Operands”,
gives the latency and throughput data for the register-to-register
instruction type. Section “Latency and Throughput with Memory Operands”
discusses how to adjust latency and throughput specifications for the
register-to-memory and memory-to-register instructions.

In some cases, the latency or throughput figures given are just one half of
a clock. This occurs only for the double-speed ALUs.

There we find Table C-7, IA-32 General Purpose Instructions,

which contains the following timings:

Instruction          Latency  Throughput  Execution Unit
ADC/SBB reg, imm     6        2           ALU
DEC/INC              1        0.5         ALU