Author: Dann Corbit
Date: 13:54:05 02/21/03
On February 21, 2003 at 10:56:50, Dezhi Zhao wrote:

>On February 21, 2003 at 00:33:33, Eugene Nalimov wrote:
>
>>I believe you can slightly speedup the loop without unrolling the loop or
>>changing the algorithm -- actually code size would be exactly the same:
>>
>>        mov     BL,1                    // work variable
>>        mov     EDX,...                 // pointer to move list table
>>        mov     ESI,-1                  // pointer to highest value
>>        jmp     loop
>>
>>better:
>>        mov     BL,CL                   // value=move_value[x]
>>        mov     ESI,EDX                 // y=x
>>
>>loop:   mov     CL,move_value[EDX]      // CL = move_value[x]
>>        inc     EDX                     // x++
>>
>>        cmp     CL,BL                   // if (move_value[x] <= value)
>>        jbe     loop
>>
>>        cmp     CL,0FFh                 // if (move_value[x]==255)
>>        jne     better
>>
>>Thanks,
>>Eugene
>>
>
>inc edx should be replaced by add edx, 1
>inc is slow on P4.

From this document:

Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization Reference Manual
Issued in U.S.A.
Order Number: 248966-007

which provides these definitions:

Definitions

The IA-32 instruction performance data are listed in several tables. The tables contain the following information:

Instruction Name: The assembly mnemonic of each instruction.

Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the µops that form a IA-32 instruction.

Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the throughput of an instruction can be significantly less than its latency.

Execution units: The names of the execution units in the execution core that are utilized to execute the µops for each instruction. This information is provided only for IA-32 instructions that are decoded into no more than 4 µops. µops for instructions that decode into more than 4 µops are supplied by microcode ROM. Note that several execution units may share the same port, such as FP_ADD, FP_MUL, or MMX_SHFT in the FP_EXECUTE cluster (see Figure 1-4).

Latency and Throughput

This section presents the latency and throughput information for the IA-32 instruction set including the Streaming SIMD Extensions 2, Streaming SIMD Extensions, MMX technology, and most of the frequently used general-purpose integer and x87 floating-point instructions.

Due to the complexity of dynamic execution and out-of-order nature of the execution core, the instruction latency data may not be sufficient to accurately predict realistic performance of actual code sequences based on adding instruction latency data.

• The instruction latency data are only meant to provide a relative comparison of instruction-level performance of IA-32 instructions based on the Intel NetBurst micro-architecture.
• All numeric data in the tables are:
  - approximate and are subject to change in future implementations of the Intel NetBurst micro-architecture.
  - not meant to be used as reference numbers for comparisons of instruction-level performance benchmarks. Comparison of instruction-level performance of microprocessors that are based on different micro-architecture is a complex subject that requires additional information that is beyond the scope of this manual.

Comparisons of latency and throughput data between the Pentium 4 processor and the Pentium III processor can be misleading, because one cycle in the Pentium 4 processor is NOT equal to one cycle in the Pentium III processor. The Pentium 4 processor is designed to operate at higher clock frequencies than the Pentium III processor.

Many IA-32 instructions can operate with either registers as their operands or with a combination of register/memory address as their operands. The performance of a given instruction between these two types is different.

The section that follows, “Latency and Throughput with Register Operands”, gives the latency and throughput data for the register-to-register instruction type.
Section “Latency and Throughput with Memory Operands” discusses how to adjust latency and throughput specifications for the register-to-memory and memory-to-register instructions.

In some cases, the latency or throughput figures given are just one half of a clock. This occurs only for the double-speed ALUs.

We find Table C-7, IA-32 General Purpose Instructions, which contains the following timings:

Instruction         Latency¹   Throughput   Execution Unit²
ADC/SBB reg, imm    6          2            ALU
DEC/INC             1          0.5          ALU
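
For anyone who would rather read the loop in C than in assembly, here is a rough sketch of what Eugene's listing appears to do, going by the C-style comments in it. The function name find_best_move and the use of an array index for y are my assumptions for illustration only. Note that the listing copies EDX into ESI after the increment, so the value it keeps differs by one from the index returned here. Whether a compiler turns x += 1 into inc or add is its own decision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the thread: scan a list of move values
 * terminated by the sentinel 255 and return the index of the highest
 * value seen, or -1 if no value exceeds the initial threshold of 1. */
int find_best_move(const unsigned char *move_value)
{
    assert(move_value != NULL);

    unsigned char value = 1;   /* work variable (BL in the listing)   */
    int y = -1;                /* position of the highest value (ESI) */

    for (int x = 0; ; x += 1) {             /* x++ (inc EDX)          */
        unsigned char v = move_value[x];    /* CL = move_value[x]     */

        if (v <= value)                     /* not better: keep going */
            continue;
        if (v == 255)                       /* sentinel: end of list  */
            break;

        value = v;                          /* value = move_value[x]  */
        y = x;                              /* y = x                  */
    }
    return y;
}
```

The sentinel does double duty: since 255 is larger than any real value, it always falls through the first comparison, so the end-of-list test only runs when a candidate looks better. That is why the common path needs just one compare and branch.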
Last modified: Thu, 15 Apr 21 08:11:13 -0700