Author: Gerd Isenberg
Date: 11:07:04 09/28/05
>>>You said it! They keep this bad tradition. The problem is shift operations are
>>>slow on P4 class processors. Shift is still slow on x86-64, right?
>>
>>No, best case on x86-64, direct path, 1 cycle latency with 8/16/32/64-bit
>>registers, regardless of the number of immediate or variable (cl) shifts.
>>And of course a huge win for 64-bit shifts!
>
>1 cycle for qword shift on AMD CPU?! This is a really good surprise to me!
>How about the shift operations on Intel x64? I don't think a P4 with x64
>extension can do one cycle shift on an integer of any size.

P4 has four cycles latency for shifts (1 cycle throughput). No idea about the x86-64 clone; I guess Centrino (based on PIII) is better.

Some notes from the Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual:

--------------------------------------------------------------------------
C. IA-32 Instruction Latency and Throughput

* Minimize the latency of dependence chains that are on the critical path. For example, an operation to shift left by two bits executes faster when encoded as two adds than when it is encoded as a shift. If latency is not an issue, the shift results in a denser byte encoding.

Definitions

The IA-32 instruction performance data are listed in several tables. The tables contain the following information:

Instruction Name: The assembly mnemonic of each instruction.

Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the µops that form an IA-32 instruction.

Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the throughput of an instruction can be significantly less than its latency.

...

Comparisons of latency and throughput data between the Pentium 4 processor and the Pentium III processor can be misleading, because one cycle in the Pentium 4 processor is NOT equal to one cycle in the Pentium III processor. The Pentium 4 processor is designed to operate at higher clock frequencies than the Pentium III processor.

Many IA-32 instructions can operate with either registers as their operands or with a combination of register/memory address as their operands. The performance of a given instruction between these two types is different.

C-13

Instruction        Latency   Throughput
SAL/SAR/SHL/SHR    4         1

Table Footnotes

1. Latency information for many of instructions that are complex (> 4 µops) are estimates based on conservative and worst-case estimates. Actual performance of these instructions by the out-of-order core execution unit can range from somewhat faster to significantly faster than the nominal latency data shown in these tables.
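
The manual's remark about a shift left by two bits being replaceable by two adds can be sketched in C roughly as follows. This is only an illustration of the arithmetic identity: the function names shl2_shift and shl2_adds are made up for this post, and a C compiler is free to emit either encoding for either function, so inline assembly or a look at the generated code would be needed to really control what the CPU executes.

```c
#include <stdint.h>

/* Shift left by two, written as a shift (hypothetical helper name). */
static uint64_t shl2_shift(uint64_t x)
{
    return x << 2;
}

/* Same result written as two dependent adds (hypothetical helper name):
   x + x = 2x, then 2x + 2x = 4x. Unsigned arithmetic wraps modulo 2^64,
   so bits shifted out at the top are dropped exactly as with the shift. */
static uint64_t shl2_adds(uint64_t x)
{
    x += x;
    x += x;
    return x;
}
```

Both functions return the same value for every 64-bit input, including values whose top two bits are set. Which form is faster depends on the processor; the excerpt above only makes the claim for the Pentium 4.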
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.