Computer Chess Club Archives


Subject: Re: How to set up a 64 bit console project on VS2005?

Author: Gerd Isenberg

Date: 11:07:04 09/28/05


>>>You said it! They keep this bad tradition. The problem is shift operations are
>>>slow on P4 class processors. Shift is still slow on x86-64, right?
>>
>>No, best case on x86-64, direct path, 1 cycle latency with 8/16/32/64-bit
>>registers, regardless of the number of immediate or variable (cl) shifts.
>>And of course a huge win for 64-bit shifts!
>
>1 cycle for qword shift on AMD CPU?! This is a really good surprise to me!
>How about the shift operations on Intel x64? I don't think a P4 with x64
>extension can do one cycle shift on an integer of any size.


The P4 has four cycles of latency for shifts (one cycle throughput). I have no
idea about the x86-64 clone, but I guess Centrino (based on PIII) is better.

Some notes from the

Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual:

--------------------------------------------------------------------------
C. IA-32 Instruction Latency and Throughput

* Minimize the latency of dependence chains that are on the critical path. For
example, an operation to shift left by two bits executes faster when encoded as
two adds than when it is encoded as a shift. If latency is not an issue, the
shift results in a denser byte encoding.
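
A minimal C sketch of the two encodings the manual compares. The function names times4_shift and times4_adds are made up for illustration, and an optimizing compiler is free to turn either form into the other, so this only shows the equivalence, not what the generated code will look like:

```c
#include <stdint.h>

/* Shift left by two bits: one instruction, denser encoding. */
static uint64_t times4_shift(uint64_t x)
{
    return x << 2;
}

/* Same result as two dependent adds: x + x = 2x, then 2x + 2x = 4x.
 * Unsigned arithmetic wraps modulo 2^64, exactly like the shift. */
static uint64_t times4_adds(uint64_t x)
{
    uint64_t t = x + x;
    return t + t;
}
```

Whether the add form is actually faster depends on the processor; the manual's remark applies to the P4, where a shift has higher latency than an add.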

Definitions
The IA-32 instruction performance data are listed in several tables. The tables
contain the following information:

Instruction Name:
The assembly mnemonic of each instruction.

Latency:
The number of clock cycles that are required for the execution core
to complete the execution of all of the µops that form an IA-32
instruction.

Throughput:
The number of clock cycles required to wait before the issue ports
are free to accept the same instruction again. For many IA-32
instructions, the throughput of an instruction can be significantly less
than its latency.
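
To make the difference concrete, here is a rough back-of-the-envelope model in C. It is a simplification of my own (it ignores decode limits, port contention and everything else in the pipeline), and the function names are hypothetical: a chain of n instructions that each need the previous result pays the full latency every time, while n independent instructions can be issued one throughput interval apart.

```c
/* Each instruction waits for the result of the one before it. */
static unsigned dependent_chain_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}

/* No instruction depends on another: a new one can issue every
 * `throughput` cycles, and the last one still needs `latency`
 * cycles to complete. */
static unsigned independent_cycles(unsigned n, unsigned latency,
                                   unsigned throughput)
{
    if (n == 0)
        return 0;
    return latency + (n - 1) * throughput;
}
```

With the P4 shift numbers (latency 4, throughput 1), eight dependent shifts come out at 32 cycles in this model, eight independent ones at 11.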

...

Comparisons of latency and throughput data between the Pentium 4 processor and
the Pentium III processor can be misleading, because one cycle in the Pentium 4
processor is NOT equal to one cycle in the Pentium III processor. The Pentium 4
processor is designed to operate at higher clock frequencies than the Pentium
III processor. Many IA-32 instructions can operate with either registers as
their operands or with a combination of register/memory address as their
operands. The performance of a given instruction between these two types is
different.

C-13

Instruction        Latency  Throughput
SAL/SAR/SHL/SHR    4        1

Table Footnotes

1. Latency information for many of the instructions that are complex (> 4 µops) are
estimates based on conservative and worst-case estimates. Actual performance of
these instructions by the out-of-order core execution unit can range from
somewhat faster to significantly faster than the nominal latency data shown in
these tables.



Last modified: Thu, 15 Apr 21 08:11:13 -0700
