Author: Gerd Isenberg
Date: 07:16:23 12/08/03
On December 08, 2003 at 08:28:48, Sven Reichard wrote:

>On December 07, 2003 at 09:24:35, Gerd Isenberg wrote:
>
>>On December 06, 2003 at 18:08:01, Sven Reichard wrote:
>>>
>>Hi Sven,
>>
>>I made a similar experience with Athlon XP and MMX fill algorithms, with four
>>independent instruction chains and up to two MMX instructions per cycle -
>>thus a factor-of-four speedup for pipelined parallel over sequential code.
>>
>>I'm not quite sure what instruction latency exactly is. I guess it is the time
>>to decode plus the time to execute the instruction. I further guess that the
>>pure MMX-ALU execute latency is only one cycle or less (there is also a third
>>store/load unit), and that decode and execution of different instructions
>>happen simultaneously for direct path instructions, as opposed to vector path
>>instructions, which exclusively block all decode and execution units.
>>
>>Gerd
>
>Gerd,
>
>there seems to be no clear definition of latency in the documents, although I
>deduced from context that it is more or less the time an instruction blocks an
>ALU. Also, the description of the FPU/MMX pipeline still isn't clear to me.

Hi Sven,

Hmm, it seems, at least from AMD's 64-bit guide, that execution latency excludes
fetch and decode time. But I guess there is "more" than the pure ALU latency
(one micro-op?). There is still the Instruction Control Unit (ICU), which
controls the centralized in-flight reorder buffer, the integer scheduler, and
the floating-point scheduler (x87, MMX, 3DNow!, SSE, SSE2).

From: Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors

Interpreting Latencies

The Latency column for an instruction entry shows the static execution latency
for the instruction. The static execution latency is the number of clock cycles
it takes to execute the serially dependent sequence of micro-ops that comprise
the instruction. The latencies in this appendix are estimates and are subject
to change.
They assume that:

• The instruction is an L1-cache hit that has already been fetched and decoded,
  with the operations loaded into the scheduler.
• Memory operands are in the L1 data cache.
• There is no contention for execution resources or load-store unit resources.

The instruction control unit can simultaneously dispatch multiple macro-ops from
the reorder buffer to both the integer and floating-point schedulers for final
decode, issue, and execution as micro-ops. In addition, the instruction control
unit handles exceptions and manages the retirement of macro-ops.

>
>In my code I only use Direct Path instructions which can execute in either of
>the two pipelines (no MULs or such). Do you have an idea how far instructions
>should be spread to avoid store/load hazards?
>
>Thanks for your comments,
>Sven.

I found this about the store/load issue in the Athlon 64 optimization manual
(I guess the 32-bit Athlon is similar).

Cheers,
Gerd

-----------------------------------------------------------------------------

2.8 Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be
read back shortly thereafter. For details, see "Store-to-Load Forwarding
Restrictions" on page 123. The AMD Athlon™ 64 and AMD Opteron™ processors
contain hardware to accelerate such store-to-load dependencies, allowing the
load to obtain the store data before it has been written to memory. However, it
is still faster to avoid such dependencies altogether and keep the data in an
internal register.

Avoiding store-to-load dependencies is especially important if they are part of
a long dependency chain, as may occur in a recurrence computation. If the
dependency occurs while operating on arrays, many compilers are unable to
optimize the code in a way that avoids the store-to-load dependency. In some
instances the language definition may prohibit the compiler from using code
transformations that would remove the store-to-load dependency.
Therefore, it is recommended that the programmer remove the dependency
manually, for example, by introducing a temporary variable that can be kept in
a register, as in the following example. This can result in a significant
performance increase.

....

5.4 Store-to-Load Forwarding Restrictions

Store-to-load forwarding refers to the process of a load reading (forwarding)
data from the store buffer. When this can occur, it improves performance
because the load does not have to wait for the recently written (stored) data
to be written to cache and then read back out again. There are instances in the
load-store architecture of the AMD Athlon 64 and AMD Opteron processors when a
load operation is not allowed to read needed data from a store in the store
buffer. In these cases, the load cannot complete (load the needed data into a
register) until the store has retired out of the store buffer and written to
the data cache. A store-buffer entry cannot retire and write to the data cache
until every instruction before the store has completed and retired from the
reorder buffer.

The implication of this restriction is that all instructions in the reorder
buffer, up to and including the store, must complete and retire out of the
reorder buffer before the load can complete. Effectively, the load has a false
dependency on every instruction up to the store. Due to the significant depth
of the LS buffer of the AMD Athlon 64 and AMD Opteron processors, any load that
is dependent on a store that cannot bypass data through the LS buffer may
experience significant delays of up to tens of clock cycles, where the exact
delay is a function of pipeline conditions.

The following sections describe store-to-load forwarding examples.

Store-to-Load Forwarding Pitfalls - True Dependencies

A load is allowed to read data from the store-buffer entry only if all of the
following conditions are satisfied:

• The start address of the load matches the start address of the store.
• The load operand size is equal to or smaller than the store operand size.
• Neither the load nor the store is misaligned.
• The store data is not from a high-byte register (AH, BH, CH, or DH).

The following sections describe common-case scenarios to avoid. In these
scenarios, a load has a true dependency on an LS2-buffered store, but cannot
read (forward) data from a store-buffer entry.

....

One Supported Store-to-Load Forwarding Case

There is one case of a mismatched store-to-load forwarding that is supported by
AMD Athlon 64 and AMD Opteron processors. The lower 32 bits from an aligned
quadword write feeding into a doubleword read is allowed, as illustrated in the
following example:

movq [alignedQword], mm0
...
mov  eax, [alignedQword]

9.4 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers

Optimization

Avoid moving data directly between general-purpose registers and MMX™
registers; this operation requires the use of the MOVD instruction. If it's
absolutely necessary to move data between these two types of registers, use
separate store and load instructions to move the data from the source register
to a temporary location in memory and then from memory into the destination
register, separating the store and the load by at least 10 instructions.

...

Rationale

The register-to-register forms of the MOVD instruction are either VectorPath or
DirectPath Double instructions. When compared with DirectPath Single
instructions, VectorPath and DirectPath Double instructions have comparatively
longer execution latencies. In addition, VectorPath instructions prevent the
processor from simultaneously decoding other instructions.
Example

Avoid code like this, which copies a value directly from an MMX register to a
general-purpose register:

movd eax, mm2

If it's absolutely necessary to copy a value from an MMX register to a
general-purpose register (or vice versa), use separate store and load
instructions, separating them by at least 10 instructions:

movd DWORD PTR temp, mm2   ; Store the value in memory.
...                        ; At least 10 other instructions appear here.
mov  eax, DWORD PTR temp   ; Load the value from memory.
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.