Computer Chess Club Archives


Subject: Re: Architecture question (Athlon - MMX)

Author: Gerd Isenberg

Date: 07:16:23 12/08/03


On December 08, 2003 at 08:28:48, Sven Reichard wrote:

>On December 07, 2003 at 09:24:35, Gerd Isenberg wrote:
>
>>On December 06, 2003 at 18:08:01, Sven Reichard wrote:
>>>
>>Hi Sven,
>>
>>I had a similar experience with Athlon XP and MMX fill algorithms: with four
>>independent instruction chains, up to two MMX instructions execute per cycle,
>>giving a factor-of-four speedup for the pipelined parallel version over the
>>sequential one.
>>
>>I'm not quite sure what instruction latency exactly means. I guess it is the
>>time to decode plus the time to execute the instruction. I further guess that
>>the pure MMX-ALU execution latency is only one cycle or less (there is also a
>>third store/load unit), and that decode and execution of different
>>instructions happen simultaneously for direct path instructions, as opposed
>>to vector path instructions, which occupy all decode and execution units
>>exclusively.
>>
>>Gerd
>
>Gerd,
>
>there seems to be no clear definition of latency in the documents, although I
>deduced from context that it is more or less the time an instruction blocks an
>ALU. Also, the description of the FPU/MMX pipeline still isn't clear to me.

Hi Sven,

Hmm, at least from AMD's 64-bit guide it seems that execution latency excludes
fetch and decode time. But I guess there is "more" than the pure ALU latency
(one micro-op?). There is still the Instruction Control Unit (ICU), which
controls the centralized in-flight reorder buffer, the integer scheduler, and
the floating-point scheduler (x87, MMX, 3DNow!, SSE, SSE2).

from:

Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors

Interpreting Latencies

The Latency column for an instruction entry shows the static execution latency
for the instruction. The static execution latency is the number of clock cycles
it takes to execute the serially dependent sequence of micro-ops that comprise
the instruction. The latencies in this appendix are estimates and are subject to
change. They assume that:
• The instruction is an L1-cache hit that has already been fetched and decoded,
with the operations loaded into the scheduler.
• Memory operands are assumed to be in the L1 data cache.
• There is no contention for execution resources or load-store unit resources.

The instruction control unit can simultaneously dispatch multiple macro-ops from
the reorder buffer to both the integer and floating-point schedulers for final
decode, issue, and execution as micro-ops. In addition, the instruction control
unit handles exceptions and manages the retirement of macro-ops.
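
Just to illustrate the point about latency versus independent chains, here is a
little sketch of mine in C with MMX intrinsics - not from the guide, and the
fill step, names and chain count are made up for illustration. The static
execution latency only hurts along a serially dependent chain, so interleaving
four independent chains lets the two MMX ALUs overlap it:

#include <mmintrin.h>   /* MMX intrinsics: __m64, _mm_or_si64, ... */

/* serial: every operation waits for the previous result,
   so the static execution latency adds up along the chain */
__m64 fill_serial(__m64 g, __m64 p, int steps)
{
    for (int i = 0; i < steps; ++i)
        g = _mm_or_si64(g, _mm_and_si64(_mm_slli_si64(g, 8), p));
    return g;
}

/* interleaved: four independent chains, so the scheduler can hide
   the latency of one chain behind work from the other three */
void fill_parallel(__m64 g[4], __m64 p[4], int steps)
{
    for (int i = 0; i < steps; ++i) {
        g[0] = _mm_or_si64(g[0], _mm_and_si64(_mm_slli_si64(g[0], 8), p[0]));
        g[1] = _mm_or_si64(g[1], _mm_and_si64(_mm_slli_si64(g[1], 8), p[1]));
        g[2] = _mm_or_si64(g[2], _mm_and_si64(_mm_slli_si64(g[2], 8), p[2]));
        g[3] = _mm_or_si64(g[3], _mm_and_si64(_mm_slli_si64(g[3], 8), p[3]));
    }
    _mm_empty();  /* clear the MMX state before any x87 code runs */
}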

>
>In my code I only use Direct Path instructions which can execute in either of
>the two pipelines (no MUL's or such). Do you have an idea how far instructions
>should be spread to avoid store/load hazards?
>
>Thanks for your comments,
>Sven.

I found this about the store/load issue in the Athlon-64 optimization manual
(I guess the Athlon-32 is similar):

Cheers,
Gerd


-----------------------------------------------------------------------------
2.8 Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read
back shortly thereafter. For details, see “Store-to-Load Forwarding
Restrictions” on page 123. The AMD Athlon™ 64 and AMD Opteron™ processors
contain hardware to accelerate such store-to-load dependencies, allowing the
load to obtain the store data before it has been written to memory.

However, it is still faster to avoid such dependencies altogether and keep the
data in an internal register. Avoiding store-to-load dependencies is especially
important if they are part of a long dependency chain, as may occur in a
recurrence computation. If the dependency occurs while operating on arrays,
many compilers are unable to optimize the code in a way that avoids the
store-to-load dependency. In some instances the language definition may prohibit
the compiler from using code transformations that would remove the store-to-load
dependency. Therefore, it is recommended that the programmer remove the
dependency manually, for example, by introducing a temporary variable that can
be kept in a register, as in the following example. This can result in a
significant performance increase.
....
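
The guide's own example is snipped above; what follows is only my sketch of the
kind of transformation it means (array names and length are mine): keep the
running value of the recurrence in a register instead of re-loading what was
stored one iteration earlier.

#define VECLEN 256
double x[VECLEN], y[VECLEN];

void recurrence_slow(void)
{
    /* every iteration loads x[k-1], which was stored only one iteration
       earlier: a store-to-load dependency on each trip through the loop */
    for (int k = 1; k < VECLEN; ++k)
        x[k] = x[k-1] + y[k];
}

void recurrence_fast(void)
{
    /* the running value lives in the temporary t (a register), so the
       load of the just-stored x[k-1] disappears */
    double t = x[0];
    for (int k = 1; k < VECLEN; ++k) {
        t += y[k];
        x[k] = t;
    }
}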

5.4 Store-to-Load Forwarding Restrictions

Store-to-load forwarding refers to the process of a load reading (forwarding)
data from the store buffer. When this can occur, it improves performance because
the load does not have to wait for the recently written (stored) data to be
written to cache and then read back out again. There are instances
in the load-store architecture of the AMD Athlon 64 and AMD Opteron processors
when a load operation is not allowed to read needed data from a store in the
store buffer.

In these cases, the load cannot complete (load the needed data into a register)
until the store has retired out of the store buffer and written to the data
cache. A store-buffer entry cannot retire and write to the data cache until
every instruction before the store has completed and retired from the reorder
buffer. The implication of this restriction is that all instructions in the
reorder buffer, up to and including the store, must complete and retire out of
the reorder buffer before the load can complete. Effectively, the load has a
false dependency on every instruction up to the store. Due to the significant
depth of the LS buffer of the AMD Athlon 64 and AMD Opteron processors, any load
that is dependent on a store that cannot bypass data through the LS buffer may
experience significant delays of up to tens of clock cycles, where the exact
delay is a function of pipeline conditions.

The following sections describe store-to-load forwarding examples.

Store-to-Load Forwarding Pitfalls—True Dependencies

A load is allowed to read data from the store-buffer entry only if all of the
following conditions are satisfied:
• The start address of the load matches the start address of the store.
• The load operand size is equal to or smaller than the store operand size.
• Neither the load nor the store is misaligned.
• The store data is not from a high-byte register (AH, BH, CH, or DH).


The following sections describe common-case scenarios to avoid. In these
scenarios, a load has a true dependency on an LS2-buffered store, but cannot
read (forward) data from a store-buffer entry.

....
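
Not one of the guide's snipped scenarios, just an illustration of mine of the
rules above: a load that is wider than the store, or that starts at a different
address, cannot forward. Writing one 16-bit half of a word and then reading the
whole 32-bit word violates both conditions, so the load has to wait until the
store retires to the data cache.

#include <stdint.h>

union pair {
    uint32_t whole;     /* the full 32-bit word  */
    uint16_t half[2];   /* its two 16-bit halves */
};

uint32_t set_high_half(union pair *p, uint16_t v)
{
    p->half[1] = v;     /* 16-bit store at offset 2 */
    return p->whole;    /* 32-bit load at offset 0: wider than the store and
                           with a different start address, so it cannot read
                           the data from the store buffer */
}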

One Supported Store-to-Load Forwarding Case

There is one case of a mismatched store-to-load forwarding that is supported by
AMD Athlon 64 and AMD Opteron processors. The lower 32 bits from an aligned
quadword write feeding into a doubleword read is allowed, as illustrated in the
following example:

movq [alignedQword], mm0
...
mov eax, [alignedQword]


9.4 Avoid Moving Data Directly Between General-Purpose and MMX™ Registers

Optimization

Avoid moving data directly between general-purpose registers and MMX™ registers;
this operation requires the use of the MOVD instruction. If it’s absolutely
necessary to move data between these two types of registers, use separate store
and load instructions to move the data from the source register to a temporary
location in memory and then from memory into the destination register,
separating the store and the load by at least 10 instructions.

...

Rationale

The register-to-register forms of the MOVD instruction are either VectorPath or
DirectPath Double instructions. When compared with DirectPath Single
instructions, VectorPath and DirectPath Double instructions have comparatively
longer execution latencies. In addition, VectorPath instructions prevent the
processor from simultaneously decoding other instructions.

Example
Avoid code like this, which copies a value directly from an MMX register to a
general-purpose register:

movd eax, mm2

If it’s absolutely necessary to copy a value from an MMX register to a
general-purpose register (or vice versa), use separate store and load
instructions, separating them by at least 10 instructions:

movd DWORD PTR temp, mm2 ; Store the value in memory.
...
; At least 10 other instructions appear here.
...
mov eax, DWORD PTR temp ; Load the value from memory.





