Computer Chess Club Archives


Subject: pipelining example

Author: Anthony Cozzie

Date: 09:23:59 12/08/03



>Thanks Anthony,
>
>I found even more information on the floating-point pipeline stages on page 220ff.
>I had missed that information in the Athlon-64/Opteron guide :-)
>
>I guess that by "parts" you mean these pipelined execution stages 7..15, or
>only 12..15?

There is some hunk of digital logic that performs the operation: a chain of
gates.  AMD figured out the critical path, broke it in two, and made each
section a pipeline stage.  Back in the ancient 386 days there was no pipeline:
fetch+decode+ALU+cache+store all happened in one clock.
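The payoff of cutting the chain is a shorter clock period. A back-of-the-envelope sketch (the delay and latch-overhead numbers here are made up for illustration, not AMD's):

```python
# Splitting a combinational chain into N stages shortens the clock period
# to roughly (chain delay / N) plus the flip-flop overhead added per stage.

def clock_period(chain_delay_ns, stages, latch_ns=0.1):
    """Clock period (ns) once the chain is cut into `stages` equal pieces."""
    return chain_delay_ns / stages + latch_ns

# Assumed: a 2.0 ns chain of gates, 0.1 ns of latch overhead per stage.
for s in (1, 2, 4):
    p = clock_period(2.0, s)
    print(f"{s} stage(s): {p:.2f} ns/cycle -> {1/p:.2f} GHz")
```

The latch overhead is why you can't pipeline forever: each extra cut buys less clock speed.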

>Does the 2-cycle MMX latency include the whole sequence of stages 7..15 (some
>skipped), or only 12..15?

ALU only.

>What is the maximum throughput of two-cycle MMX DirectPath (FADD/FMUL)
>instructions, with the others independently scheduled?
>I had the impression that the maximum throughput is < 0.5, not one.
>
>Gerd

1/cycle.

I had to do this a lot in 18-347 :)

Let's suppose we have a mythical "single-pipelined Athlon" that has only one
floating-point pipe and executes instructions in order, and our instruction
list is:

pand   (pa)
pxor   (px)
pandn  (pn)
por    (po)

and the operands are such that we have no data dependencies, no cache misses,
etc. ("the good case", if you will).  Of course, most of the pipeline stages
make no sense here (we don't need a rename file for a non-superscalar
processor), but this keeps things simple :)

So, if we start the count from when the instructions have already been decoded
and sent to the floating point section, the processor looks like this:

cycle  SM   RR   SS   FPU_RF   ALU1   ALU2
   1   pa
   2   px   pa
   3   pn   px   pa
   4   po   pn   px    pa
   5        po   pn    px       pa
   6             po    pn       px    pa
   7                   po       pn    px
   8                            po    pn
   9                                  po

so you can see that the processor can finish 1 instruction/cycle.
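The cycle count in that table generalizes: with S pipeline stages and N independent instructions, the last one drains out on cycle N + S - 1, so throughput approaches one instruction per cycle as N grows. A quick sanity check (my own sketch, using the six stages shown above):

```python
STAGES = 6  # SM, RR, SS, FPU_RF, ALU1, ALU2, as in the table above

def last_finish(n_instructions, stages=STAGES):
    """Cycle on which the last of n independent instructions leaves the pipe."""
    return n_instructions + stages - 1

print(last_finish(4))     # the 4-instruction example above finishes on cycle 9
print(last_finish(1000))  # 1000 instructions in 1005 cycles: ~1 per cycle
```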

So this is the good case for pipelining, and it's the reason the P4 owns at
signal-processing stuff.  Now for a case that's a little more annoying: some
real data dependencies:

pand    mm0, mm1
pxor    mm2, mm3
pandn   mm2, mm4
por     mm2, mm5

cycle  SM   RR   SS   FPU_RF   ALU1   ALU2
   1   pa
   2   px   pa
   3   pn   px   pa
   4   po   pn   px    pa
   5        po   pn    px       pa
   6             po    pn       px    pa
   7             po    pn             px    <-- data stall
   8                   po       pn
   9                   po             pn    <-- data stall
  10                            po
  11                                  po

And this is why the latency is 2: if you use the result immediately, there is a
stall, because the processor has to wait for the previous computation to finish.
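To make the stall mechanics concrete, here is a toy in-order, single-pipe model of the two tables above. It is my own sketch with assumed timing, not AMD's actual scheduler: an op enters ALU1 no earlier than cycle i+4 (four front-end stages: SM, RR, SS, FPU_RF), no earlier than one cycle after the previous op, and only once its inputs are ready; with the 2-cycle latency, a result is usable by a dependent op entering ALU1 two cycles after the producer did.

```python
LATENCY = 2      # pipelined MMX ALU latency in cycles
FRONT_END = 4    # SM, RR, SS, FPU_RF

def simulate(ops):
    """ops: list of (name, src_regs, dst_reg); returns {name: cycle it leaves ALU2}."""
    ready = {}        # reg -> first cycle a consumer of it may enter ALU1
    finish = {}
    prev_alu1 = 0
    for i, (name, srcs, dst) in enumerate(ops, start=1):
        deps = max((ready.get(r, 0) for r in srcs), default=0)
        alu1 = max(i + FRONT_END, prev_alu1 + 1, deps)  # stalls here if deps win
        prev_alu1 = alu1
        ready[dst] = alu1 + LATENCY
        finish[name] = alu1 + 1                         # ALU2 is the cycle after ALU1
    return finish

# The dependent example above: pandn and por both read mm2, written just before.
chain = [("pand",  ("mm0", "mm1"), "mm0"),
         ("pxor",  ("mm2", "mm3"), "mm2"),
         ("pandn", ("mm2", "mm4"), "mm2"),
         ("por",   ("mm2", "mm5"), "mm2")]
print(simulate(chain))  # pandn and por each lose one cycle to a data stall
```

Running it reproduces both tables: with no dependencies the four ops leave ALU2 on cycles 6-9 (one per cycle), while the mm2 chain finishes pandn on cycle 9 and por on cycle 11, matching the two marked stalls.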

I hope I'm not explaining stuff you already knew.

anthony




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.