Computer Chess Club Archives



Subject: Re: pipelining example

Author: Gerd Isenberg

Date: 11:48:17 12/08/03


On December 08, 2003 at 12:23:59, Anthony Cozzie wrote:

>>Thanks Anthony,
>>
>>I found even more information on Floating-Point Pipeline Stages at page 220ff.
>>I missed that information in the Athlon-64/Opteron guide :-)
>>
>>I guess with "parts" you already mean these pipelined execution stages 7..15, or
>>only 12..15?
>
>There is some hunk of digital logic that performs the operation: a chain of
>gates.  AMD figured out the critical path, and broke it in two, and each section
>is a pipeline stage.  Back in the ancient 386 days, there was no pipeline: fetch
>+decode+alu+cache+store all happened in 1 clock.


Yes, I studied a kind of hardware-oriented computer science in the 70s and
early 80s. I had to debug a self-designed and self-built
8086/8089 (I/O processor) system with an ICE and a logic analyzer ;-)
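
(To restate your critical-path point with made-up numbers: if the whole chain
of gates for, say, pand needs 2 ns, an unpipelined design cannot clock above
500 MHz. Cut the chain into two 1 ns stages with a latch in between and the
clock can go to 1 GHz; a single result still needs the full 2 ns, i.e. two
cycles of latency, but a new operation can enter the pipe every cycle.)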


>
>>Does the 2-cycle MMX latency include the sequence of all stages from 7..15 (some
>>skipped), or only 12..15?
>
>ALU only.
>
>>What is the maximum throughput of two-cycle MMX DirectPath (FADD/FMUL)
>>instructions, with others scheduled independently?
>>I had the impression of getting a max. throughput of < 0.5, but not one.
>>
>>Gerd
>
>1/cycle.

Hmmm... that contradicts my empirical finding of 2/cycle ;-)

>
>I had to do this a lot in 18-347 :)

Aha, I was not familiar with that. Google says something about
"Introduction to Computer Architecture".

>
>Let's suppose we have the mythical "single-pipelined athlon", which has only one
>floating point pipe, and executes instructions in-order, and our instruction
>list is:
>
>pand   (pa)
>pxor   (px)
>pandn  (pn)
>por    (po)
>
>and the operands are such that we have no data dependencies, and no cache
>misses, etc.  "The good case" if you will.  Of course, most of the pipeline
>stages make no sense (we don't need a rename file for a non-superscalar
>processor) but this makes things easier :)
>
>So, if we start the count from when the instructions have already been decoded
>and sent to the floating point section, the processor looks like this:
>
>cycle  SM   RR   SS   FPU_RF   ALU1   ALU2
>   1   pa
>   2   px   pa
>   3   pn   px   pa
>   4   po   pn   px    pa
>   5   T1   po   pn    px       pa
>   6             po    pn       px    pa
>   7                   po       pn    px
>   8                            po    pn
>   9                                  po
>

What do those stages mean?

SM?
RR = register renaming?
SS?
FPU_RF?   register fetch?


>so you can see that the processor can finish 1 instruction/cycle.

1 instruction/cycle/stage?

There may be independent leading and trailing instructions without stalls,
filling the gaps, so that each cycle keeps four or more pipe stages busy?
I thought ALU1 and ALU2 were disjoint for MMX, not sequential?
Or do I confuse the ALU1/ALU2 stages with FADD/FMUL?
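
To be sure I read your table correctly, here is a toy C model of that mythical
single in-order pipe (no dependencies, no stalls). The stage names are copied
from your table purely as labels; I am not assuming anything about what they
really do, and it does not model the stalls of your second example:

#include <stdio.h>

#define NSTAGES 6
#define NINSTR  4

int main(void)
{
    const char *stage[NSTAGES] = { "SM", "RR", "SS", "FPU_RF", "ALU1", "ALU2" };
    const char *instr[NINSTR]  = { "pa", "px", "pn", "po" };

    /* pipe[s] = index of the instruction currently in stage s, or -1 if empty */
    int pipe[NSTAGES];
    for (int s = 0; s < NSTAGES; s++) pipe[s] = -1;

    printf("cycle");
    for (int s = 0; s < NSTAGES; s++) printf(" %6s", stage[s]);
    printf("\n");

    int next = 0;  /* next instruction to issue into the first stage */
    for (int cycle = 1; cycle <= NINSTR + NSTAGES - 1; cycle++) {
        /* advance: the last stage retires, everything else moves one stage on */
        for (int s = NSTAGES - 1; s > 0; s--) pipe[s] = pipe[s - 1];
        pipe[0] = (next < NINSTR) ? next++ : -1;

        printf("%5d", cycle);
        for (int s = 0; s < NSTAGES; s++)
            printf(" %6s", pipe[s] >= 0 ? instr[pipe[s]] : "");
        printf("\n");
    }
    return 0;
}

It prints the same nine-cycle picture: after the five-cycle fill, one
instruction leaves ALU2 per cycle, which I guess is the 1 instruction/cycle
you mean.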


>
>So this is the good case for pipelining, and it's the reason the P4 owns on
>signals stuff.  Now for a case that's a little more annoying: some real data
>dependencies
>
>pand    mm0, mm1
>pxor    mm2, mm3
>pandn   mm2, mm4
>por     mm2, mm5
>
>cycle  SM   RR   SS   FPU_RF   ALU1   ALU2
>   1   pa
>   2   px   pa
>   3   pn   px   pa
>   4   po   pn   px    pa
>   5        po   pn    px       pa
>   6             po    pn       px    pa
>   7             po    pn             px    <-- data stall
>   8                   po       pn
>   9                   po             pn    <-- data stall
>  10                            po
>  11                                  po
>
>And this is why the latency is 2: if you use the result immediately, there is a
>stall, because the processor has to wait for the previous computation to finish.
>
>I hope I'm not explaining stuff you already knew.


Absolutely not. I have some vague knowledge...

My naive mental picture of the two-cycle latency of a typical MMX instruction
is/was the following (based on my empirical experience):

I imagine some sub-cycle stages, triggered on rising or falling edges: about
one cycle for register rename/fetch/store/latching or whatever, and the rest
pure ALU latency, e.g. of the combinational adder. Both hardware resources
exist at least twice, so up to four or even more instructions may be executed
simultaneously ;-)
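
Something like the following sketch is how I arrived at that 2/cycle number:
compare a dependent chain against independent chains with rdtsc. This is not
my exact test, just the idea; the constants and the iteration count are
arbitrary, it assumes GCC with -mmmx, and a serious test would also serialize
rdtsc with cpuid or the like:

#include <stdio.h>
#include <mmintrin.h>   /* MMX intrinsics: _mm_xor_si64, _mm_add_pi32, ... */
#include <x86intrin.h>  /* __rdtsc() */

#define ITERS 1000000

int main(void)
{
    __m64 k1 = _mm_set_pi32(0x0f0f0f0f, 0x33333333);
    __m64 k2 = _mm_set_pi32(0x01010101, 0x07070707);
    __m64 a  = _mm_set_pi32(0x12345678, 0x2bcdef01);

    /* Dependent chain: every op needs the previous result, so each one
       pays the full ALU latency. */
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        a = _mm_xor_si64(a, k1);
        a = _mm_add_pi32(a, k2);
        a = _mm_xor_si64(a, k1);
        a = _mm_add_pi32(a, k2);
    }
    unsigned long long t1 = __rdtsc();

    /* Four independent chains: the scheduler may overlap them on the MMX
       ALUs, so cycles/op should drop towards the throughput limit. */
    __m64 b0 = a, b1 = k1, b2 = k2, b3 = a;
    unsigned long long t2 = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        b0 = _mm_xor_si64(b0, k1);
        b1 = _mm_add_pi32(b1, k2);
        b2 = _mm_xor_si64(b2, k1);
        b3 = _mm_add_pi32(b3, k2);
    }
    unsigned long long t3 = __rdtsc();

    /* read the results before emms so the compiler cannot drop the loops */
    unsigned sink = _mm_cvtsi64_si32(a) ^ _mm_cvtsi64_si32(b0)
                  ^ _mm_cvtsi64_si32(b1) ^ _mm_cvtsi64_si32(b2)
                  ^ _mm_cvtsi64_si32(b3);
    _mm_empty();  /* leave MMX state before the floating-point printf below */

    printf("dependent chain : %.2f cycles/op\n", (t1 - t0) / (4.0 * ITERS));
    printf("independent ops : %.2f cycles/op\n", (t3 - t2) / (4.0 * ITERS));
    printf("sink: %08x\n", sink);
    return 0;
}

If the independent version settles near 0.5 cycles/op, that would match my
2/cycle observation; if it never drops below 1 cycle/op, then your 1/cycle
figure is the whole story.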

Cheers,
Gerd


>
>anthony


