Computer Chess Club Archives



Subject: Re: pipelining example

Author: Anthony Cozzie

Date: 12:53:35 12/08/03



On December 08, 2003 at 14:48:17, Gerd Isenberg wrote:

>On December 08, 2003 at 12:23:59, Anthony Cozzie wrote:
>
>>>Thanks Anthony,
>>>
>>>I found even more information on Floating-Point Pipeline Stages on page 220ff.
>>>I missed that information in the Athlon-64/Opteron guide :-)
>>>
>>>I guess with "parts" you already mean these pipelined execution states 7..15, or
>>>only 12..15?
>>
>>There is some hunk of digital logic that performs the operation: a chain of
>>gates.  AMD figured out the critical path, and broke it in two, and each section
>>is a pipeline stage.  Back in the ancient 386 days, there was no pipeline: fetch
>>+decode+alu+cache+store all happened in 1 clock.
>
>
>Yes, I studied a kind of hardware-oriented computer science in the '70s and early
>'80s. I had to debug a self-constructed and -built
>8086/8089 (I/O processor) system with an ICE and a logic analyzer ;-)
>
>
>>
>>>Does the 2-cycle MMX latency include the sequence of all states from 7..15 (some
>>>skipped), or only 12..15?
>>
>>ALU only.
>>
>>>What is the maximum MMX throughput of two-cycle MMX direct-path (FADD/FMUL)
>>>instructions, with others independently scheduled?
>>>I had the impression of getting a max. throughput of < 0.5 cycles/instruction, not one.
>>>
>>>Gerd
>>
>>1/cycle.
>
>Hmmm... that contradicts my empirical discovery of 2/cycle ;-)

Sorry, I meant _each ALU_ can do 1/cycle.
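In other words, the two numbers are consistent (a back-of-the-envelope check; the two-ALU count comes from this thread, not from a datasheet):

```python
# Peak MMX throughput on this machine: each ALU retires one
# direct-path MMX instruction per cycle, and there are two such
# ALUs (FADD and FMUL), so the peak is 2 instructions/cycle.
alus = 2                  # FADD + FMUL
per_alu = 1               # instructions/cycle per ALU
print(alus * per_alu)     # -> 2, matching the measured 2/cycle
```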

>>
>>I had to do this a lot in 18-347 :)
>
>Aha, I'm not familiar with it. Google says something about
>"Introduction to Computer Architecture".

Yep.  Also known as "write a 5-stage MIPS processor in Verilog".

>>Let's suppose we have the mythical "single-pipelined Athlon" which has only one
>>floating point pipe, and executes instructions in-order, and our instruction
>>list is:
>>
>>pand   (pa)
>>pxor   (px)
>>pandn  (pn)
>>por    (po)
>>
>>and the operands are such that we have no data dependencies, and no cache
>>misses, etc.  "The good case" if you will.  Of course, most of the pipeline
>>stages make no sense (we don't need a rename file for a non-superscalar
>>processor) but this makes things easier :)
>>
>>So, if we start the count from when the instructions have already been decoded
>>and sent to the floating point section, the processor looks like this:
>>
>>cycle  SM   RR   SS   FPU_RF   ALU1   ALU2
>>   1   pa
>>   2   px   pa
>>   3   pn   px   pa
>>   4   po   pn   px    pa
>>   5   T1   po   pn    px       pa
>>   6             po    pn       px    pa
>>   7                   po       pn    px
>>   8                            po    pn
>>   9                                  po
>>
>
>What do those states mean?
>
>SM ?
>RR = register renaming ?
>SS ?
>FPU_RF ?   register fetch?

Same names as on the page in the manual.

>>so you can see that the processor can finish 1 instruction/cycle.
>
>1 instruction/cycle/state?
>
>There may be independent leading and trailing instructions without stalls,
>filling the gaps, so that each cycle keeps four or more pipe states busy?
>I thought ALU1 and ALU2 were disjoint for MMX, not sequential?
>Or do I confuse the ALU1/ALU2 states with FADD/FMUL?

ALU1/ALU2 is FADD/FMUL.  There are only 2 (not 4) because the latency is only 2.
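To make the two tables concrete, here is a toy Python sketch of that mythical single-pipelined, in-order machine (my own model, not AMD's actual logic): instructions march through the six stages, may never pass each other, and may not enter ALU1 until every producer has left ALU2. It reproduces the 9-cycle and 11-cycle totals from the tables.

```python
# Toy in-order, single-pipe simulator for the tables above.
STAGES = ["SM", "RR", "SS", "FPU_RF", "ALU1", "ALU2"]
ALU1 = STAGES.index("ALU1")
DONE = len(STAGES)

def last_alu2_cycle(deps):
    """Cycle on which the last instruction occupies ALU2.

    deps[i] is a set of earlier instruction indices (d < i) whose
    results instruction i reads; the 2-cycle latency shows up as a
    data stall in front of ALU1.
    """
    n = len(deps)
    stage = [-1] * n                  # -1 = not yet in SM
    cycle = last = 0
    while any(s < DONE for s in stage):
        cycle += 1
        new = list(stage)
        for i in range(n):            # oldest first, strictly in order
            if stage[i] == DONE:
                continue
            nxt = stage[i] + 1
            if i > 0 and new[i - 1] != DONE and new[i - 1] <= nxt:
                continue              # structural stall: can't pass the
                                      # instruction ahead in the pipe
            if nxt == ALU1 and any(new[d] < DONE for d in deps[i]):
                continue              # data stall: operand not ready yet
            new[i] = nxt
        stage = new
        if DONE - 1 in stage:         # someone is in ALU2 this cycle
            last = cycle
    return last

# First table: pand, pxor, pandn, por with no dependencies
print(last_alu2_cycle([set(), set(), set(), set()]))   # -> 9
# Second table: pandn reads pxor's mm2, por reads pandn's mm2
print(last_alu2_cycle([set(), set(), {1}, {2}]))       # -> 11
```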

>>
>>So this is the good case for pipelining, and it's the reason the P4 owns on
>>signal-processing stuff.  Now for a case that's a little more annoying: some real
>>data dependencies:
>>
>>pand    mm0, mm1
>>pxor    mm2, mm3
>>pandn   mm2, mm4
>>por     mm2, mm5
>>
>>cycle  SM   RR   SS   FPU_RF   ALU1   ALU2
>>   1   pa
>>   2   px   pa
>>   3   pn   px   pa
>>   4   po   pn   px    pa
>>   5        po   pn    px       pa
>>   6             po    pn       px    pa
>>   7             po    pn             px    <-- data stall
>>   8                   po       pn
>>   9                   po             pn    <-- data stall
>>  10                            po
>>  11                                  po
>>
>>And this is why the latency is 2: if you use the result immediately, there is a
>>stall, because the processor has to wait for the previous computation to finish.
>>
>>I hope I'm not explaining stuff you already knew.
>
>
>Absolutely not. I have some vague knowledge...
>
>My naive imagination of the two-cycle latency of a typical MMX instruction is/was
>the following (from my empirical experience):
>
>I imagine some sub-cycle states, triggered on rising or trailing edges.
>About one cycle for register(s)-rename/fetch/store/latching or whatever and the
>rest pure ALU latency, e.g. of the combinational adder. Both hardware resources
>exist at least twice. Therefore up to four or even more instructions may be
>executed simultaneously ;-)

You are thinking of this in the wrong way.  Don't think of it as states, think
of it as stages.  It's like Subway: one person puts the meat on the sandwich,
one person puts the vegetables on, and one person rings you up.

But you are right about the instructions: in the best case there are 4
instructions in the ALUs every cycle.  Now imagine the entire processor: there
will be 50 or 100 instructions in flight at any given time :)
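A rough Little's-law sketch of that in-flight number (the depth and width here are illustrative guesses, not Athlon datasheet values):

```python
# Instructions in flight ~ pipeline depth x sustained issue width.
# Both numbers below are assumptions for illustration only.
depth = 17               # assumed stages from fetch to retire
width = 3                # assumed sustained instructions/cycle
print(depth * width)     # -> 51, in the right "50 or 100" ballpark
```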

>Cheers,
>Gerd
>
>
>>
>>anthony





Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.