Author: Gerd Isenberg
Date: 11:48:17 12/08/03
On December 08, 2003 at 12:23:59, Anthony Cozzie wrote:

>>Thanks Anthony,
>>
>>I found even more information on Floating-Point Pipeline Stages at page 220ff.
>>I missed that information in the Athlon-64/Opteron guide :-)
>>
>>I guess with "parts" you already mean these pipelined execution stages 7..15,
>>or only 12..15?
>
>There is some hunk of digital logic that performs the operation: a chain of
>gates. AMD figured out the critical path, and broke it in two, and each section
>is a pipeline stage. Back in the ancient 386 days, there was no pipeline:
>fetch+decode+alu+cache+store all happened in 1 clock.

Yes, I studied a kind of hardware-oriented computer science in the 70s and
early 80s. I had to debug a self-constructed and self-built 8086/8089
(I/O-processor) system with an ICE and a logic analyzer ;-)

>>Does the 2-cycle MMX latency include the sequence of all stages from 7..15
>>(some skipped), or only 12..15?
>
>ALU only.
>
>>What is the maximum MMX throughput of two-cycle MMX direct-path (FADD/FMUL)
>>instructions, with others independently scheduled?
>>I had the impression of getting a max. throughput of < 0.5, but not one.
>>
>>Gerd
>
>1/cycle.

Hmmm... that contradicts my empirical discovery of 2/cycle ;-)

>I had to do this a lot in 18-347 :)

Aha, I'm not familiar with it. Google says something about "Introduction to
Computer Architecture".

>Let's suppose we have the mythical "single-pipelined Athlon" which has only
>one floating point pipe, and executes instructions in-order, and our
>instruction list is:
>
>pand  (pa)
>pxor  (px)
>pandn (pn)
>por   (po)
>
>and the operands are such that we have no data dependencies, and no cache
>misses, etc. "The good case" if you will.
>Of course, most of the pipeline stages make no sense (we don't need a rename
>file for a non-superscalar processor) but this makes things easier :)
>
>So, if we start the count from when the instructions have already been decoded
>and sent to the floating point section, the processor looks like this:
>
>cycle  SM  RR  SS  FPU_RF  ALU1  ALU2
>  1    pa
>  2    px  pa
>  3    pn  px  pa
>  4    po  pn  px  pa
>  5    T1  po  pn  px      pa
>  6            po  pn      px    pa
>  7                po      pn    px
>  8                        po    pn
>  9                              po

What do those stage names mean? SM? RR = register renaming? SS? FPU_RF =
register fetch?

>so you can see that the processor can finish 1 instruction/cycle.

1 instruction/cycle/stage? There may be independent leading and trailing
instructions without stalls, filling the gaps, so that each cycle is busy with
four or more pipe stages? I thought ALU1 and ALU2 were disjoint for MMX and not
sequential? Or do I confuse the ALU1/ALU2 stages with FADD/FMUL?

>So this is the good case for pipelining, and it's the reason the P4 owns on
>signals stuff. Now for a case that's a little more annoying: some real data
>dependencies:
>
>pand  mm0, mm1
>pxor  mm2, mm3
>pandn mm2, mm4
>por   mm2, mm5
>
>cycle  SM  RR  SS  FPU_RF  ALU1  ALU2
>  1    pa
>  2    px  pa
>  3    pn  px  pa
>  4    po  pn  px  pa
>  5        po  pn  px      pa
>  6            po  pn      px    pa
>  7            po  pn            px    <-- data stall
>  8                po      pn
>  9                po            pn    <-- data stall
> 10                        po
> 11                              po
>
>And this is why the latency is 2: if you use the result immediately, there is
>a stall, because the processor has to wait for the previous computation to
>finish.
>
>I hope I'm not explaining stuff you already knew.

Absolutely not. I have some vague knowledge... My naive imagination of the two
cycles of latency of a typical MMX instruction is/was the following (due to my
empirical experience): I imagine some sub-cycle states, triggered on rising or
trailing edges. About one cycle for register rename/fetch/store/latching or
whatever, and the rest pure ALU latency, e.g. of the combinational adder. Both
hardware resources exist at least twice.
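Anthony's two diagrams can be reproduced with a small toy model. This is a
hypothetical sketch, not AMD's documented microarchitecture: it assumes four
front-end stages (SM, RR, SS, FPU_RF) ahead of a two-stage ALU, strictly
in-order issue down a single pipe, and that a consumer may not enter ALU1 until
its producer has left ALU2 (which is exactly what makes the latency 2).

```python
# Toy cycle model of the "single-pipelined Athlon" above (a sketch, not a
# description of the real chip). Stage depth and names follow the diagrams:
# SM, RR, SS, FPU_RF, then the two ALU stages that give the 2-cycle latency.

def simulate(instrs):
    """instrs: list of (name, dest_reg, source_regs).
    Returns {name: cycle its result leaves ALU2}."""
    finish = {}        # name -> completion cycle
    last_writer = {}   # register -> name of the last instruction writing it
    prev_alu1 = 0      # cycle the previous instruction occupied ALU1
    for i, (name, dest, srcs) in enumerate(instrs):
        # cycle by which all source operands have left ALU2
        ready = max((finish[last_writer[r]] for r in srcs if r in last_writer),
                    default=0)
        # instruction i reaches ALU1 no earlier than cycle i+5 (four front-end
        # stages), one cycle after its in-order predecessor, and one cycle
        # after its operands are ready
        alu1 = max(i + 5, prev_alu1 + 1, ready + 1)
        finish[name] = alu1 + 1          # ALU2 is the cycle after ALU1
        last_writer[dest] = name
        prev_alu1 = alu1
    return finish

# "The good case": no data dependencies -> one result per cycle
independent = [("pa", "mm0", []), ("px", "mm2", []),
               ("pn", "mm4", []), ("po", "mm6", [])]

# The dependent chain from the second diagram -> two stall cycles
dependent = [("pa", "mm0", ["mm0", "mm1"]), ("px", "mm2", ["mm2", "mm3"]),
             ("pn", "mm2", ["mm2", "mm4"]), ("po", "mm2", ["mm2", "mm5"])]

print(simulate(independent))   # pa..po finish at cycles 6, 7, 8, 9
print(simulate(dependent))     # pa..po finish at cycles 6, 7, 9, 11
```

Running it reproduces both diagrams: without dependencies the four results
retire in cycles 6..9 (1 instruction/cycle), and with the dependent chain
pandn and por each slip one cycle, finishing in cycles 9 and 11.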
Therefore up to four or even more instructions may be executed simultaneously
;-)

Cheers,
Gerd

>anthony
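The hypothesis above (duplicated register/ALU resources allowing more than one
MMX op per cycle) can be made concrete with the same kind of back-of-the-envelope
counting. This is a sketch of that assumption, not measured Athlon behaviour:
the pipe count and the issue rule are the things being hypothesized.

```python
import math

# Sketch of the hypothesis above (an assumption, not documented Athlon
# behaviour): with `pipes` symmetric MMX units, a stream of independent
# operations is issue-bound, while a single dependent chain is latency-bound.

def toy_cycles(n, pipes=2, latency=2, dependent=False):
    """Cycles to finish n MMX ops of the given latency in a toy model."""
    if dependent:
        # each op starts only when the previous result is ready
        return latency * n
    # one op issues per pipe per cycle; the last op to issue starts at
    # cycle ceil(n / pipes) and still needs `latency` cycles to finish
    return math.ceil(n / pipes) - 1 + latency

print(toy_cycles(100))                  # 51 cycles -> ~2 ops/cycle
print(toy_cycles(100, dependent=True))  # 200 cycles -> 0.5 ops/cycle
```

With two pipes the independent stream approaches 2 ops/cycle (matching the
empirical 2/cycle mentioned earlier) while a fully dependent chain drops to
0.5 ops/cycle, so both observations can coexist in one model.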