Author: Anthony Cozzie
Date: 09:23:59 12/08/03
>Thanks Anthony,
>
>I found even more information on the floating-point pipeline stages at page 220ff.
>I missed that information in the Athlon-64/Opteron guide :-)
>
>I guess with "parts" you already mean these pipelined execution stages 7..15, or
>only 12..15?

There is some hunk of digital logic that performs the operation: a chain of gates. AMD figured out the critical path and broke it in two, and each section is a pipeline stage. Back in the ancient 386 days there was no pipeline: fetch+decode+alu+cache+store all happened in one clock.

>Does the 2-cycle MMX latency include the sequence of all states from 7..15 (some
>skipped), or only 12..15?

ALU only.

>What is the maximum MMX throughput of two-cycle MMX direct-path (FADD/FMUL)
>instructions, with others independently scheduled?
>I had the impression I would get a max throughput of < 0.5, but not one.
>
>Gerd

1/cycle. I had to do this a lot in 18-347 :)

Let's suppose we have the mythical "single-pipelined Athlon", which has only one floating-point pipe and executes instructions in order, and our instruction list is:

  pand   (pa)
  pxor   (px)
  pandn  (pn)
  por    (po)

and the operands are such that we have no data dependencies, no cache misses, etc. "The good case", if you will. Of course, most of the pipeline stages make no sense (we don't need a rename file for a non-superscalar processor), but this makes things easier :)

So, if we start the count from when the instructions have already been decoded and sent to the floating-point section, the processor looks like this:

  cycle  SM  RR  SS  FPU_RF  ALU1  ALU2
    1    pa
    2    px  pa
    3    pn  px  pa
    4    po  pn  px    pa
    5        po  pn    px    pa
    6            po    pn    px    pa
    7                  po    pn    px
    8                        po    pn
    9                              po

so you can see that the processor can finish 1 instruction/cycle. This is the good case for pipelining, and it's the reason the P4 owns at signal-processing code.
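(To make the fill/drain behavior concrete, here is a toy in-order pipeline simulator using the stage names from the table above. The six-stage model and the "result ready after ALU2" rule are my simplifications for illustration, not AMD's exact microarchitecture; `simulate` and its interface are mine.)

```python
STAGES = ["SM", "RR", "SS", "FPU_RF", "ALU1", "ALU2"]

def simulate(n_instrs, deps):
    """In-order, single-issue pipeline. deps maps an instruction index to
    the set of earlier instruction indices whose results it reads.
    Returns {index: cycle in which the instruction occupied ALU2}."""
    pipe = [None] * len(STAGES)   # pipe[s] = instruction occupying stage s
    done = {}                     # index -> last cycle it spent in ALU2
    fetched = 0
    cycle = 0
    while fetched < n_instrs or any(i is not None for i in pipe):
        cycle += 1
        if pipe[-1] is not None:          # instruction leaves ALU2; its
            done[pipe[-1]] = cycle - 1    # result is now available
            pipe[-1] = None
        # advance back-to-front so bubbles open up correctly
        for s in range(len(STAGES) - 2, -1, -1):
            if pipe[s] is None or pipe[s + 1] is not None:
                continue                  # stage empty, or next stage busy
            if s == 3 and not all(p in done for p in deps.get(pipe[s], ())):
                continue                  # data stall: operand not ready
            pipe[s + 1], pipe[s] = pipe[s], None
        if pipe[0] is None and fetched < n_instrs:
            pipe[0] = fetched             # send the next instruction in
            fetched += 1
    return done

# Four independent ops (pa, px, pn, po): once the pipe is full, one
# instruction retires every cycle, exactly as in the table.
print(simulate(4, {}))   # {0: 6, 1: 7, 2: 8, 3: 9}
```

The same function handles the dependent case below by passing a non-empty `deps`.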
Now for a case that's a little more annoying: some real data dependencies.

  pand   mm0, mm1
  pxor   mm2, mm3
  pandn  mm2, mm4
  por    mm2, mm5

  cycle  SM  RR  SS  FPU_RF  ALU1  ALU2
    1    pa
    2    px  pa
    3    pn  px  pa
    4    po  pn  px    pa
    5        po  pn    px    pa
    6            po    pn    px    pa
    7            po    pn          px    <-- data stall (pandn waits on pxor)
    8                  po    pn
    9                  po          pn    <-- data stall (por waits on pandn)
   10                        po
   11                              po

And this is why the latency is 2: if you use a result immediately, there is a stall, because the processor has to wait for the previous computation to finish.

I hope I'm not explaining stuff you already knew.

anthony
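(The practical consequence can be sketched with simple latency/throughput arithmetic. This is a back-of-the-envelope model, counting ALU cycles only and ignoring the front-end stages in the table; the constants come from the 2-cycle latency / 1-per-cycle throughput figures above, and `chain_cycles` is my own name.)

```python
LATENCY = 2      # cycles until an MMX ALU result can be consumed
THROUGHPUT = 1   # a new MMX ALU op can issue every cycle

def chain_cycles(n_ops, dependent):
    """Cycles from issuing the first op to the last result being ready."""
    if dependent:
        # each op must wait for the previous op's result
        return n_ops * LATENCY
    # ops issue back to back; only the last one's latency is exposed
    return (n_ops - 1) * THROUGHPUT + LATENCY

# One 4-op dependent chain leaves the ALU half idle:
print(chain_cycles(4, dependent=True))    # 8 cycles for 4 ops

# Interleave two independent 4-op chains and every op's input is ready
# by the time it issues, so the 8 ops behave like an independent stream:
print(chain_cycles(8, dependent=False))   # 9 cycles for 8 ops
```

This is why interleaving independent work is the standard trick for hiding the 2-cycle latency and getting back to the 1/cycle throughput.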
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.