Author: Gerd Isenberg
Date: 06:49:06 10/16/03
<snip>
>>P4 (and AMD64) has 8 128-bit SSE2 registers that can be treated (among other
>>things) as 4 32-bit floats or 2 64-bit floats. You can do some operations on
>>those registers in parallel, for example you can add two float vectors of length
>>2 using one instruction. I am not sure if current P4 implementation performs
>>that addition in one cycle (that is definitely not so for Opteron/AMD64), but
>>nothing in theory prevents this.
>>
>
>IIRC, two cycles latency for most common logical, arithmetical and shift
>mmxReg[,mmxReg]-instructions on P4 (movdqa reg,reg takes 6!), SIMD float as well
>as double and integer (plus 1 cycle throughput).

Oops, sorry, float and double arithmetic instructions have higher latency, on P4
as well as on AMD64. Two cycles is only true for the SSE2 integer instructions
(the ones I have used so far), such as pand, por, pxor, padd...

Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization Reference Manual
(latency, throughput in cycles):

ADDPS xmm, xmm    4, 2
ADDPD xmm, xmm    4, 2
MULPD xmm, xmm    6, 2

> I think same for AMD64, so
>called double direct path instructions, decoded as two 64-bit macro ops.

Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
(latency in cycles):

ADDPS xmm, xmm    5
ADDPD xmm, xmm    5
MULPD xmm, xmm    5

Gerd
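For anyone who wants to try the packed operations discussed above, here is a minimal sketch in C using the SSE2 intrinsics from <emmintrin.h>. The function names (add2_double, add4_float, add4_int32) are made up for illustration and are not from any library. The scalar fallback is only there so the file still compiles where SSE2 is not enabled. The sketch shows which instruction each intrinsic normally maps to; it does not measure latency or throughput.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (also pulls in the SSE float ones) */
#endif

/* Hypothetical helper: add two vectors of 2 doubles (normally one ADDPD). */
void add2_double(const double *a, const double *b, double *out)
{
#if defined(__SSE2__)
    __m128d va = _mm_loadu_pd(a);            /* unaligned load of 2 doubles */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));  /* packed double add */
#else
    out[0] = a[0] + b[0];                    /* scalar fallback */
    out[1] = a[1] + b[1];
#endif
}

/* Hypothetical helper: add two vectors of 4 floats (normally one ADDPS). */
void add4_float(const float *a, const float *b, float *out)
{
#if defined(__SSE2__)
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* packed single add */
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}

/* Hypothetical helper: add two vectors of 4 32-bit ints (normally one PADDD). */
void add4_int32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));  /* packed int add */
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

On x86-64 compilers SSE2 is normally enabled by default; on 32-bit x86 with GCC you would typically need -msse2. The unaligned loads and stores are used here only so the caller does not have to worry about 16-byte alignment.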
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.