Computer Chess Club Archives


Subject: Re: P4?

Author: Gerd Isenberg

Date: 06:49:06 10/16/03


<snip>
>>P4 (and AMD64) has eight 128-bit SSE2 registers that can be treated (among other
>>things) as 4 32-bit floats or 2 64-bit floats. You can do some operations on
>>those registers in parallel, for example you can add two float vectors of length
>>2 using one instruction. I am not sure if the current P4 implementation performs
>>that addition in one cycle (that is definitely not so for Opteron/AMD64), but
>>nothing in theory prevents this.
>>
>
>IIRC, two cycles latency for most common logical, arithmetical and shift
>mmxReg[,mmxReg]-instructions on P4 (movdqa reg,reg takes 6!), SIMD float as well
>as double and integer (plus 1 cycle throughput).


Oops, sorry: float and double arithmetic instructions have higher latency, on P4
as well as on AMD64. Two cycles is only true for the SSE2 integer instructions
(the ones I have used so far), such as pand, por, pxor, padd...
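
For illustration only, here is a minimal sketch of what those SSE2 integer instructions look like from C, treating one 128-bit register as a pair of 64-bit words. The helper names (and_xor_pair, add_pair) are made up for this example, and the scalar fallback is an assumption added so the code still compiles where SSE2 is not available. The intrinsics themselves come from <emmintrin.h>.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: out[i] = (a[i] & b[i]) ^ c[i] for i = 0, 1.
   With SSE2 this is one pand and one pxor on a full 128-bit register. */
static void and_xor_pair(const uint64_t a[2], const uint64_t b[2],
                         const uint64_t c[2], uint64_t out[2])
{
#if HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    __m128i r  = _mm_xor_si128(_mm_and_si128(va, vb), vc); /* pand, pxor */
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Scalar fallback (assumption): same result, two 64-bit operations each. */
    out[0] = (a[0] & b[0]) ^ c[0];
    out[1] = (a[1] & b[1]) ^ c[1];
#endif
}

/* Hypothetical helper: out[i] = a[i] + b[i] (wrapping) for i = 0, 1.
   With SSE2 this is a single paddq. */
static void add_pair(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
#if HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb)); /* paddq */
#else
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
#endif
}
```

The unaligned loads and stores are used only to keep the sketch free of alignment requirements. Real code would probably keep the data 16-byte aligned and in registers.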

Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization Reference Manual

(latency, throughput):

ADDPS xmm, xmm    4, 2
ADDPD xmm, xmm    4, 2
MULPD xmm, xmm    6, 2
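
To make the (latency, throughput) pairs a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes the usual reading of the Intel numbers: latency is the number of cycles until a result is available, and throughput is the number of cycles before the same instruction can be issued again. The function names are made up for this example, and the model is idealized (it ignores decoding, port conflicts and memory access).

```c
/* Idealized estimate, not a measurement.
   n dependent instructions: each one has to wait for the previous result. */
static unsigned long dependent_chain_cycles(unsigned long n, unsigned latency)
{
    return n * latency;
}

/* n independent instructions of the same kind: a new one can start every
   `throughput` cycles, and the last one finishes `latency` cycles after
   it started. */
static unsigned long independent_cycles(unsigned long n, unsigned latency,
                                        unsigned throughput)
{
    if (n == 0)
        return 0;
    return (n - 1) * throughput + latency;
}
```

Under these assumptions, eight ADDPD (4, 2) that each depend on the previous one come to about 8 * 4 = 32 cycles, while eight independent ones come to about 7 * 2 + 4 = 18 cycles.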


> I think same for AMD64, so
>called double direct path instructions, decoded as two 64-bit macro ops.

Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors

Latency:

ADDPS xmm, xmm    5
ADDPD xmm, xmm    5
MULPD xmm, xmm    5

Gerd
