Computer Chess Club Archives


Subject: Re: P4?

Author: Robert Hyatt

Date: 13:16:06 10/15/03


On October 15, 2003 at 13:37:08, Eugene Nalimov wrote:

>On October 15, 2003 at 13:33:36, Eugene Nalimov wrote:
>
>>On October 15, 2003 at 13:26:11, Robert Hyatt wrote:
>>
>>>On October 15, 2003 at 11:50:10, Anthony Cozzie wrote:
>>>
>>>>On October 15, 2003 at 10:51:29, Robert Hyatt wrote:
>>>>
>>>>>
>>>>>I blew one bit of the previous calculation.  The C90 is a "super-scalar" sort
>>>>>of vector machine.  Where I said "one floating add per cycle" change that to
>>>>>two.  A single vector instruction does _two_ operations per cycle, not one, and
>>>>>I had simply failed to note that.  That was the main change from the older
>>>>>X-MP and Y-MP, that was introduced on the C90.  Obviously it makes vector
>>>>>performance 2x faster even without the clock speed improvement.  IE for my
>>>>>example:
>>>>>
>>>>>      v0    v1+v2
>>>>>      v3    v4+v5
>>>>>      v6    v0*v3
>>>>>
>>>>>that code will produce _six_ results per cycle, once the chained vector
>>>>>pipeline is filled.  Not the _three_ I had given.
>>>>>
>>>>>_that_ is why the Cray buries the PC in _any_ program that can use vectors.
>>>>>Even though the C90 only runs at 250 MHz.  The T90 runs that up to 500 MHz,
>>>>>and the Cray-3 doubled it again to 1 GHz.  But all MHz/GHz are _not_ created
>>>>>"equal" for those that understand vector operations.
>>>>>
>>>>>The C90 is a 250 MHz machine, not the 100 Vincent pulls from you-know-where.
>>>>>But no 2500 MHz 80x86 can produce 6 64-bit IEEE floating point operations
>>>>>every 4 nanoseconds.
>>>>>
>>>>>I don't know how to explain it better to someone that simply doesn't have a
>>>>>single scintilla of background on understanding the concept of "a vector
>>>>>machine."
>>>>
>>>>What about P4 with SSE2?
>>>>
>>>>According to my P4 optimization manual, P4 can do 2 DP adds in parallel, with
>>>>latency = 4. I *think* that the SSE2 ALUs in P4 are fully pipelined, so that
>>>>means 4 FP ops/clock.  Obviously it can only do this on vectorized data, but the
>>>>same constraint applies to the cray. Unfortunately, P4 was built to do vectorized
>>>>multimedia-ish stuff, not computer chess :(
>>>>
>>>>anthony
>>>
>>>The P4 can't do 4 ops/clock.  Look at the FP processor architecture.  It is
>>>stack-based.  You execute an instruction that says (say) add the top of the
>>>stack and the contents of memory address X.  Or whatever.  And you are busy
>>>for a while.
>>>
>>>Just compare the peak FLOPS rating for a P4 vs a Cray C90, which is the machine
>>>we are talking about.  The T90 is 4x faster but that's not the machine of the
>>>early 90's we are talking about with respect to CB.
>>>
>>>The C90 easily sustains 8 floating point operations per cycle in vector mode.
>>>On one processor.  There are 16 of 'em.  No PC in the world can sustain anywhere
>>>near 8 * 16 * 250,000,000 FLOPS.
>>
>>P4 (and AMD64) has 8 128-bit SSE2 registers that can be treated (among other
>>things) as 4 32-bit floats or 2 64-bit floats. You can do some operations on
>>those registers in parallel, for example you can add two float vectors of length
>>2 using one instruction. I am not sure if the current P4 implementation performs
>>that addition in one cycle (that is definitely not so for Opteron/AMD64), but
>>nothing in theory prevents this.
>>
>>There are a lot of limitations, though. The main one -- you cannot specify a stride
>>when loading a vector from memory, it must always be contiguous.
>>
>>Thanks,
>>Eugene
>
>PS. Cray has vector registers that can hold up to 64 doubles, not just 2 doubles
>as on P4/AMD64. It can load or store an entire vector register using one
>instruction. And -- even more important -- it has matching memory bandwidth, and
>its memory subsystem scales well.
>
>Thanks,
>Eugene


Cray vector registers have been 128 words (8-byte words) long since the C90 came
along.  Up through the YMP there were only 64 8-byte words per register.  And
it _might_ be that the YMP started the 128-word vectors, now that I think about
it, but I am not sure I still have an old YMP manual around.  I think the C90 is
now the oldest thing I have.
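
To put the numbers in this thread side by side, here is a rough sketch in Python. It is only arithmetic, not a model of either machine. It assumes that peak rate is simply ops/cycle * processors * clock, using the 8 * 16 * 250,000,000 figure quoted above, and that a loop over n doubles needs ceil(n / register length) vector instructions per operation. The function names and the array length of 1024 are made up for illustration.

```python
import math


def peak_flops(ops_per_cycle, processors, clock_hz):
    """Peak rate as a plain product; ignores memory bandwidth and startup."""
    return ops_per_cycle * processors * clock_hz


def vector_instructions(n_elements, register_length):
    """Vector instructions needed to cover n_elements once,
    when each instruction handles at most register_length elements."""
    return math.ceil(n_elements / register_length)


# Figures quoted in the thread: 8 ops/cycle, 16 processors, 250 MHz.
c90_peak = peak_flops(8, 16, 250_000_000)
print("C90 peak FLOPS:", c90_peak)

# Hypothetical array length, chosen only for illustration.
n = 1024
for label, length in (
    ("SSE2 register, 2 doubles", 2),
    ("64-word vector register", 64),
    ("128-word vector register", 128),
):
    print(label, "->", vector_instructions(n, length), "instructions")
```

With these inputs the product comes to 32,000,000,000, and the 1024-element loop takes 512, 16 and 8 instructions respectively. The register length only sets how many instructions have to be issued. Whether the machine can keep those registers fed from memory is the separate bandwidth point Eugene raised.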




Last modified: Thu, 15 Apr 21 08:11:13 -0700
