Computer Chess Club Archives


Subject: Re: P4?

Author: Robert Hyatt

Date: 10:26:11 10/15/03


On October 15, 2003 at 11:50:10, Anthony Cozzie wrote:

>On October 15, 2003 at 10:51:29, Robert Hyatt wrote:
>
>>
>>I blew one bit of the previous calculation.  The C90 is a "super-scalar" sort
>>of vector machine.  Where I said "one floating add per cycle" change that to
>>two.  A single vector instruction does _two_ operations per cycle, not one, and
>>I had simply failed to note that.  That was the main change from the older
>>X-MP and Y-MP, that was introduced on the C90.  Obviously it makes vector
>>performance 2x faster even without the clock speed improvement.  IE for my
>>example:
>>
>>      v0 = v1+v2
>>      v3 = v4+v5
>>      v6 = v0*v3
>>
>>that code will produce _six_ results per cycle, once the chained vector
>>pipeline is filled.  Not the _three_ I had given.
>>
>>_that_ is why the Cray buries the PC in _any_ program that can use vectors.
>>Even though the C90 only runs at 250 MHz.  The T90 runs that up to 500 MHz,
>>and the Cray-3 doubled it again to 1 GHz.  But all MHz/GHz are _not_ created
>>"equal" for those that understand vector operations.
>>
>>The C90 is a 250 MHz machine, not the 100 Vincent pulls from you-know-where.
>>But no 2500 MHz 80x86 can produce 6 64-bit IEEE floating point operations
>>every 4 nanoseconds.
>>
>>I don't know how to explain it better to someone that simply doesn't have a
>>single scintilla of background on understanding the concept of "a vector
>>machine."
>
>What about P4 with SSE2?
>
>According to my P4 optimization manual, P4 can do 2 DP adds in parallel, with
>latency = 4. I *think* that the SSE2 ALUs in P4 are fully pipelined, so that
>means 4 FP ops/clock.  Obviously it can only do this on vectorized data, but the
>same constraint applies to the Cray. Unfortunately, P4 was built to do vectorized
>multimedia-ish stuff, not computer chess :(
>
>anthony

The P4 can't do 4 ops/clock.  Look at the FP processor architecture: it is
stack-based.  You execute an instruction that says (say) add the top of the
stack to the contents of memory address X, and the unit is busy for a while
before it can take the next one.

Just compare the peak FLOPS rating for a P4 vs a Cray C90, which is the machine
we are talking about.  The T90 is 4x faster, but that's not the early-'90s
machine we are talking about with respect to CB.

The C90 easily sustains 8 floating point operations per cycle in vector mode.
On one processor.  There are 16 of 'em.  No PC in the world can sustain anywhere
near 8 * 16 * 250,000,000 FLOPS.
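
To put a number on that product, here is a minimal back-of-the-envelope sketch
in Python.  It uses only the three figures given in this post (8 operations per
cycle, 16 processors, 250 MHz); the constant and function names are made up for
illustration, and the result is a peak figure from simple multiplication, not a
measured benchmark.

```python
# Rough peak-rate arithmetic for the C90 figures quoted in this post.
# All names here are illustrative; only the three numbers come from the post.

CLOCK_HZ = 250_000_000   # C90 clock rate: 250 MHz
FLOPS_PER_CYCLE = 8      # floating point ops per cycle, one processor, vector mode
PROCESSORS = 16          # processors in a full C90


def peak_flops(flops_per_cycle: int, processors: int, clock_hz: int) -> int:
    """Peak rate = ops per cycle * number of processors * cycles per second."""
    return flops_per_cycle * processors * clock_hz


cycle_ns = 1e9 / CLOCK_HZ                                  # length of one cycle
per_cpu = peak_flops(FLOPS_PER_CYCLE, 1, CLOCK_HZ)         # one processor
total = peak_flops(FLOPS_PER_CYCLE, PROCESSORS, CLOCK_HZ)  # whole machine

print(f"cycle time:     {cycle_ns:.0f} ns")
print(f"one processor:  {per_cpu / 1e9:.0f} GFLOPS")
print(f"16 processors:  {total / 1e9:.0f} GFLOPS")
```

So the product works out to 2 GFLOPS per processor and 32 GFLOPS for the whole
machine, with one cycle lasting 4 nanoseconds.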




Last modified: Thu, 15 Apr 21 08:11:13 -0700
