Author: Robert Hyatt
Date: 13:16:06 10/15/03
On October 15, 2003 at 13:37:08, Eugene Nalimov wrote:

>On October 15, 2003 at 13:33:36, Eugene Nalimov wrote:
>
>>On October 15, 2003 at 13:26:11, Robert Hyatt wrote:
>>
>>>On October 15, 2003 at 11:50:10, Anthony Cozzie wrote:
>>>
>>>>On October 15, 2003 at 10:51:29, Robert Hyatt wrote:
>>>>
>>>>>
>>>>>I blew one bit of the previous calculation. The C90 is a "super-scalar" sort
>>>>>of vector machine. Where I said "one floating add per cycle" change that to
>>>>>two. A single vector instruction does _two_ operations per cycle, not one, and
>>>>>I had simply failed to note that. That was the main change from the older
>>>>>X-MP and Y-MP, that was introduced on the C90. Obviously it makes vector
>>>>>performance 2x faster even without the clock speed improvement. IE for my
>>>>>example:
>>>>>
>>>>>    v0  v1+v2
>>>>>    v3  v4+v5
>>>>>    v6  v0*v3
>>>>>
>>>>>that code will produce _six_ results per cycle, once the chained vector
>>>>>pipeline is filled. Not the _three_ I had given.
>>>>>
>>>>>_that_ is why the Cray buries the PC in _any_ program that can use vectors.
>>>>>Even though the C90 only runs at 250 mhz. The T90 runs that up to 500mhz,
>>>>>and the Cray-3 doubled it again to 1ghz. But all mhz/ghz are _not_ created
>>>>>"equal" for those that understand vector operations.
>>>>>
>>>>>The C90 is a 250mhz machine, not the 100 Vincent pulls from you-know-where.
>>>>>But no 2500mhz 80x86 can produce 6 64-bit IEEE floating point operations
>>>>>every 4 nanoseconds.
>>>>>
>>>>>I don't know how to explain it better to someone that simply doesn't have a
>>>>>single scintilla of background on understanding the concept of "a vector
>>>>>machine."
>>>>
>>>>What about P4 with SSE2?
>>>>
>>>>According to my P4 optimization manual, P4 can do 2 DP adds in parallel, with
>>>>latency = 4. I *think* that the SSE2 ALUs in P4 are fully pipelined, so that
>>>>means 4 FP ops/clock. Obviously it can only do this on vectorized data, but the
>>>>same constraint applies to the cray. Unfortunately, P4 was built to do vectorized
>>>>multimedia-ish stuff, not computer chess :(
>>>>
>>>>anthony
>>>
>>>The P4 can't do 4 ops/clock. Look at the FP processor architecture. It is
>>>stack-based. You execute an instruction that says (say) add the top of the
>>>stack and the contents of memory address X. Or whatever. And you are busy
>>>for a while.
>>>
>>>Just compare the peak FLOPS rating for a P4 vs a Cray C90, which is the machine
>>>we are talking about. The T90 is 4x faster but that's not the machine of the
>>>early 90's we are talking about with respect to CB.
>>>
>>>The C90 easily sustains 8 floating point operations per cycle in vector mode.
>>>On one processor. There are 16 of 'em. No PC in the world can sustain anywhere
>>>near 8 * 16 * 250,000,000 FLOPS.
>>
>>P4 (and AMD64) has 8 128-bit SSE2 registers that can be treated (among other
>>things) as 4 32-bit floats or 2 64-bit floats. You can do some operations on
>>those registers in parallel, for example you can add two float vectors of length
>>2 using one instruction. I am not sure if current P4 implementation performs
>>that addition in one cycle (that is definitely not so for Opteron/AMD64), but
>>nothing in theory prevents this.
>>
>>There are a lot of limitations, though. Main one -- you cannot specify stride when
>>loading vector from memory, it should always be contiguous.
>>
>>Thanks,
>>Eugene
>
>PS. Cray has vector registers that can keep up to 64 doubles, not just 2 doubles
>as P4/AMD64. It could load or store entire vector register using one
>instruction. And -- even more important -- it has matching memory bandwidth, and
>its memory subsystem scales well.
>
>Thanks,
>Eugene

Cray vector registers have been 128 words (8-byte words) long since the C90 came
along. Up through the YMP there were only 64 8-byte words per register. It
_might_ be that the YMP started the 128-word vectors, now that I think about it,
but I am not sure I still have an old YMP manual around.
I think the C90 is now the oldest thing I have.
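
For anyone who wants to check the arithmetic behind the figures quoted in this thread, here is a minimal sketch. It uses only numbers stated above (250 MHz clock, 6 or 8 results per cycle, 16 processors, 128-word vector registers). The constant and function names are made up for illustration and do not come from any Cray documentation, and the list-based model only shows what the three chained instructions compute, not how the hardware pipelines them.

```python
# Figures as quoted in the thread; all names here are illustrative assumptions.
C90_CLOCK_HZ = 250_000_000    # 250 MHz, i.e. one cycle every 4 ns
C90_PROCESSORS = 16           # processors in a full C90, per the post
C90_OPS_PER_CYCLE = 8         # sustained vector-mode figure claimed in the post
C90_VECTOR_WORDS = 128        # 8-byte words per vector register, per the post


def peak_flops(ops_per_cycle, clock_hz, processors=1):
    """Peak rate = operations per cycle x cycles per second x processors."""
    return ops_per_cycle * clock_hz * processors


def chained_example(v1, v2, v4, v5):
    """Elementwise model of: v0 = v1+v2, v3 = v4+v5, v6 = v0*v3.

    Each element position yields three results (two adds, one multiply).
    The post's claim is that the C90 retires two element positions of each
    instruction per cycle once the chain is full, giving six results per cycle.
    """
    v0 = [a + b for a, b in zip(v1, v2)]
    v3 = [a + b for a, b in zip(v4, v5)]
    v6 = [a * b for a, b in zip(v0, v3)]
    return v6


if __name__ == "__main__":
    cycle_ns = 1e9 / C90_CLOCK_HZ
    print("cycle time (ns):", cycle_ns)
    print("chained example, one CPU (FLOPS):", peak_flops(6, C90_CLOCK_HZ))
    print("full machine (FLOPS):",
          peak_flops(C90_OPS_PER_CYCLE, C90_CLOCK_HZ, C90_PROCESSORS))
    # One full vector register's worth of operands per input.
    ones = [1.0] * C90_VECTOR_WORDS
    print("first element of v6:", chained_example(ones, ones, ones, ones)[0])
```

With these inputs, the chained example works out to 1.5 billion results per second on one processor, and the 8 * 16 * 250,000,000 product is 32 billion per second for the whole machine.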
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.