Computer Chess Club Archives

Subject: Re: 64-way Parallel FP Chip

Author: Robert Hyatt
Date: 10:22:10 10/15/03

On October 15, 2003 at 12:21:17, Vincent Diepeveen wrote:

>On October 15, 2003 at 10:41:37, Robert Hyatt wrote:
>
>Bob,
>
>go to www.cray.com and see how many instructions this thing can put through a
>clock :)
>
>Nowadays they are 1Ghz and have a 256KB cache too.

First, we are talking about the C90.  It had no cache.  It executed one
instruction per clock.  As did/does the T90.  I have no idea what you are
looking at.  I am looking at the Cray Research publication CSM-0500-000.
I don't think you are going to find anything on the web site that contradicts
that.

Your number is ridiculous.  The Cray has 8 address registers.  8 scalar
registers.  8 vector registers.  When an instruction addresses any of those
as a destination, that register is unavailable for several clock cycles.  No
way to issue 29 instructions.  With only 8 vector registers, there is no way
to issue more than 8 vector instructions, even if it _could_.

Your comprehension is failing you.  Remember, I'm not guessing as you are,
I have actually programmed these things for 20 years.  I'll be happy to give
you a couple of names of software/hardware folks up there that will set you
straight, although I am sure you will argue with them as well.

>
>But just imagine that they would not be able to do 29 instructions a clock but
>just 1 or 2.

I don't have to imagine that.  I did it for 20 years.  The Cray 1 through the
T90 issued _one_ instruction per clock cycle.  Of course, once a single vector
instruction is issued, it executes for up to 64 clock cycles producing a new
floating point result every clock, and while this is going on, every cycle yet
another instruction can issue.  But _never_ more than one issue per cycle.
And I do mean _never_.  You might have several instructions busy at any one
cycle, but it only starts one new instruction every cycle _max_ and often less
than that.
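
To make the issue-versus-execute distinction concrete, here is a minimal Python sketch. It is only an illustration, not a model of real Cray hardware: the function name vector_add_timeline is hypothetical, and the 64-element vector length and 3-cycle floating add latency are simply the figures mentioned in this thread, applied to an idealized, fully pipelined functional unit.

```python
def vector_add_timeline(vector_length=64, latency=3):
    """Return (instructions_issued, completion_cycles) for one vector add.

    Assumes a fully pipelined functional unit: the instruction issues at
    cycle 0, the first result appears `latency` cycles later, and one
    further result appears on every following cycle.
    """
    instructions_issued = 1  # a single vector instruction covers every element
    completion_cycles = [latency + i for i in range(vector_length)]
    return instructions_issued, completion_cycles


issued, cycles = vector_add_timeline()
print(f"instructions issued: {issued}")
print(f"results produced:    {len(cycles)}")
print(f"first result at cycle {cycles[0]}, last at cycle {cycles[-1]}")
```

Under these assumptions one issued instruction yields 64 results, which is why counting instructions per clock says little about a vector machine's floating point throughput.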

>
>Then any x86 processor blows them away at floating point :)

You are stupid beyond belief.  I explained vector processing.  You are
still arguing "instructions per clock".  It is hopeless trying to explain
this to a brain-dead person that simply refuses to look something up and
read it carefully.

You are _wrong_.  You are almost always _wrong_.  And you will _continue_ to
be wrong on this subject until you grasp vectors and the difference between
"operations" and "instructions".  Until then it is hopeless...
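
The difference between "operations" and "instructions" can be sketched with the three-instruction sequence quoted further down in this message (v0 = v1+v2, v3 = v4+v5, v6 = v0*v3). The Python below is a rough, hypothetical model of chaining, not a cycle-accurate simulator: the helper names are made up for illustration, the 3-cycle add latency and 64-element vectors come from the thread, and the 4-cycle multiply latency is an assumption chosen only so that the first v6 result lands on cycle 8 as described in the quoted example.

```python
VECTOR_LENGTH = 64      # elements per vector register, as stated in the thread
ADD_LATENCY = 3         # cycles to first add result, as stated in the thread
MUL_LATENCY = 4         # assumption chosen to reproduce the quoted cycle-8 result

# (destination, operand registers, latency), issued one per cycle in this order
program = [
    ("v0", ("v1", "v2"), ADD_LATENCY),
    ("v3", ("v4", "v5"), ADD_LATENCY),
    ("v6", ("v0", "v3"), MUL_LATENCY),
]


def first_result_cycles(program):
    """Cycle at which each destination register receives its first element."""
    first = {}
    for issue_cycle, (dest, sources, latency) in enumerate(program):
        # A chained operand is usable from the cycle its first element appears;
        # registers not written by this program are assumed ready at cycle 0.
        start = max([issue_cycle] + [first.get(src, 0) for src in sources])
        first[dest] = start + latency
    return first


def results_in_cycle(first, cycle):
    """Number of element results delivered in the given cycle."""
    return sum(1 for f in first.values() if f <= cycle < f + VECTOR_LENGTH)


first = first_result_cycles(program)
print("instructions issued:", len(program), "(one per cycle)")
print("first result cycles:", first)
print("results delivered in cycle 10:", results_in_cycle(first, 10))
```

In this model only three instructions are ever issued, one per cycle, yet once all three pipelines are full the machine delivers three floating point results in every cycle.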


>
>>On October 15, 2003 at 10:01:44, Vincent Diepeveen wrote:
>>
>>>On October 15, 2003 at 09:32:10, Robert Hyatt wrote:
>>>
>>>>On October 14, 2003 at 17:29:18, Vincent Diepeveen wrote:
>>>>
>>>>>On October 14, 2003 at 16:18:28, Robert Hyatt wrote:
>>>>>
>>>>>>On October 14, 2003 at 14:29:36, Gerd Isenberg wrote:
>>>>>>
>>>>>>>On October 14, 2003 at 14:15:33, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>On October 14, 2003 at 14:13:08, Gerd Isenberg wrote:
>>>>>>>>
>>>>>>>>>On October 14, 2003 at 10:07:10, Ricardo Gibert wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>http://www.wired.com/news/technology/0,1282,60791,00.html
>>>>>>>>>>
>>>>>>>>>>Can this be productively used in a chess program?
>>>>>>>>>
>>>>>>>>>I don't know, similar hardware resources may be more productive for chess, if
>>>>>>>>>implemented as hyperthreading devices. I guess it's a kind of further
>>>>>>>>>development of SSE and AltiVec technology. With huge register files
>>>>>>>>>(N * 64 * 64|128|256-bit?) and probably SIMD-wise integer instructions
>>>>>>>>>(including popcount?) and a fast memory interface, I can imagine that it is
>>>>>>>>>useful for a lot of nice things, like some eval passes, e.g. a first square-wise
>>>>>>>>>pass and a final scalar product pass. And fill-attack generation, e.g. square-wise
>>>>>>>>>in all 16 directions with a specialized dumb fill routine.
>>>>>>>>>
>>>>>>>>>Gerd
>>>>>>>>
>>>>>>>>this is just floating point arrays.
>>>>>>>
>>>>>>>Aha, well, that may be a matter of interpretation.
>>>>>>>I haven't seen any instruction set yet.
>>>>>>>
>>>>>>>On the other hand, if float and double arithmetic becomes as fast (or faster) as
>>>>>>>integer, why not use it for eval purposes?
>>>>>>>
>>>>>>>Gerd
>>>>>>
>>>>>>
>>>>>>Correct.  We did this on the Cray.  FP was very fast there and it frees
>>>>>>up integer registers for addresses and array indices...
>>>>>
>>>>>That's of course true however at 16 processors of 100Mhz you reached 500k nodes
>>>>>a second with cray blitz.
>>>>>
>>>>>Each Cray processor can issue up to 29 instructions a cycle.
>>>>
>>>>I have no idea what you are talking about.  Each cray processor can issue
>>>>_one_ instruction per cycle.
>>>>
>>>>however, doing vector stuff, in one cycle the machine can do four memory
>>>>reads and two memory writes (8 byte words) per processor.  It can also do
>>>>multiple things in one cycle with vector chaining, but it never issues more
>>>>than one instruction per cycle per cpu.
>>>>
>>>>I don't know what data you are looking at, but it is wrong.
>>>>
>>>>>
>>>>>Crafty at a 1.6Ghz K7 which can issue up to 3 instructions a cycle gets 1
>>>>>million nodes a second.
>>>>>
>>>>>So something capable of 100M * 16 * 29 = 46.4G instructions a second you get 500k
>>>>>nps because it is a vector machine
>>>
>>>Bob cut the crap.
>>>
>>>If cray would execute 1 instruction a cycle then the processors would be
>>>10 times slower than any other solution.
>>
>>Vincent, wanna make a bet?  Any amount of money you care to put on it.
>>
>>The cray issues one instruction per cycle.  Of course, you have _no idea_
>>of what a vector machine does and how it does it, so you aren't going to
>>understand anything about the machine.  But one instruction per cycle per
>>processor is _it_.
>>
>>You can find this in any good Cray Reference.  I'll be happy to xerox a page
>>from the C90 hardware reference manual that gives this info.
>>
>>Next, do you understand the difference between an _instruction_ and an
>>_operation_?  Didn't think so.  The cray has a set of vector instructions
>>where _one_ instruction produces multiple results by operating on a vector.
>>But it can't _issue_ more than one instruction per cycle.  It is possible that
>>by issuing multiple consecutive instructions, you "chain" vector functional
>>units together and produce multiple _operations_ per cycle.  But _not_
>>multiple instructions.
>>
>>Why don't you try to talk about something you know something about, if there
>>is such a topic?  And stop trying to talk "cray" to someone that has actually
>>_used_ them for 20+ years?
>>
>>>
>>>Yet everyone loves crays because they are vector processors which can do up to
>>>29 instructions a cycle.
>>
>>Nope.
>>
>>One instruction per cycle.  Try this on for size:
>>
>>Cray Y-MP C90 System Programmer Reference Manual, CSM-0500-000
>>
>>"A fetch sequence begins immediately and transfers a block of instructions
>>from memory to an instruction buffer.  The issue sequence then selects the
>>instruction indicated by the program address (P) register, decodes it,
>>determines whether the required registers or functional units are available,
>>and if so, allows the instruction to be executed.
>>
>>As the instruction executes, the P register increments, causing a new
>>instruction to be selected from the instruction buffer."
>>
>>The above happens _once_ per processor cycle.
>>
>>Again, you don't understand what vector processing is all about.
>>
>>>
>>>Even a P5/100 would have been faster than a cray because it can do 2
>>>instructions a clock at 100Mhz.
>>
>>
>>So?  How long would it take that P5/100 to execute (say) a floating point
>>add?  The cray does one in 3 cycles.  But if it is a vector instruction,
>>after the first result pops out after 3 cycles, the next result pops out
>>one cycle later, and this continues until the vector has been completely
>>processed.
>>
>>Can your P5 do one floating add per cycle?  Didn't think so.  After you
>>issue several floating point vector instructions (here is an example):
>>
>>            v0     v1+v2
>>            v3     v4+v5
>>            v6     v0*v3
>>
>>After three cycles, we have three instructions being executed, one issued
>>per cycle.  After three cycles, the first v0 value is completed and a new
>>one is completed every cycle after that.  After 4 cycles, the first v3 value
>>is completed and one is completed every cycle after that.  After 8 cycles,
>>the first v6 value is completed and one every cycle after that.  From this
>>point forward, we are doing two floating adds and one floating multiply
>>every clock cycle.  Can your P5 do that?
>>
>>The cray was _not_ a fast scalar machine.  Again something you don't understand.
>>It _is_ one hell of a fast vector machine, if you would only look.
>>
>>
>>
>>
>>
>>
>>>
>>>You know that a cray can do 29 and i do. So cut this incredible nonsense right
>>>here.
>>
>>You are producing the nonsense.  I just quoted _directly_ from the C90
>>manual and that was the machine you were quoting for my 500K nodes per
>>second.
>>
>>
>>>
>>>If you would have vectorized cray blitz correctly it would have run of course
>>>faster than 500k nps. More like 5MLN nps at a 16 processor 100Mhz cray.
>>
>>you have no idea what "vectorized CB correctly" means, obviously, since you
>>don't have a clue what "vectorized" means as you have shown many times over
>>the past 8 years.
>>
>>Grow up and learn to understand before spouting nonsense.
>>
>>
>>>
>>>Thank you,
>>>Vincent
>>>
>>>>Again, you make up numbers that have nothing to do with reality.  A Cray
>>>>can issue one instruction per cycle.  The C90 I used for the ICCA DTS
>>>>article had a clock cycle time of 4.167 nanoseconds, the standard C90 clock
>>>>speed.  That is about 240 million instructions per second per processor.  With
>>>>16 processors, that is just under 4 billion, not your mythical 46.4 billion.  How about
>>>>you start writing about things you know something about, and stop making stuff
>>>>up about things you don't have a clue about?
>>>>
>>>>>
>>>>>Something capable of 4.8G instructions a second you get 1 MLN nps because it is an
>>>>>x86 processor.
>>>>
>>>>
>>>>Pure garbage calculations don't convince anybody of anything.
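
For anyone who wants to check the arithmetic behind the two competing peak figures, a short Python sketch follows. The constant names are made up for illustration. The inputs are the numbers quoted in this thread: a 4.167 ns C90 cycle time, 16 processors, and one instruction issued per cycle per processor on one side, and the disputed 100 MHz times 16 processors times 29 instructions per cycle on the other.

```python
CLOCK_PERIOD_NS = 4.167   # C90 cycle time quoted in the thread
PROCESSORS = 16
ISSUE_PER_CYCLE = 1       # one instruction issued per cycle per processor

clock_hz = 1e9 / CLOCK_PERIOD_NS
per_cpu = clock_hz * ISSUE_PER_CYCLE
total = per_cpu * PROCESSORS

print(f"clock rate: {clock_hz / 1e6:.0f} MHz")
print(f"peak issue rate per processor: {per_cpu / 1e6:.0f} million/s")
print(f"peak issue rate, 16 processors: {total / 1e9:.2f} billion/s")

# The disputed figure from the thread, for comparison
disputed = 100e6 * 16 * 29
print(f"disputed figure: {disputed / 1e9:.1f} billion/s")
```

The cycle time works out to roughly 240 MHz, so the peak issue rate is about 240 million instructions per second per processor and about 3.84 billion across 16 processors, against 46.4 billion for the disputed figure. These are peak instruction issue rates only; they say nothing about how many vector operations each instruction performs.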
Last modified: Thu, 15 Apr 21 08:11:13 -0700