Computer Chess Club Archives

Subject: Re: 64-way Parallel FP Chip

Author: Robert Hyatt
Date: 15:04:35 10/15/03
On October 15, 2003 at 16:15:39, Vincent Diepeveen wrote:

>On October 15, 2003 at 13:22:10, Robert Hyatt wrote:
>
>>On October 15, 2003 at 12:21:17, Vincent Diepeveen wrote:
>>
>>>On October 15, 2003 at 10:41:37, Robert Hyatt wrote:
>>>
>>>Bob,
>>>
>>>go to www.cray.com and see how many instructions this thing can put through a
>>>clock :)
>>>
>>>Nowadays they are 1Ghz and have a 256KB cache too.
>>
>>First, we are talking about the C90.  It had no cache.  It executed one
>>instruction per clock.  As did/does the T90.  I have no idea what you are
>>looking at.  I am looking at the Cray Research publication CSM-0500-000.
>>I don't think you are going to find anything on the web site that contradicts
>>that.
>
>Hell, your memory again. It is so very bad. You misgambled again. Yes, I have
>little time to figure it out again, but I had already figured out what a Cray
>can and cannot do for DIEP, as I could get one or this Origin 3800.

Yes, it is always my memory.  Even when I have the hardware manual right beside
me I can't remember the numbers long enough to copy them from the printed page
to my machine.  I gave you a link to Cray's web site that shows that the C90
is rated by Cray at 1.5 GFLOPS per CPU.  At 250 MHz (4 nanoseconds per clock)
that is six operations per cycle.  Not 15.
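
The arithmetic behind that figure can be checked in a few lines. This is only a sketch that takes the two numbers stated above (1.5 GFLOPS peak per CPU, 250 MHz clock) at face value; the function and constant names are made up for the illustration.

```python
# Figures as stated in this post: 1.5 GFLOPS peak per CPU, 250 MHz clock.
PEAK_FLOPS_PER_CPU = 1.5e9
CLOCK_HZ = 250e6


def cycle_time_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def ops_per_cycle(peak_flops, clock_hz):
    """Peak floating-point operations completed per clock cycle."""
    return peak_flops / clock_hz


print(cycle_time_ns(CLOCK_HZ))                       # 4.0
print(ops_per_cycle(PEAK_FLOPS_PER_CPU, CLOCK_HZ))   # 6.0
```

Getting 15 operations per cycle out of the same clock would need a peak rating of 3.75 GFLOPS per CPU, which is not the figure quoted here.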


>
>A single C90 processor has a bunch of functional units.
>15 to be precise.

Yes, but you can't get 'em all busy at the same time.  Read the stuff at
the link I posted.


>
>Each of them can in parallel do an operation and ship it to the register.

No they can't.  There are only 8 vector registers.  Try again.


>
>So in contrast to the latest Crays, which can do 29, doing 15 is still
>awesome.

None are doing 29.


>
>So that's 15 instructions a clock.

And no it isn't.  It is six _operations_ a clock peak.  And an operation
is _not_ an instruction.
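
To make the instruction/operation distinction concrete, here is a toy timing model of the three-instruction vector example quoted further down in this thread. It is a sketch under simple assumptions, not a model of the real C90 pipeline: each instruction issues on its own cycle, delivers its first element result at the cycle given in that example (3, 4 and 8), and then delivers one result per cycle for a 64-element vector. All names are illustrative.

```python
VECTOR_LENGTH = 64  # elements per vector register

# (instruction, issue cycle, cycle of first result), timings from the quoted example
INSTRUCTIONS = [
    ("v0 = v1 + v2", 0, 3),
    ("v3 = v4 + v5", 1, 4),
    ("v6 = v0 * v3", 2, 8),
]


def instructions_issued(cycle):
    """How many instructions start in this cycle."""
    return sum(1 for _, issue, _ in INSTRUCTIONS if issue == cycle)


def results_completed(cycle):
    """How many element results (operations) finish in this cycle."""
    return sum(
        1
        for _, _, first in INSTRUCTIONS
        if first <= cycle < first + VECTOR_LENGTH
    )


CYCLES = range(100)
print(max(instructions_issued(c) for c in CYCLES))  # 1
print(max(results_completed(c) for c in CYCLES))    # 3
print(sum(results_completed(c) for c in CYCLES))    # 192
```

In this model no cycle ever issues more than one instruction, yet from cycle 8 onward three operations complete per cycle, and three instructions account for 192 operations in total.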

>
>No way to get around that.

Unless I try to do as you and make up stuff right and left.  But Cray's web
site doesn't lie.


>
>Most important is at what speed all this beautiful stuff gets done, and the
>answer to that is 8.5 nanoseconds.

Please stick to one machine.  We have been talking about the C90 that ran
Cray Blitz at 500K nodes per second.  8.5 nanoseconds was the speed of the
Cray XMP that we used to win the 1983 WCCC event.  The C90 runs at 4 nanoseconds
per clock.


>
>Most interesting thing is that there are no weak links in those processors.
>Each processor runs on and on. You had 16 of them.

Correct, finally...


>
>Each one of them delivered 0.95 Gflop.

Will you _please_ get on the calculator?  First you say 15 operations
per clock, now you say just under 1 Gigaflop which means you are changing
machines and clock speeds faster than Red Hat changes Linux.


>
>And you got with 16 of them only 500k nps.
>
>So you had 16 Gflop nearly to your avail and you got with that
>just 500k nps.

Do you understand the difference between a vectorized GigaFLOP and
a scalar operation?  Of course you don't...


>
>That's trivially showing Cray Blitz was not vectorized very well.

I never said it was.  Vectors work fine for parts of the evaluation.  For
the SEE (Swap()) code.  For move generation.  The search doesn't vectorize
at all.  Nor do many other things.  But before you can understand that,
you have to understand vector processing.  Which you don't.  So you can't.
And apparently you won't ever.
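
A toy contrast may help readers who have not worked with vector hardware. This is not Cray Blitz code; the tree, the weights and the function names are invented for the sketch. The first function is a sum of per-square terms whose iterations are independent of each other, which is the shape of loop a vector unit can stream through. The second is a plain fail-hard alpha-beta loop, where whether the next child is searched at all depends on the score that came back from the previous one.

```python
def eval_scalar_product(occupancy, weights):
    """Sum of per-square terms. No iteration depends on another, so the
    multiplies could be streamed through a vector unit."""
    return sum(o * w for o, w in zip(occupancy, weights))


def alpha_beta(node, depth, alpha, beta, evaluated):
    """Fail-hard negamax alpha-beta on a toy tree of nested lists.
    Leaves are ints scored for the side to move; `evaluated` records them."""
    if depth == 0 or not isinstance(node, list):
        evaluated.append(node)
        return node
    for child in node:
        score = -alpha_beta(child, depth - 1, -beta, -alpha, evaluated)
        if score >= beta:
            return beta      # cutoff: the remaining children are never searched
        if score > alpha:
            alpha = score    # later iterations depend on this update
    return alpha


evaluated = []
tree = [[3, 5], [2, 9]]  # hypothetical two-ply tree
best = alpha_beta(tree, 2, float("-inf"), float("inf"), evaluated)
print(best, evaluated)   # 3 [3, 5, 2]
```

The leaf scored 9 is never evaluated because the cutoff on its sibling happened first. That data-dependent control flow is what keeps the search loop from being expressed as one long vector operation.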




>Crafty on a plain x86, where one lookup to RAM is hellishly slow,
>already gets 1 MLN nps hands down.

So?  On a 1 GHz x86, the clock cycle time is 1 ns.  Way faster than the
Cray at executing scalar instructions.  That's no surprise to anyone that
knows what is going on...


>
>Look for example at: http://www.asc.edu/usermanual/C94.html to freshen up your
>memory.

I don't need to.  I have the C90 numbers nicely stored away mentally.  I ran
on it for years.

Of course flap those arms.  Obfuscate rather than clarify...  Etc.  It is
your mode of operation, after all.  For the rest of the logical folks here,
follow the link I posted previously and learn all about FLOPS and cycle
times and the like.  Here it is again for those so interested:

http://www.cray.com/craydoc/manuals/004-2182-002/html-004-2182-002/zfixed1qzdhueg.html#U8WKLCHRI
>
>>Your number is ridiculous.  The Cray has 8 address registers.  8 scalar
>>registers.  8 vector registers.  When an instruction addresses any of those
>>as a destination, that register is unavailable for several clock cycles.  No
>>way to issue 19 instructions.  With only 8 vector registers, there is no way
>>to issue more than 8 vector instructions, even if it _could_.
>>
>>Your comprehension is failing you.  Remember, I'm not guessing as you are,
>>I have actually programmed these things for 20 years.  I'll be happy to give
>>you a couple of names of software/hardware folks up there that will set you
>>straight, although I am sure you will argue with them as well.
>>
>>>
>>>But just imagine that they would not be able to do 29 instructions a clock but
>>>just 1 or 2.
>>
>>I don't have to imagine that.  I did it for 20 years.  The Cray 1 through the
>>T90 issued _one_ instruction per clock cycle.  Of course, once a single vector
>>instruction is issued, it executes for up to 64 clock cycles producing a new
>>floating point result every clock, and while this is going on, every cycle yet
>>another instruction can issue.  But _never_ more than one issue per cycle.
>>And I do mean _never_.  You might have several instructions busy at any one
>>cycle, but it only starts one new instruction every cycle _max_ and often less
>>than that.
>>
>>>
>>>Then any x86 processor blows them away at floating point :)
>>
>>You are stupid beyond belief.  I explained vector processing.  You are
>>still arguing "instructions per clock".  It is hopeless trying to explain
>>this to a brain-dead person that simply refuses to look something up and
>>read it carefully.
>>
>>You are _wrong_.  You are almost always _wrong_.  And you will _continue_ to
>>be wrong on this subject until you grasp vectors and the difference between
>>"operations" and "instructions".  Until then it is hopeless...
>>
>>
>>>
>>>>On October 15, 2003 at 10:01:44, Vincent Diepeveen wrote:
>>>>
>>>>>On October 15, 2003 at 09:32:10, Robert Hyatt wrote:
>>>>>
>>>>>>On October 14, 2003 at 17:29:18, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On October 14, 2003 at 16:18:28, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On October 14, 2003 at 14:29:36, Gerd Isenberg wrote:
>>>>>>>>
>>>>>>>>>On October 14, 2003 at 14:15:33, Vincent Diepeveen wrote:
>>>>>>>>>
>>>>>>>>>>On October 14, 2003 at 14:13:08, Gerd Isenberg wrote:
>>>>>>>>>>
>>>>>>>>>>>On October 14, 2003 at 10:07:10, Ricardo Gibert wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>http://www.wired.com/news/technology/0,1282,60791,00.html
>>>>>>>>>>>>
>>>>>>>>>>>>Can this be productively used in a chess program?
>>>>>>>>>>>
>>>>>>>>>>>I don't know, similar hardware resources may be more productive for chess, if
>>>>>>>>>>>implemented as hyperthreading devices. I guess it's a kind of further
>>>>>>>>>>>development of SSE and AltiVec technology. With huge register files
>>>>>>>>>>>(N * 64 * 64|128|256-bit?) and probably SIMD-wise integer instructions
>>>>>>>>>>>(including popcount?) and fast memory interface, I can imagine that it is
>>>>>>>>>>>useful for a lot of nice things, like some eval passes, e.g. a first square
>>>>>>>>>>>wise and a final scalar product pass. And fill-attack generation, e.g. square
>>>>>>>>>>>wise in all 16 directions with a specialized dumb fill routine.
>>>>>>>>>>>
>>>>>>>>>>>Gerd
>>>>>>>>>>
>>>>>>>>>>this is just floating point arrays.
>>>>>>>>>
>>>>>>>>>Aha, well, maybe a matter of interpretation.
>>>>>>>>>I haven't seen any instruction set yet.
>>>>>>>>>
>>>>>>>>>On the other hand, if float and double arithmetic becomes as fast (or faster) as
>>>>>>>>>integer, why not use it for eval purposes?
>>>>>>>>>
>>>>>>>>>Gerd
>>>>>>>>
>>>>>>>>
>>>>>>>>Correct.  We did this on the Cray.  FP was very fast there and it frees
>>>>>>>>up integer registers for addresses and array indices...
>>>>>>>
>>>>>>>That's of course true however at 16 processors of 100Mhz you reached 500k nodes
>>>>>>>a second with cray blitz.
>>>>>>>
>>>>>>>Each Cray processor can issue up to 29 instructions a cycle.
>>>>>>
>>>>>>I have no idea what you are talking about.  Each cray processor can issue
>>>>>>_one_ instruction per cycle.
>>>>>>
>>>>>>However, doing vector stuff, in one cycle the machine can do four memory
>>>>>>reads and two memory writes (8 byte words) per processor.  It can also do
>>>>>>multiple things in one cycle with vector chaining, but it never issues more
>>>>>>than one instruction per cycle per cpu.
>>>>>>
>>>>>>I don't know what data you are looking at, but it is wrong.
>>>>>>
>>>>>>>
>>>>>>>Crafty at a 1.6Ghz K7 which can issue up to 3 instructions a cycle gets 1
>>>>>>>million nodes a second.
>>>>>>>
>>>>>>>So something capable of 100M * 16 * 29 = 46.4G instructions a second you get 500k
>>>>>>>nps because it is a vector machine
>>>>>
>>>>>Bob cut the crap.
>>>>>
>>>>>If cray would execute 1 instruction a cycle then the processors would be
>>>>>10 times slower than any other solution.
>>>>
>>>>Vincent, wanna make a bet?  Any amount of money you care to put on it.
>>>>
>>>>The cray issues one instruction per cycle.  Of course, you have _no idea_
>>>>of what a vector machine does and how it does it, so you aren't going to
>>>>understand anything about the machine.  But one instruction per cycle per
>>>>processor is _it_.
>>>>
>>>>You can find this in any good Cray Reference.  I'll be happy to xerox a page
>>>>from the C90 hardware reference manual that gives this info.
>>>>
>>>>Next, do you understand the difference between an _instruction_ and an
>>>>_operation_?  Didn't think so.  The cray has a set of vector instructions
>>>>where _one_ instruction produces multiple results by operating on a vector.
>>>>But it can't _issue_ more than one instruction per cycle.  It is possible that
>>>>by issuing multiple consecutive instructions, you "chain" vector functional
>>>>units together and produce multiple _operations_ per cycle.  But _not_
>>>>multiple instructions.
>>>>
>>>>Why don't you try to talk about something you know something about, if there
>>>>is such a topic?  And stop trying to talk "cray" to someone that has actually
>>>>_used_ them for 20+ years?
>>>>
>>>>>
>>>>>Yet everyone loves crays because they are vector processors which can do up to
>>>>>29 instructions a cycle.
>>>>
>>>>Nope.
>>>>
>>>>One instruction per cycle.  Try this on for size:
>>>>
>>>>Cray Y-MP C90 System Programmer Reference Manual, CSM-0500-000
>>>>
>>>>"A fetch sequence begins immediately and transfers a block of instructions
>>>>from memory to an instruction buffer.  The issue sequence then selects the
>>>>instruction indicated by the program address (P) register, decodes it,
>>>>determines whether the required registers or functional units are available,
>>>>and if so, allows the instruction to be executed.
>>>>
>>>>As the instruction executes, the P register increments, causing a new
>>>>instruction to be selected from the instruction buffer."
>>>>
>>>>The above happens _once_ per processor cycle.
>>>>
>>>>Again, you don't understand what vector processing is all about.
>>>>
>>>>>
>>>>>Even a P5/100 would have been faster than a cray because it can do 2
>>>>>instructions a clock at 100Mhz.
>>>>
>>>>
>>>>So?  How long would it take that P5/100 to execute (say) a floating point
>>>>add?  The cray does one in 3 cycles.  But if it is a vector instruction,
>>>>after the first result pops out after 3 cycles, the next result pops out
>>>>one cycle later, and this continues until the vector has been completely
>>>>processed.
>>>>
>>>>Can your P5 do one floating add per cycle?  Didn't think so.  After you
>>>>issue several floating point vector instructions (here is an example):
>>>>
>>>>            v0     v1+v2
>>>>            v3     v4+v5
>>>>            v6     v0*v3
>>>>
>>>>After three cycles, we have three instructions being executed, one issued
>>>>per cycle.  After three cycles, the first v0 value is completed and a new
>>>>one is completed every cycle after that.  After 4 cycles, the first v3 value
>>>>is completed and one is completed every cycle after that.  After 8 cycles,
>>>>the first v6 value is completed and one every cycle after that.  From this
>>>>point forward, we are doing two floating adds and one floating multiply
>>>>every clock cycle.  Can your P5 do that?
>>>>
>>>>The cray was _not_ a fast scalar machine.  Again something you don't understand.
>>>>It _is_ one hell of a fast vector machine, if you would only look.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>You know that a cray can do 29 and i do. So cut this incredible nonsense right
>>>>>here.
>>>>
>>>>You are producing the nonsense.  I just quoted _directly_ from the C90
>>>>manual and that was the machine you were quoting for my 500K nodes per
>>>>second.
>>>>
>>>>
>>>>>
>>>>>If you would have vectorized cray blitz correctly it would have run of course
>>>>>faster than 500k nps. More like 5MLN nps at a 16 processor 100Mhz cray.
>>>>
>>>>You have no idea what "vectorized CB correctly" means, obviously, since you
>>>>don't have a clue what "vectorized" means as you have shown many times over
>>>>the past 8 years.
>>>>
>>>>Grow up and learn to understand before spouting nonsense.
>>>>
>>>>
>>>>>
>>>>>Thank you,
>>>>>Vincent
>>>>>
>>>>>>Again, you make up numbers that have nothing to do with reality.  A Cray
>>>>>>can issue one instruction per cycle.  The C90 I used for the ICCA DTS
>>>>>>article had a clock cycle time of 4.167 nanoseconds, the standard C90 clock
>>>>>>speed.  That is about 250 million instructions per second per processor.  With
>>>>>>16 processors, that is 4 billion, not your mythical 46.4 billion.  How about
>>>>>>you start writing about things you know something about, and stop making stuff
>>>>>>up about things you don't have a clue about?
>>>>>>
>>>>>>>
>>>>>>>Something capable of 4.8G instructions a second you get 1 MLN nps because it is a
>>>>>>>x86 processor.
>>>>>>
>>>>>>
>>>>>>Pure garbage calculations don't convince anybody of anything.