Computer Chess Club Archives


Subject: Re: 64-way Parallel FP Chip

Author: Robert Hyatt

Date: 07:41:37 10/15/03



On October 15, 2003 at 10:01:44, Vincent Diepeveen wrote:

>On October 15, 2003 at 09:32:10, Robert Hyatt wrote:
>
>>On October 14, 2003 at 17:29:18, Vincent Diepeveen wrote:
>>
>>>On October 14, 2003 at 16:18:28, Robert Hyatt wrote:
>>>
>>>>On October 14, 2003 at 14:29:36, Gerd Isenberg wrote:
>>>>
>>>>>On October 14, 2003 at 14:15:33, Vincent Diepeveen wrote:
>>>>>
>>>>>>On October 14, 2003 at 14:13:08, Gerd Isenberg wrote:
>>>>>>
>>>>>>>On October 14, 2003 at 10:07:10, Ricardo Gibert wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>http://www.wired.com/news/technology/0,1282,60791,00.html
>>>>>>>>
>>>>>>>>Can this be productively used in a chess program?
>>>>>>>
>>>>>>>I don't know, similar hardware resources may be more productive for chess, if
>>>>>>>implemented as hyperthreading devices. I guess it's a kind of further
>>>>>>>development of SSE and AltiVec technology. With huge register files
>>>>>>>(N * 64 * 64|128|256-bit?) and probably SIMD-wise integer instructions
>>>>>>>(including popcount?) and fast memory interface, I can imagine that it is
>>>>>>>useful for a lot of nice things, like some eval passes, e.g. a first square
>>>>>>>wise and a final scalar product pass. And fill-attack generation, e.g. square
>>>>>>>wise in all 16 directions with a specialized dumb fill routine.
>>>>>>>
>>>>>>>Gerd
>>>>>>
>>>>>>this is just floating point arrays.
>>>>>
>>>>>Aha, well may be a matter of interpretation.
>>>>>I haven't seen any instruction set yet.
>>>>>
>>>>>On the other hand, if float and double arithmetic becomes as fast (or faster) as
>>>>>integer, why not use it for eval purposes?
>>>>>
>>>>>Gerd
>>>>
>>>>
>>>>Correct.  We did this on the Cray.  FP was very fast there and it frees
>>>>up integer registers for addresses and array indices...
>>>
>>>That's of course true however at 16 processors of 100Mhz you reached 500k nodes
>>>a second with cray blitz.
>>>
>>>Each Cray processor can issue up to 29 instructions a cycle.
>>
>>I have no idea what you are talking about.  Each cray processor can issue
>>_one_ instruction per cycle.
>>
>>however, doing vector stuff, in one cycle the machine can do four memory
>>reads and two memory writes (8 byte words) per processor.  It can also do
>>multiple things in one cycle with vector chaining, but it never issues more
>>than one instruction per cycle per cpu.
>>
>>I don't know what data you are looking at, but it is wrong.
>>
>>>
>>>Crafty at a 1.6Ghz K7 which can issue up to 3 instructions a cycle gets 1
>>>million nodes a second.
>>>
>>>So something capable of 100M * 16 * 29 = 46.4G instructions a second you get 500k
>>>nps because it is a vector machine
>
>Bob cut the crap.
>
>If cray would execute 1 instruction a cycle then the processors would be
>10 times slower than any other solution.

Vincent, wanna make a bet?  Any amount of money you care to put on it.

The cray issues one instruction per cycle.  Of course, you have _no idea_
of what a vector machine does and how it does it, so you aren't going to
understand anything about the machine.  But one instruction per cycle per
processor is _it_.

You can find this in any good Cray Reference.  I'll be happy to xerox a page
from the C90 hardware reference manual that gives this info.

Next, do you understand the difference between an _instruction_ and an
_operation_?  Didn't think so.  The cray has a set of vector instructions
where _one_ instruction produces multiple results by operating on a vector.
But it can't _issue_ more than one instruction per cycle.  It is possible that
by issuing multiple consecutive instructions, you "chain" vector functional
units together and produce multiple _operations_ per cycle.  But _not_
multiple instructions.
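
The instruction/operation distinction above can be sketched as a simple count. This is only a toy model: the mnemonics `fadd` and `vfadd` and the vector length of 64 are assumptions chosen for illustration, and loop overhead in the scalar case is ignored.

```python
def count_instructions_and_operations(stream):
    """Count issued instructions and element operations in an instruction stream.

    `stream` is a list of (mnemonic, elements) pairs, where `elements` is the
    number of results that single instruction produces.
    """
    instructions = len(stream)
    operations = sum(elements for _, elements in stream)
    return instructions, operations


VL = 64  # hypothetical vector length, chosen only for illustration

# Scalar code: one add instruction per element (loop overhead ignored).
scalar_loop = [("fadd", 1)] * VL
# Vector code: a single instruction that operates on the whole vector.
vector_code = [("vfadd", VL)]

print(count_instructions_and_operations(scalar_loop))  # (64, 64)
print(count_instructions_and_operations(vector_code))  # (1, 64)
```

Both streams do the same 64 operations, but the vector version needs only one issue slot to do it.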

Why don't you try to talk about something you know something about, if there
is such a topic?  And stop trying to talk "cray" to someone that has actually
_used_ them for 20+ years?

>
>Yet everyone loves crays because they are vector processors which can do up to
>29 instructions a cycle.

Nope.

One instruction per cycle.  Try this on for size:

Cray Y-MP C90 System Programmer Reference Manual, CSM-0500-000

"A fetch sequence begins immediately and transfers a block of instructions
from memory to an instruction buffer.  The issue sequence then selects the
instruction indicated by the program address (P) register, decodes it,
determines whether the required registers or functional units are available,
and if so, allows the instruction to be executed.

As the instruction executes, the P register increments, causing a new
instruction to be selected from the instruction buffer."

The above happens _once_ per processor cycle.

Again, you don't understand what vector processing is all about.

>
>Even a P5/100 would have been faster than a cray because it can do 2
>instructions a clock at 100Mhz.


So?  How long would it take that P5/100 to execute (say) a floating point
add?  The cray does one in 3 cycles.  But if it is a vector instruction,
after the first result pops out after 3 cycles, the next result pops out
one cycle later, and this continues until the vector has been completely
processed.

Can your P5 do one floating add per cycle?  Didn't think so.  After you
issue several floating point vector instructions (here is an example):

            v0     v1+v2
            v3     v4+v5
            v6     v0*v3

After three cycles, we have three instructions being executed, one issued
per cycle.  After three cycles, the first v0 value is completed and a new
one is completed every cycle after that.  After 4 cycles, the first v3 value
is completed and one is completed every cycle after that.  After 8 cycles,
the first v6 value is completed and one every cycle after that.  From this
point forward, we are doing two floating adds and one floating multiply
every clock cycle.  Can your P5 do that?
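
The timing described above can be reproduced with a small model. This is a rough sketch, not a Cray simulator: it assumes one instruction issued per cycle, a 3-cycle floating add, and a 4-cycle multiply latency (the 4 is inferred from the cycle counts quoted above, not taken from a manual). Functional-unit conflicts and memory traffic are ignored, and the vector length of 64 is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class VectorOp:
    dest: str
    sources: tuple
    latency: int  # cycles from first operands available to first result


def first_result_cycles(program):
    """Return {dest: cycle at which its first element completes}."""
    first = {}
    for issue_cycle, op in enumerate(program):  # one instruction issued per cycle
        # A chained instruction cannot start before its sources deliver
        # their first element; otherwise it starts when it is issued.
        chained = [first[s] for s in op.sources if s in first]
        start = max([issue_cycle] + chained)
        first[op.dest] = start + op.latency
    return first


def results_per_cycle(first, cycle, vector_length):
    """Number of vector results completed during the given cycle."""
    return sum(1 for f in first.values() if f <= cycle < f + vector_length)


program = [
    VectorOp("v0", ("v1", "v2"), 3),  # v0 = v1 + v2
    VectorOp("v3", ("v4", "v5"), 3),  # v3 = v4 + v5
    VectorOp("v6", ("v0", "v3"), 4),  # v6 = v0 * v3, chained to both adds
]

first = first_result_cycles(program)
print(first)                            # {'v0': 3, 'v3': 4, 'v6': 8}
print(results_per_cycle(first, 8, 64))  # 3
```

Under these assumptions the model gives first results at cycles 3, 4 and 8, and from cycle 8 on three results per cycle (two adds and one multiply) from only three issued instructions.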

The cray was _not_ a fast scalar machine.  Again something you don't understand.
It _is_ one hell of a fast vector machine, if you would only look.






>
>You know that a cray can do 29 and i do. So cut this incredible nonsense right
>here.

You are producing the nonsense.  I just quoted _directly_ from the C90
manual and that was the machine you were quoting for my 500K nodes per
second.


>
>If you would have vectorized cray blitz correctly it would have run of course
>faster than 500k nps. More like 5MLN nps at a 16 processor 100Mhz cray.

You have no idea what "vectorized CB correctly" means, obviously, since you
don't have a clue what "vectorized" means as you have shown many times over
the past 8 years.

Grow up and learn to understand before spouting nonsense.


>
>Thank you,
>Vincent
>
>>Again, you make up numbers that have nothing to do with reality.  A Cray
>>can issue one instruction per cycle.  The C90 I used for the ICCA DTS
>>article had a clock cycle time of 4.167 nanoseconds, the standard C90 clock
>>speed.  That is about 250 million instructions per second per processor.  With
>>16 processors, that is 4 billion, not your mythical 46.4 billion.  How about
>>you start writing about things you know something about, and stop making stuff
>>up about things you don't have a clue about?
>>
>>>
>>>Something capable of 4.8G instructions a second you get 1 MLN nps because it is a
>>>x86 processor.
>>
>>
>>Pure garbage calculations don't convince anybody of anything.
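
The competing peak-rate figures in this exchange are simple products of clock rate, instructions issued per cycle, and processor count. A minimal sketch of that arithmetic, using only numbers quoted in the thread (the function name is made up for illustration):

```python
def peak_issue_rate(clock_hz, instructions_per_cycle, processors):
    """Peak instructions issued per second across all processors."""
    return clock_hz * instructions_per_cycle * processors


# C90: 4.167 ns cycle time, one instruction per cycle, 16 processors.
c90_clock_hz = 1e9 / 4.167
c90_total = peak_issue_rate(c90_clock_hz, 1, 16)

# The disputed figure: 100 MHz, 29 instructions per cycle, 16 processors.
claimed_total = peak_issue_rate(100e6, 29, 16)

# 1.6 GHz K7 issuing up to 3 instructions per cycle, one processor.
k7_total = peak_issue_rate(1.6e9, 3, 1)

for name, rate in [("C90", c90_total), ("claimed", claimed_total), ("K7", k7_total)]:
    print(f"{name}: {rate / 1e9:.2f} G instructions/second")
```

With a 4.167 ns cycle the per-processor rate comes out near 240 million per second, so the "about 250 million" and "4 billion" figures above are rounded up slightly; the 16-processor total is roughly 3.84 billion.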




Last modified: Thu, 15 Apr 21 08:11:13 -0700
