Computer Chess Club Archives


Subject: Re: ASCI White vs. Deep Blue

Author: Vincent Diepeveen

Date: 15:38:38 09/24/01


On September 23, 2001 at 22:36:38, Robert Hyatt wrote:

>On September 23, 2001 at 18:20:30, Vincent Diepeveen wrote:
>
>>On September 23, 2001 at 15:30:08, Lonnie Cook wrote:
>>
>>>* It weighs 106 tons
>>>
>>>* costs 110M for the unit itself (doesn't include the ungodly sum to run it
>>>every day)
>>>
>>>* Has 8,192 IBM Power3 processors
>>
>>>* 12.3 trillion ops per sec.
>>>
>>>* took 28 tractor-trailer trucks to deliver
>>>
>>>this was the part that astounded me. It said it was 1,000 X's faster than Deep
>>>Blue!!
>>>
>>>so we're talking about a machine that in theory could do 200,000,000,000 nps!!
>>
>>Noop.
>>
>>IBM power3 processors. i do not know what speed they run at. Let's guess
>>they run at 375Mhz. Hehe , a cheated guess kind of.
>>
>>Now i have some numbers on these processors, but those are a few years old
>>of course. These processors suck bigtime of course. NO one wants to run
>>on 375Mhz processors nowadays. But well let's assume that at a stupid
>>cluster which ASCI white is, that you can get a decent speedup.
>>
>>Now how fast do i run at 1 node? Well that's like 15k nodes a second.
>
>That math is bad.  I'll "race" you using any PIV of your choice, me using
>an 800mhz 21264 of my choice.  And my lowly 800mhz processor will toast your
>doors off.

I'll take the bet, not with a P4 but with a 1.4GHz K7.
I can already prove, by a kind of induction, that for DIEP the
K7 is faster.

The much-praised 21164 at 633MHz is for DIEP the same as
a PII clocked at 380MHz.

A PIII is 17.3% faster than that.
An Athlon is another 7% faster than the PIII.
An MP Athlon is a few % faster than an Athlon.

I'll equip the 1.4GHz K7 MP with DDR RAM, so that the memory
isn't the weak link either.

Now the K7 can do at most 3 instructions a clock and comes pretty
close to that.

The 21264 has a way longer stage but can do 4 instructions a clock.

So on paper a 21264 can be *at most* 33% faster per clock than a
processor doing 3 instructions a clock.
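
A rough sketch of that on-paper comparison, assuming peak throughput is
simply clock times instructions per clock. It ignores mispredictions,
memory and compiler quality, so it is an upper bound only. The function
name is made up for illustration; the clocks and issue widths are the
ones mentioned in this thread.

```python
def peak_minstr_per_sec(clock_mhz, instr_per_clock):
    """Peak millions of instructions per second: clock x issue width."""
    return clock_mhz * instr_per_clock

# Per clock: a 4-wide processor versus a 3-wide one.
per_clock_gain = 4 / 3 - 1          # about 0.33, i.e. at most 33% faster

# With the clocks from the bet: 800MHz 21264 versus 1.4GHz K7.
alpha_peak = peak_minstr_per_sec(800, 4)    # 3200
k7_peak = peak_minstr_per_sec(1400, 3)      # 4200

print(f"21264 per-clock advantage: {per_clock_gain:.0%}")
print(f"peak ratio K7/21264: {k7_peak / alpha_peak:.2f}")
```

This only compares theoretical peaks. How close each processor gets to
its peak on a real chess program is exactly what the bet is about.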

However the 21264 has some bad habits:
  a) generating assembly for it is hellishly difficult, though i don't
     doubt that DEC did a good job here.
  b) the penalty for a misprediction is *huge*
  c) the 21264 does *not* have more BTB tables than the K7 MP.

Now you'll claim Crafty is a factor 2 faster on it; that only means you
designed your program wrong for 32-bit processors.

Now the Alpha 21264 was a pretty good design compared to other processors,
according to major experts. Too bad that we won't see any Alpha processors
anymore, though perhaps that's good as well, because the thing always
sucked bigtime for me.

A 1.4GHz K7 MP is of course going to blow away a 375MHz IBM
processor.

To the left, to the right, top and bottom.

Add to that the huge parallel losses, and that cluster communication
won't soon make up for the missing MHz, and you know how bigtime you're
dicked with this machine.

A dual 1.4GHz: 2 x 1.4GHz = 2.8GHz.
To get to 2.8GHz with 375MHz processors you need about 7.5 of them.

However, considering parallel loss and loss over the network, you need
more like 64 processors or so to make up for that.
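
The same arithmetic as a small sketch. It assumes clock rate is the only
thing that counts, which is a simplification, and the efficiency figure
is just what the 64-processor guess would imply, not a measurement. The
helper name is invented for illustration.

```python
def processors_needed(target_mhz, per_cpu_mhz, efficiency=1.0):
    """CPUs at per_cpu_mhz needed to match target_mhz of aggregate clock.

    efficiency is the fraction of each CPU that does useful work after
    parallel and network losses (1.0 means no loss at all).
    """
    return target_mhz / (per_cpu_mhz * efficiency)

dual_k7_mhz = 2 * 1400                            # 2.8GHz aggregate

# No losses at all: roughly 7.5 processors.
ideal = processors_needed(dual_k7_mhz, 375)

# Efficiency implied by guessing that 64 processors are really needed.
implied_efficiency = ideal / 64

print(f"ideal count: {ideal:.2f}")
print(f"implied efficiency at 64 CPUs: {implied_efficiency:.1%}")
```

So the jump from 7.5 to 64 processors amounts to assuming that only a bit
more than a tenth of each processor ends up doing useful work.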

>Don't just assume that 375mhz is bad.  The PPC is _not_ a bad machine. I
>have run on SP's...

You designed a 64-bit program!

I do not know which application they planned to run on this thing, but
obviously a good programmer can easily do the same on a dual 1.4GHz K7 MP.

Most likely they use some kind of badly programmed thing which works
correctly, on the assumption that such particle calculations only need
to be approximated, simply accepting the error which happens because you
can't profit from shared memory!

So probably the whole model used on it sucks bigtime.

In molecular physics some 'leading' scientists used a linear approximation
for matrix calculations. Weird behaviour of the model was then explained
by some weird lemmas. Recently, however, a bright doctor, Sieds Zijlstra,
showed that with a better program, one without bad approximations that
does exact matrix calculations using a much faster programming language
library, it was possible to get rid of all the weird behaviour and simply
refute all the weird lemmas!

You of course keep hearing reports like these. In short, those machines
are probably going to sit idle and do useless things like factorizing
(which can be done *at least* 10000x faster in self-made hardware with a
built-in prime base).

Everyone can imagine how the machine came to exist. "We need a super
machine that kicks the hell out of everyone, it must be bigger!"

Salesman: "Ah you want more processors than anyone else!!"

"Sure"

Salesman: "8192 sounds ok to you?"

"Excellent!"

There were of course other problems: this machine needed to be produced
by IBM. Some 10000-processor thing from Intel already existed.
It is called ASCI RED. It had 450MHz Xeon processors. Now i don't
doubt that 375MHz processors could possibly overwhelm a 450MHz 32-bit
processor (which btw is 20% slower than a K7 MP would be at 450MHz) at
certain applications like 64-bit math.

I've been working on Sun processors for years, and forgive me, i do not
remember the types, but when the department bought new machines, they were
300MHz. At that time i had a 450MHz PII at home, and it was very quickly
clear to me that the latest type SUN processor was a joke compared to
the PIIs.

No, they were not slower than a 266MHz PII would have been for me. And
my code had some things which would now do better on a 64-bit machine
but at that time did a bit worse, so i considered it equally fast to a
PII at 300MHz.

But the PII processor was already years old at that time, whereas the
brand-new SUN processor was only clocked at 300MHz!!!!!!!!

Each workstation (single cpu) was 5 times the price of a PII450 system.

Of course that PII450 couldn't be put in a 32-processor shared-memory
system, which the SUN most likely can be.

The PII450 isn't hot swappable etcetera.

So if you really want to run an application which has been written for
a cluster, and can then put it on 8192 processors (which i bet can never
all be used at the same time; most likely you can allocate at most
1000 processors or so for a single job), then there is of course a use
in having such a cluster.

But the speedup over a dual 1.4 MP will be most likely not
even close to a factor 1000.

Factor 100 perhaps?

Pay a programmer a bit and it's a factor 30 perhaps?

Now if this process runs for a week, then for research institutes there
is of course a big advantage, because you are 29 weeks faster!

You need 1 week instead of 30.

Obviously there is a use here to make a huge system, but i would be
pretty amazed if it's getting used like that.

Most likely 100 scientists get a kick out of getting 64 processors
from an 8192-processor machine!

The only real advantage of this machine is again for the meteorologists,
who can use big memory, big storage, and big bandwidths.

But well. They don't need many processors. Just a huge RAM memory!

The bottom line is that compared to a 1.4GHz MP, they already need
16 times more processors for each MP you would use!

If a scientist allocates 32 processors for an application that only
needs processor power, then a dual 1.4 will be faster for them!

If they need its bandwidth, why then create a machine with so many
processors?

>>Still probably optimistic number of nodes a second.
>>So at 8192 processors, from which you can perhaps use a 1000 at a time,
>>I would get 15M nodes a second.

>>Now that looks great, but that's of course on a CLUSTER. Speedup perhaps
>>10%. 1.5M nodes a second effectively, but the bigger the depth the less
>>the speedup gets as the branching factor will be worse, unless i accept
>>that the thing first slows down at each processor (which is a likely
>>approach) and pray that the latency is more than fast at this thing.
>>
>>So you sure outsearch deep blue by many plies, but not if a new deep
>>blue would be pressed on a chip using nullmove and DDR-RAM at it.
>>
>>So you are not faster in NPS, but search improvements would let it
>>search deeper. that still wouldn't make my DIEP faster on this machine
>>than DB was in nodes a second.
>>
>>Of course DBs focus upon only getting the maximum number of NPS (that's
>>how they advertised the thing. search depths have no commercial value)
>>sure made it faster than what i would get on this machine.
>>
>>>Is this really so for those in the know with hardware and these types of
>>>machines?
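
The nodes-per-second estimate quoted above can be sketched as follows.
All inputs are the guesses from this thread (15k nodes a second per
processor, about 1000 usable processors, about 10% cluster speedup), and
Deep Blue's speed is taken as the quoted 200,000,000,000 nps divided by
1,000. The variable names are just for illustration.

```python
nps_per_cpu = 15_000            # DIEP on one 375MHz node (a guess)
usable_cpus = 1_000             # of the 8192, per the estimate above
cluster_efficiency_pct = 10     # assumed effective speedup on a cluster

# Raw total if every processor searched independently.
raw_nps = nps_per_cpu * usable_cpus                        # 15,000,000

# Effective rate after cluster losses (integer math, no rounding noise).
effective_nps = raw_nps * cluster_efficiency_pct // 100    # 1,500,000

# Deep Blue, from "1,000 X's faster" and the 200,000,000,000 nps figure.
deep_blue_nps = 200_000_000_000 // 1_000                   # 200,000,000

print(f"raw: {raw_nps:,} nps, effective: {effective_nps:,} nps")
print(f"Deep Blue is {deep_blue_nps / effective_nps:.0f}x faster in raw nps")
```

Under these guesses Deep Blue stays more than a hundred times ahead in
raw nodes a second, which fits the point that any gain on this machine
would have to come from search improvements and not from NPS.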



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.