Computer Chess Club Archives



Subject: Deep Blue parallel splitting and branching factor

Author: Vincent Diepeveen

Date: 01:23:07 07/08/05



On July 07, 2005 at 16:34:58, Dann Corbit wrote:

[snip]

Your branching factor math with Deep Blue's search depths has one major bug:
they start their search at 1 CPU. If you start at 8 ply and 4 ply of it is in
hardware (they always used depth=4 in hardware), then in the last 4 ply it is
very hard to put even a few processors to work.

At depth=12 it's far easier to put 480 processors to work.

Do you understand?

They initially searched with 10 CPUs or so, and at depth=12 it was peaking at
>= 60 CPUs.
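
For a rough feel of why depth matters so much here (my own illustration, with
made-up constants, not Deep Blue's numbers): by Knuth and Moore, even a
perfectly ordered alpha-beta tree of depth d and branching factor b has about
b^ceil(d/2) + b^floor(d/2) - 1 leaves, so the amount of work you can hand out
explodes between 4, 8 and 12 ply:

  #include <math.h>
  #include <stdio.h>

  /* Leaves of the minimal alpha-beta tree (Knuth/Moore formula).
     b = 35 is the usual rough branching factor quoted for chess. */
  int main(void)
  {
      double b = 35.0;
      for (int d = 4; d <= 12; d += 4) {
          double leaves = pow(b, ceil(d / 2.0))
                        + pow(b, floor(d / 2.0)) - 1.0;
          printf("depth %2d: ~%.3g leaves in the minimal tree\n", d, leaves);
      }
      return 0;
  }

At 4 ply that's a few thousand leaves; at 12 ply it's billions. That's the
difference between starving and feeding hundreds of processors.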

I can show you output from Diep; however, Diep ran on a supercomputer with
latencies of around 5.8 us to read bytes from another CPU. That's a one-way
ping-pong latency of 3-4 us.

Deep Blue's cluster had latencies far worse than that.

I worked for 2 years full-time just to make a data structure that puts 480
CPUs to work. When I finally ran on 512 CPUs (by the way, I initially used
500 CPUs and later on 460 CPUs), I discovered I had to do cruder splitting.

YBW (Young Brothers Wait) just doesn't work at 460 CPUs. It doesn't put them
all to work.
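
For readers who don't know it: YBW says a node may only be split after its
first move (the "eldest brother") has been searched sequentially. A minimal
sketch in C; the types and helpers are stand-ins I made up, not Diep's
interface:

  /* Hypothetical stand-ins -- not Diep's real code. */
  typedef struct Position Position;
  typedef struct { int count; int move[256]; } MoveList;
  extern int  evaluate(Position *pos);
  extern void generate_moves(Position *pos, MoveList *ml);
  extern void make_move(Position *pos, int m);
  extern void unmake_move(Position *pos, int m);
  extern int  idle_cpus_available(void);
  extern int  split_to_idle_cpus(Position *pos, MoveList *ml, int first,
                                 int alpha, int beta, int depth);

  int search(Position *pos, int alpha, int beta, int depth)
  {
      if (depth == 0)
          return evaluate(pos);

      MoveList moves;
      generate_moves(pos, &moves);

      for (int i = 0; i < moves.count; i++) {
          make_move(pos, moves.move[i]);
          int score = -search(pos, -beta, -alpha, depth - 1);
          unmake_move(pos, moves.move[i]);

          if (score >= beta)
              return score;              /* cutoff: nothing to split */
          if (score > alpha)
              alpha = score;

          /* YBW: only after the eldest brother (move 0) is finished may
             the remaining moves be handed to idle CPUs. */
          if (i == 0 && idle_cpus_available())
              return split_to_idle_cpus(pos, &moves, 1, alpha, beta, depth);
      }
      return alpha;
  }

At 460 CPUs there are simply never enough nodes whose eldest brother has just
finished, with idle helpers available at exactly that moment, so most CPUs
sit and wait unless you split more crudely.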

Now please remember this was at 12-20 ply search depths.

At an 8 ply search depth it was searching with at most 10 CPUs.

It's very hard to measure how many CPUs you've been searching with. Just the
call to gather those statistics already slows down the search too much. For
example, a call to the timer means all 460 CPUs must access 1 shared CPU.

So if all 460 CPUs time their search, you can't run 460 CPUs anymore.

In fact you can't even run 120 CPUs then, because just that timer call from
120 CPUs on each search to the same global timer CPU will already completely
traffic-jam the entire machine.

That was the hard problem I had to live with. Above 32 CPUs I could not time
how effectively I was putting the CPUs to work! I discovered this problem 9
months before the world championship.
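
The standard remedy, as a sketch of the principle (not what Diep or that
machine actually provided): keep statistics per CPU and aggregate them
rarely, so the hot path never touches shared memory:

  #define NCPUS 460

  /* BAD: one shared counter -- every update from every CPU lands on
     the same memory location and serializes the machine, exactly the
     traffic jam described above (and it would need a lock or atomics). */
  volatile long global_nodes;

  /* BETTER: per-CPU counters padded to separate cache lines, summed
     only once per iteration. */
  struct { long nodes; char pad[56]; } stats[NCPUS];

  void count_node(int cpu) { stats[cpu].nodes++; }

  long total_nodes(void)       /* called rarely, so it stays cheap */
  {
      long sum = 0;
      for (int cpu = 0; cpu < NCPUS; cpu++)
          sum += stats[cpu].nodes;
      return sum;
  }

Node counts decentralize like this; a single global wall-clock timer does
not, which is presumably what made the timing problem so nasty.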

How about Deep Blue? They only printed their CPU counts starting 2 weeks
before the match.

During the match with Kasparov, Deep Blue crashed many times; if I remember
well, 2 to 3 times in the first game already. Their parallel search must've
been really ugly.

In fact they used the algorithm with the worst speedup possible: just
dividing jobs at n-8 ply.

So at an 8 ply search depth at most 16 CPUs could have searched; at 4 ply the
jobs were divided over the hardware ASICs.

And it's still at most 16 CPUs.
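
To make "divide jobs at n-8 ply" concrete, here is a sketch (the stand-in
types and helpers are the same as in the YBW sketch above; queue_subtree()
is made up): the master walks the top of the tree and cuts off one worker
job per subtree at the split boundary:

  extern void queue_subtree(Position *pos, int depth);  /* hand to worker */

  void enumerate_jobs(Position *pos, int depth, int split_depth)
  {
      if (depth <= split_depth) {     /* cut here: one job per subtree */
          queue_subtree(pos, depth);
          return;
      }
      MoveList moves;
      generate_moves(pos, &moves);
      for (int i = 0; i < moves.count; i++) {
          make_move(pos, moves.move[i]);
          enumerate_jobs(pos, depth - 1, split_depth);
          unmake_move(pos, moves.move[i]);
      }
  }

Note that the jobs carry no alpha-beta bounds from their siblings: workers
happily search subtrees a sequential search would have cut off. That is why
the speedup is about the worst you can get.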

At 10 ply, now we're talking. You can start dividing jobs, but how quickly
do they finish?

Oh, the branching factor will be UGLY. Such a dramatic increase in NPS each
iteration, and still no better than a factor of 4.
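
The arithmetic behind that (numbers invented purely to show how the factor
is computed): the effective branching factor is the ratio of the cost of
consecutive iterations:

  #include <stdio.h>

  int main(void)
  {
      /* node counts per iteration, made up for illustration */
      double nodes[] = { 1.0e6, 4.1e6, 1.7e7, 6.9e7 };   /* ply 9..12 */
      for (int i = 1; i < 4; i++)
          printf("ply %d -> %d: EBF = %.2f\n",
                 8 + i, 9 + i, nodes[i] / nodes[i - 1]);
      return 0;
  }

An EBF of 4 means every extra ply costs 4x the nodes; good sequential
programs of that era were closer to 3.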

Diep had this same problem at 460 processors with unpredicted moves at
depths 8-10.

Yet with nullmove, with hashtables at all plies except in the qsearch, and
with ideal software move ordering, 8-10 ply doesn't eat too many nodes.
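
For completeness, the null-move trick in its standard recursive form (R = 2
here; stand-ins as in the sketches above, not Diep's code). Together with
hashing at every ply, tricks like this are what keep 8-10 ply cheap:

  extern int  qsearch(Position *pos, int alpha, int beta);
  extern int  in_check(Position *pos);
  extern void make_null_move(Position *pos);
  extern void unmake_null_move(Position *pos);

  int search_nullmove(Position *pos, int alpha, int beta, int depth)
  {
      if (depth == 0)
          return qsearch(pos, alpha, beta);

      /* Give the opponent a free move.  If a search reduced by R = 2
         extra plies still fails high, the full search would almost
         surely fail high too, so cut immediately. */
      if (depth >= 3 && !in_check(pos)) {
          make_null_move(pos);
          int score = -search_nullmove(pos, -beta, -beta + 1, depth - 3);
          unmake_null_move(pos);
          if (score >= beta)
              return score;
      }

      /* ... normal move loop as in the YBW sketch above ... */
      return alpha;
  }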

Vincent


