Author: Dann Corbit
Date: 10:24:19 07/08/05
On July 08, 2005 at 04:23:07, Vincent Diepeveen wrote:

>On July 07, 2005 at 16:34:58, Dann Corbit wrote:
>
>[snip]
>
>Your branching-factor math with Deep Blue's search depths has 1 major bug:
>they start their search at 1 CPU. If you start at 8 ply and 4 ply of it is in
>hardware (they always used depth=4 in hardware), then in those last 4 plies it
>is very hard to put even a few processors to work.
>
>At depth=12 ply it's far easier to put 480 processors to work.
>
>Do you understand?
>
>They initially searched with about 10 CPUs, and at depth=12 it was peaking at
>>=60 CPUs.
>
>I can show you output from Diep; however, Diep ran on a supercomputer with
>latencies of around 5.8 us to read bytes from another CPU. That's a one-way
>ping-pong latency of 3-4 us.
>
>Deep Blue's cluster had far worse latencies than that.
>
>I worked for 2 years full-time just to make a data structure that could put
>480 CPUs to work. When I finally ran on 512 CPUs (initially I used 500 CPUs,
>later on 460), I discovered I had to do more crude splitting.
>
>YBW just doesn't work at 460 CPUs. It doesn't put them all to work.
>
>Now please remember this was at 12-20 ply search depths.
>
>At 8 ply search depth it was searching with at most 10 CPUs.
>
>It's very hard to measure how many CPUs you've actually been searching with.
>Just the calls needed to gather those statistics already slow the search down
>too much. For example, a call to the timer means all 460 CPUs must access
>1 shared CPU.
>
>So if all 460 CPUs time their search, you can't run 460 CPUs anymore.
>
>In fact you can't even run 120 CPUs then, because the timer calls from
>120 CPUs, each search hitting the same global timer CPU, will already
>completely traffic-jam the entire machine.
>
>That was the hard problem I had to live with. Above 32 CPUs I could not
>measure how effectively I was putting the CPUs to work! I discovered this
>problem 9 months before the world championship.
>
>How about Deep Blue? They printed their CPUs just 2 weeks before the match.
>
>During the match with Kasparov, Deep Blue crashed many times; if I remember
>correctly, 2 to 3 times in the first game alone. Their parallel search must
>have been really ugly.
>
>In fact they used the algorithm with the worst possible speedup: simply
>dividing jobs at n-8 ply.
>
>So at 8 ply search depth, at most 16 CPUs could have been searching. At 4 ply
>search depth the work was divided over the hardware ASICs. And it's still at
>most 16 CPUs.
>
>At 10 ply, now we're talking. You can start dividing jobs, but how quickly
>does it finish?
>
>Oh, the branching factor will be UGLY. Such a dramatic increase in NPS each
>iteration, and still no better than a factor of 4.
>
>Diep had this same problem at 460 processors with unpredicted moves at
>depths 8-10.
>
>Yet with nullmove, a hashtable probed at all plies except in qsearch, and
>ideal software move ordering, an 8-10 ply search doesn't eat too many nodes.

Yes, you are probably correct. I do not know enough about the problems of
parallel search to extrapolate single-threaded results. I had not thought
about that obvious complication.
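For anyone who hasn't met the term: YBW (Young Brothers Wait) means the first
move at a node is searched serially, and only after it returns a bound may
idle processors pick up the remaining moves. Below is a minimal sketch of the
idea in C with OpenMP. It is not Diep's or Deep Blue's actual code: the game
tree is synthetic (leaf_score just hashes the path so the program is
self-contained) and every name in it is made up for illustration.

#include <stdio.h>
#include <omp.h>

#define WIDTH 8        /* moves per node (synthetic tree, hypothetical) */
#define INF   10000

/* Synthetic leaf evaluation so the sketch runs stand-alone. */
static int leaf_score(unsigned path)
{
    return (int)((path * 2654435761u) >> 24) % 201 - 100;
}

static int search(int depth, int alpha, int beta, unsigned path)
{
    if (depth == 0)
        return leaf_score(path);

    /* YBW: the eldest brother is searched serially first, so any
     * split that follows starts from a meaningful alpha bound. */
    int best = -search(depth - 1, -beta, -alpha, path * WIDTH);
    if (best > alpha) alpha = best;
    if (alpha >= beta)
        return best;               /* cutoff: this node never splits */

    /* The younger brothers may now go to idle processors.  The
     * if-clause mirrors "don't split near the leaves": below depth 4
     * the loop just runs serially on the current thread. */
    #pragma omp parallel for if(depth >= 4) shared(best, alpha)
    for (int m = 1; m < WIDTH; m++) {
        int a, v;
        #pragma omp critical
        a = alpha;                 /* snapshot the shared bound */
        v = -search(depth - 1, -beta, -a, path * WIDTH + m);
        #pragma omp critical
        {
            if (v > best) best = v;
            if (best > alpha) alpha = best;
            /* a real search would abort the remaining brothers here
             * when alpha >= beta; omitted for brevity */
        }
    }
    return best;
}

int main(void)
{
    printf("score = %d\n", search(6, -INF, INF, 1));
    return 0;
}

Compile with cc -fopenmp. The structural point is the one Vincent is making:
the serial eldest-brother pass means that at shallow depths most processors
stand idle waiting for bounds, which is exactly why utilization only ramps up
at 10-12 ply.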
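The shared-timer traffic jam also has a standard workaround worth showing: let
a single master CPU be the only one that ever touches the clock, and have
every other CPU poll a stop flag out of its own cache every few thousand
nodes. A minimal sketch assuming C11 atomics; the names (master_clock,
should_stop, the 4096-node interval) are hypothetical, not taken from Diep.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static atomic_bool stop_flag = false;   /* written by one CPU, read by all */

/* Master: the only routine that touches the shared clock.  In a real
 * engine this would run on its own thread and sleep between polls. */
void master_clock(double seconds)
{
    time_t start = time(NULL);
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed))
        if (difftime(time(NULL), start) >= seconds)
            atomic_store(&stop_flag, true);
}

/* Workers: a branch plus a read of a (usually cached) flag, and no
 * traffic at all toward the timer CPU. */
int should_stop(unsigned long nodes)
{
    if ((nodes & 0xFFFul) != 0)         /* look only every 4096 nodes */
        return 0;
    return atomic_load_explicit(&stop_flag, memory_order_relaxed);
}

int main(void)
{
    master_clock(0.0);                  /* a 0-second limit expires at once */
    printf("stop = %d\n", should_stop(0));
    return 0;
}

On shared-memory hardware the flag sits in every reader's cache until the
master writes it, so hundreds of readers cost next to nothing; on a machine
with the latencies Vincent quotes you would instead piggyback the flag on
messages the CPUs already exchange.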