Author: Dann Corbit
Date: 10:24:19 07/08/05
On July 08, 2005 at 04:23:07, Vincent Diepeveen wrote:

>On July 07, 2005 at 16:34:58, Dann Corbit wrote:
>
>[snip]
>
>Your branching-factor math with Deep Blue's search depths has 1 major bug:
>they start their search at 1 CPU. If you start at 8 ply and 4 ply of it is in
>hardware (they always used depth=4 in hardware), then in those last 4 plies it
>is very hard to put even a few processors to work.
>
>At depth=12 ply it's far easier to put 480 processors to work.
>
>Do you understand?
>
>They initially searched with about 10 CPUs, and at depth=12 it was peaking at
>>=60 CPUs.
>
>I can show you output from Diep; however, Diep ran on a supercomputer with
>latencies of around 5.8 us to read bytes from another CPU. That's a one-way
>ping-pong latency of 3-4 us.
>
>Deep Blue's cluster had far worse latencies than that.
>
>I worked for 2 years full-time just to make a data structure that could put
>480 CPUs to work. When I finally ran on 512 CPUs (initially I used 500 CPUs,
>later on 460), I discovered I had to do more crude splitting.
>
>YBW just doesn't work at 460 CPUs. It doesn't put them all to work.
>
>Now please remember this was at 12-20 ply search depths.
>
>At 8 ply search depth it was searching with at most 10 CPUs.
>
>It's very hard to measure how many CPUs you've actually been searching with.
>Just the calls needed to gather those statistics already slow the search down
>too much. For example, a call to the timer means all 460 CPUs must access
>1 shared CPU.
>
>So if all 460 CPUs time their search, you can't run 460 CPUs anymore.
>
>In fact you can't even run 120 CPUs then, because the timer calls from
>120 CPUs, each search hitting the same global timer CPU, will already
>completely traffic-jam the entire machine.
>
>That was the hard problem I had to live with. Above 32 CPUs I could not
>measure how effectively I was putting the CPUs to work! I discovered this
>problem 9 months before the world championship.
>
>How about Deep Blue? They printed their CPUs just 2 weeks before the match.
>
>During the match with Kasparov, Deep Blue crashed many times; if I remember
>correctly, 2 to 3 times in the first game alone. Their parallel search must
>have been really ugly.
>
>In fact they used the algorithm with the worst possible speedup: simply
>dividing jobs at n-8 ply.
>
>So at 8 ply search depth, at most 16 CPUs could have been searching. At 4 ply
>search depth the work was divided over the hardware ASICs. And it's still at
>most 16 CPUs.
>
>At 10 ply, now we're talking. You can start dividing jobs, but how quickly
>does it finish?
>
>Oh, the branching factor will be UGLY. Such a dramatic increase in NPS each
>iteration, and still no better than a factor of 4.
>
>Diep had this same problem at 460 processors with unpredicted moves at
>depths 8-10.
>
>Yet with nullmove, a hashtable probed at all plies except in qsearch, and
>ideal software move ordering, an 8-10 ply search doesn't eat too many nodes.

Yes, you are probably correct. I do not know enough about the problems of
parallel search to extrapolate single-threaded results. I had not thought
about that obvious complication.
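For anyone who hasn't met the term: YBW (Young Brothers Wait) means the first
move at a node is searched serially, and only after it returns a bound may
idle processors pick up the remaining moves. Below is a minimal sketch of the
idea in C with OpenMP. It is not Diep's or Deep Blue's actual code: the game
tree is synthetic (leaf_score just hashes the path so the program is
self-contained) and every name in it is made up for illustration.

#include <stdio.h>
#include <omp.h>

#define WIDTH 8        /* moves per node (synthetic tree, hypothetical) */
#define INF   10000

/* Synthetic leaf evaluation so the sketch runs stand-alone. */
static int leaf_score(unsigned path)
{
    return (int)((path * 2654435761u) >> 24) % 201 - 100;
}

static int search(int depth, int alpha, int beta, unsigned path)
{
    if (depth == 0)
        return leaf_score(path);

    /* YBW: the eldest brother is searched serially first, so any
     * split that follows starts from a meaningful alpha bound. */
    int best = -search(depth - 1, -beta, -alpha, path * WIDTH);
    if (best > alpha) alpha = best;
    if (alpha >= beta)
        return best;               /* cutoff: this node never splits */

    /* The younger brothers may now go to idle processors.  The
     * if-clause mirrors "don't split near the leaves": below depth 4
     * the loop just runs serially on the current thread. */
    #pragma omp parallel for if(depth >= 4) shared(best, alpha)
    for (int m = 1; m < WIDTH; m++) {
        int a, v;
        #pragma omp critical
        a = alpha;                 /* snapshot the shared bound */
        v = -search(depth - 1, -beta, -a, path * WIDTH + m);
        #pragma omp critical
        {
            if (v > best) best = v;
            if (best > alpha) alpha = best;
            /* a real search would abort the remaining brothers here
             * when alpha >= beta; omitted for brevity */
        }
    }
    return best;
}

int main(void)
{
    printf("score = %d\n", search(6, -INF, INF, 1));
    return 0;
}

Compile with cc -fopenmp. The structural point is the one Vincent is making:
the serial eldest-brother pass means that at shallow depths most processors
stand idle waiting for bounds, which is exactly why utilization only ramps up
at 10-12 ply.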
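The shared-timer traffic jam also has a standard workaround worth showing: let
a single master CPU be the only one that ever touches the clock, and have
every other CPU poll a stop flag out of its own cache every few thousand
nodes. A minimal sketch assuming C11 atomics; the names (master_clock,
should_stop, the 4096-node interval) are hypothetical, not taken from Diep.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static atomic_bool stop_flag = false;   /* written by one CPU, read by all */

/* Master: the only routine that touches the shared clock.  In a real
 * engine this would run on its own thread and sleep between polls. */
void master_clock(double seconds)
{
    time_t start = time(NULL);
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed))
        if (difftime(time(NULL), start) >= seconds)
            atomic_store(&stop_flag, true);
}

/* Workers: a branch plus a read of a (usually cached) flag, and no
 * traffic at all toward the timer CPU. */
int should_stop(unsigned long nodes)
{
    if ((nodes & 0xFFFul) != 0)         /* look only every 4096 nodes */
        return 0;
    return atomic_load_explicit(&stop_flag, memory_order_relaxed);
}

int main(void)
{
    master_clock(0.0);                  /* a 0-second limit expires at once */
    printf("stop = %d\n", should_stop(0));
    return 0;
}

On shared-memory hardware the flag sits in every reader's cache until the
master writes it, so hundreds of readers cost next to nothing; on a machine
with the latencies Vincent quotes you would instead piggyback the flag on
messages the CPUs already exchange.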