Author: Vincent Diepeveen
Date: 01:23:07 07/08/05
On July 07, 2005 at 16:34:58, Dann Corbit wrote: [snip]

Your branching-factor math with the Deep Blue search depths has one major bug: they start their search at 1 cpu. If you start at 8 ply and 4 ply of it is in hardware (they always used depth=4 in hardware), then it is very hard to put even a few processors to work in the last 4 ply. At depth=12 ply it's far easier to put 480 processors to work. Do you understand? They initially searched with 10 cpus or so, and at depth=12 it was peaking at >= 60 cpus.

I can show you output from Diep; however, Diep ran on a supercomputer with latencies of around 5.8 us to read bytes from another cpu, which is a one-way ping-pong latency of 3-4 us. Deep Blue's cluster had latencies far worse than that.

I worked for 2 years full-time just to make a data structure that could put 480 cpus to work. When I finally ran at 512 cpus (by the way, initially I used 500 cpus and later on 460), I discovered I had to split more crudely. YBW (Young Brothers Wait) just doesn't work at 460 cpus: it doesn't put them all to work (see the split-point sketch below). Now please remember this was at 12-20 ply search depths. At 8 ply search depth it was searching with at most 10 cpus.

It's very hard to measure how many cpus you've actually been searching with. Just the calls to gather those statistics already slow down the search too much. For example, a call to the timer means all 460 cpus must access 1 shared cpu. So if all 460 cpus time their search, you can't run 460 cpus anymore. In fact you can't even run 120 cpus then, because just that time call from 120 cpus, for each search, to the same global timer cpu will already completely traffic-jam the entire machine (see the timing sketch below). That was the hard problem I had to live with: above 32 cpus I could not time how effectively I was putting the cpus to work!

I discovered this problem 9 months before the world champs. How about Deep Blue? They printed cpu statistics just 2 weeks before the match. During the match with Kasparov, Deep Blue crashed many times; if I remember well, 2 to 3 times in the first game already. Their parallel search must've been real ugly.

In fact they used the algorithm with the worst possible speedup: just dividing jobs at n-8 ply. So at 8 ply search depth at most 16 cpus could have searched. At 4 ply the search was divided over the hardware ASICs, and that's still at most 16 cpus. At 10 ply, now we're talking: you can start dividing jobs, but how quickly does it finish? Oh, the branching factor will be UGLY. Such a dramatic increase in nps (nodes per second) each iteration, and still not better than a speedup factor of 4. Diep had this same problem at 460 processors with unpredicted moves at depths 8-10. Yet with nullmove, a hashtable at all plies except in qsearch, and ideal software move ordering, 8-10 ply doesn't eat too many nodes.

Vincent
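A minimal sketch of the split-point scarcity, in C, assuming a uniform tree with effective branching factor b = 3 and a hypothetical rule that subtrees with less than 4 ply remaining are never split (my illustration with made-up parameters, not Diep's or Deep Blue's actual code). Under YBW a counted node only becomes available for splitting after its eldest brother has been searched, so the instantaneous supply of parallel work is even smaller than these totals:

/* Count the nodes of a uniform tree that are deep enough to be worth
 * splitting.  Under YBW (Young Brothers Wait) each such node offers
 * its remaining (b - 1) sibling subtrees as parallel jobs, but only
 * after its first (eldest) move has been fully searched. */
#include <stdio.h>

#define MIN_SPLIT_DEPTH 4  /* hypothetical: never split subtrees shallower than this */

static long long split_points(int depth, int b) {
    if (depth < MIN_SPLIT_DEPTH)
        return 0;                   /* leaf region: splitting is useless here */
    return 1 + (long long)b * split_points(depth - 1, b);
}

int main(void) {
    const int b = 3;                /* rough effective branching factor with nullmove */
    for (int d = 8; d <= 14; d += 2)
        printf("depth %2d: %lld potential split points\n", d, split_points(d, b));
    return 0;
}

With these assumptions an 8-ply tree contains only 121 potential split points in total, while a 12-ply tree contains nearly 10,000, which is why hundreds of cpus sit idle at 8 ply and only find work around 12 ply.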
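And a minimal sketch of the standard engine workaround for the timer problem described above, assuming the usual trick of polling the clock only once every CHECK_INTERVAL nodes and sharing a stop flag, instead of letting every cpu call the timer (hypothetical names, not Diep's actual code):

/* Amortized time check: the searching cpu counts nodes locally and
 * touches the clock only once per CHECK_INTERVAL nodes; all other
 * workers would only read the stop flag from memory. */
#include <stdio.h>
#include <time.h>

#define CHECK_INTERVAL 100000L      /* nodes between clock polls */

static long long nodes_searched = 0;
static volatile int stop_search = 0;   /* written once, read by all workers */
static clock_t deadline;

static void check_time(void) {
    if (++nodes_searched % CHECK_INTERVAL != 0)
        return;                     /* the common case: no timer access at all */
    if (clock() >= deadline)        /* the rare case: one local clock poll */
        stop_search = 1;
}

int main(void) {
    deadline = clock() + CLOCKS_PER_SEC / 10;   /* a 0.1 second budget */
    while (!stop_search)
        check_time();               /* stand-in for the actual search loop */
    printf("stopped after %lld nodes\n", nodes_searched);
    return 0;
}

This way a shared timer cpu gets touched a handful of times per second instead of once per node per cpu, so measuring the time no longer traffic-jams the machine.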