Author: Robert Hyatt
Date: 14:12:20 11/28/98
On November 28, 1998 at 11:11:43, Ernst A. Heinz wrote:

>On November 24, 1998 at 17:17:07, Robert Hyatt wrote:
>>
>> [...]
>>>
>>>Bob,
>>>
>>>AFAIK your 30% overhead is only a good average approximation for lowly parallel
>>>searchers on SMPs with *physically* shared hash tables. For massively parallel
>>>searchers on machines with *physically* distributed memory I have not yet seen
>>>any experimental data that *conclusively* supports such high parallel
>>>efficiency. To the contrary, the only frank publications in this respect seem
>>>to be the articles by the "StarTech" and "StarSocrates" groups, who admit to
>>>something like an application speedup of only 50-60 on a CM-5 with 512 CPUs,
>>>which translates to a parallel efficiency of 10%-15% for their Jamboree search.
>>>Most other researchers who reported higher relative speedups for their
>>>massively parallel implementations on distributed-memory machines either failed
>>>to account for the increases in hash-table sizes or used horribly inefficient
>>>sequential implementations as their point of reference.
>>>
>>>=Ernst=
>>
>>This isn't really an issue about 'shared hash tables'.
>
>Yes, sorry, this was an obvious typo. :-(
>
>I meant *physically* shared memory, which allows for efficient scheduling and
>sharing of work between the parallel processors (e.g. your DTS algorithm).
>
>[BTW, physically shared hash tables also improve the efficiency of these shared
>implementations -- primarily because of better move ordering.]
>
>>Hash tables don't
>>give a factor of 2 in the middlegame based on results I've gotten. This is
>>about "process granularity". The *Socrates machine uses a message-passing
>>protocol that is inherently slow. I've run on such machines (i.e. the CM-5
>>for one, the T3D/E for another) and this causes serious problems. The
>>big + for shared memory is instant communication, so that threads can
>>share information without regard to "cost".
>
>Right (see my comments above).
>
>>The DB machine doesn't suffer from the huge CM-5-type cost; they only use
>>16 (or 32) CPUs, and each CPU talks to the chess processors at bus speeds, not
>>at 2 microseconds/message or whatever as in the CM and other architectures. In
>>fact, the DB (last edition) chess processors didn't use transposition tables;
>>only the search done on the SP did.
>
>I agree with your argument as far as the communication between the chess
>processors and the host CPUs of the SP goes. I remember somebody on the DB team
>mentioning that the limiting factor of this communication link was *not* the
>chess processors but in fact the host CPUs.
>
>So far, so good. Yet the SP itself looks like an extremely poor machine for
>parallel chess because it is essentially a cluster of workstations coupled by
>a special interconnect that features nice throughput but horrible latency, AFAIK.
>With respect to latency, the interconnect of the SP is *much worse* than those
>of the CM-5 and the T3D/T3E. As you and I have already said, it is extremely
>hard to squeeze acceptable parallel alpha-beta search performance out of such
>high-latency distributed-memory machines -- even if you "only" use 32 CPUs.
>
>==> As I have already proclaimed several times before, not the chess processors
>    but rather the 32-node SP host machine looks like the obvious bottleneck of
>    "Deep(er) Blue".
>
>I wonder if and how the "Deep(er) Blue" team succeeded in achieving more than
>30% parallel search efficiency on the SP.
>
>=Ernst=

This is a messy design, to be sure... but if you understand how the chess
processors fit into things, you understand that they essentially have the chess
processors searching to a fixed depth, and that this depth is set directly by
the speed and number of chess processors. So the parallel search on the SP only
occurs in the first N plies anyway, which means granularity is not a big issue.
The only number I don't know accurately, and won't until I remember to ask Hsu,
is what the 250M nodes per second figure represents.
I.e., one quote had them using 512 chess processors, which would max out at over
1B nodes per second. Hsu used to claim they could drive the chess processors at
about 70% efficiency (keeping them fed 70% of the time), because of the slight
mismatch between the chess-processor and base SP processor speeds. 1:1 would be
very unlikely, and under 25% would be hard to believe too, as they would simply
let the chess processors search another ply deeper to slow them down. The only
real numbers I can quote are my own, for Crafty and/or Cray Blitz. Otherwise,
for Deep Blue all I can say is "fast as hell", based on experience. :)
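The arithmetic behind those figures can be sketched quickly. The per-chip speed below (2.5M nodes/sec) is an assumption inferred from "512 processors max out at over 1B nodes per second", not a published number:

```python
# Back-of-the-envelope check of the aggregate search-speed figures above.
# ASSUMPTION: ~2.5M nodes/sec per chess processor, inferred from
# "512 processors max out at over 1B nps"; not a published figure.

n_chips = 512
per_chip_nps = 2.5e6                    # assumed per-processor speed
peak_nps = n_chips * per_chip_nps
print(f"peak aggregate: {peak_nps/1e9:.2f}B nps")       # 1.28B nps

# Hsu's claimed ~70% duty cycle (chips kept fed 70% of the time):
sustained_nps = 0.70 * peak_nps
print(f"at 70% fed:     {sustained_nps/1e9:.2f}B nps")  # 0.90B nps

# If the oft-quoted 250M nps were the sustained speed of all 512 chips,
# the implied duty cycle would sit well below Hsu's 70% claim -- which
# is exactly why it is unclear what the 250M figure actually measures:
implied_duty = 250e6 / peak_nps
print(f"implied duty:   {implied_duty:.0%}")            # 20%
```

Under these assumptions the 250M figure is hard to reconcile with 512 chips at 70% utilization, so it presumably refers to a smaller configuration or to some other average.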