Author: Ernst A. Heinz
Date: 08:11:43 11/28/98
On November 24, 1998 at 17:17:07, Robert Hyatt wrote:

>[...]
>
>>Bob,
>>
>>AFAIK your 30% overhead is only a good average approximation for lowly parallel
>>searchers on SMPs with *physically* shared hash tables. For massively parallel
>>searchers on machines with *physically* distributed memory I have not yet seen
>>any experimental data that *conclusively* supports such high parallel
>>efficiency. To the contrary, the only frank publications in this respect seem
>>to be the articles by the "StarTech" and "StarSocrates" groups who admit to
>>something like an application speedup of only 50-60 on a CM-5 with 512 CPUs,
>>which translates to a parallel efficiency of 10%-15% for their Jamboree search.
>>Most other researchers who reported higher relative speedups for their
>>massively parallel implementations on distributed-memory machines either failed
>>to account for the increases in hash-table sizes or used horribly inefficient
>>sequential implementations as their point of reference.
>>
>>=Ernst=
>
>This isn't really an issue about 'shared hash tables'.

Yes, sorry, this was an obvious typo. :-( I meant *physically* shared memory,
which allows for efficient scheduling and sharing of work between the parallel
processors (e.g. your DTS algorithm). [BTW, physically shared hash tables also
improve the efficiency of these shared-memory implementations -- primarily
because of better move ordering.]

>Hash tables don't give a factor of 2 in the middlegame based on results I've
>gotten. This is about "process granularity". The *Socrates machine uses a
>message-passing protocol that is inherently slow. I've run on such machines
>(IE the CM-5 for one, the T3D/E for another) and this causes serious problems.
>The big + for shared memory is instant communication, so that threads can
>share information without regard to "cost".

Right (see my comments above).
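As a quick sanity check of the StarTech/StarSocrates numbers quoted above: parallel efficiency is just achieved speedup divided by the ideal linear speedup (one unit per CPU). A minimal sketch of the arithmetic:

```python
def parallel_efficiency(speedup, num_cpus):
    """Parallel efficiency: achieved speedup as a fraction of the
    ideal linear speedup on num_cpus processors."""
    return speedup / num_cpus

# Figures quoted above: application speedup of 50-60 on a 512-CPU CM-5.
low = parallel_efficiency(50, 512)   # ~0.098, i.e. about 10%
high = parallel_efficiency(60, 512)  # ~0.117, i.e. about 12%
print(f"{low:.1%} to {high:.1%}")
```

So a speedup of 50-60 on 512 CPUs indeed lands in the 10%-15% efficiency range claimed for the Jamboree search.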
>The DB machine doesn't suffer from the huge CM-5 type cost, they only use
>16 (or 32) cpus, and each CPU talks to the chess processors at bus speeds, not
>at 2 microseconds/message or whatever as in the CM and other architectures. In
>fact, the DB (last edition) chess processors didn't use transposition tables,
>only the search done on the SP did.

I agree with your argument as far as the communication between the chess
processors and the host CPUs of the SP is concerned. I remember somebody on the
DB team mentioning that the limiting factor of this communication link was
*not* the chess processors but in fact the host CPUs. So far, so good.

Yet the SP itself looks like an extremely poor machine for parallel chess
because it is essentially a cluster of workstations coupled by a special
interconnect that features nice throughput but horrible latency AFAIK. With
respect to latency, the interconnect of the SP is *much worse* than those of
the CM-5 and the T3D/T3E. As you and I already said, it is extremely hard to
squeeze acceptable parallel alpha-beta search performance out of such
high-latency distributed-memory machines -- even if you "only" use 32 CPUs.

==> As I have proclaimed several times before, not the chess processors but
rather the 32-node SP host machine looks like the obvious bottleneck of
"Deep(er) Blue". I wonder if and how the "Deep(er) Blue" team succeeded in
achieving more than 30% parallel search efficiency on the SP.

=Ernst=
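P.S. To illustrate why per-message latency matters so much here, a crude back-of-the-envelope model: if every split point in the parallel search costs a few message round trips, the fraction of time a worker loses to communication scales directly with the interconnect latency. All the numbers below (messages per split, split rate, and the latency figures) are hypothetical, chosen only to show the shape of the effect:

```python
def comm_overhead_fraction(msgs_per_split, latency_s, splits_per_s):
    """Fraction of wall-clock time a worker spends waiting on messages,
    in a crude model where each split costs msgs_per_split round trips,
    each taking latency_s seconds. Capped at 1.0 (all time lost)."""
    return min(1.0, msgs_per_split * latency_s * splits_per_s)

# Hypothetical workload: 4 messages per split, 1000 splits/s per worker.
for name, latency in [("shared memory (~1 us)", 1e-6),
                      ("CM-5 style (~2 us)", 2e-6),
                      ("high-latency cluster (~40 us)", 40e-6)]:
    print(name, f"{comm_overhead_fraction(4, latency, 1000):.1%}")
```

Under these (made-up) parameters the communication overhead stays well under 1% at microsecond latencies but climbs into the double digits once the per-message latency grows by an order of magnitude -- which is the qualitative point about the SP interconnect above.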