Computer Chess Club Archives



Subject: supercomputers and their future for gameplaying

Author: Vincent Diepeveen

Date: 19:42:17 12/22/03


On December 22, 2003 at 21:57:07, Vincent Diepeveen wrote:

Because my report will take a lot of time and will be very extensive, it might be
interesting to share one thing which in fact amazed me a lot by the clarity of
its results.

That is that projects like chessbrain are completely doomed when it comes to
working in real time. *Completely* doomed.

I will show some data here.
The question I asked myself was this: it is trivial that CPUs get faster and
faster each year, yet processor-to-processor latency doesn't keep up. This has
already been shown by several researchers. Under this assumption it follows
that the hashtable can no longer be shared everywhere, so one must use a local
hashtable for the last few plies. A very clear case where this has to happen is
when one searches in a distributed project (like chessbrain).

In Diep there is a parameter that controls which depths get stored into (and
retrieved from) the global hashtable and which get stored only locally. The
tournament version stored all internal nodes into the global hashtable, but the
quiescence search was stored into the local hashtable only.
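To make the idea concrete, here is a minimal sketch in C of what such
depth-based routing could look like. This is my illustration, not Diep's actual
code: the names (global_depth_threshold, hash_store_global, hash_store_local)
and the entry layout are assumptions.

    #include <stdio.h>

    /* Hypothetical hash entry; the layout is illustrative only. */
    typedef struct {
        unsigned long long key;
        int depthleft;
        int score;
    } HashEntry;

    /* Entries with at least this much remaining depth go to the shared
       (global) table; shallower entries and the qsearch stay local. */
    static int global_depth_threshold = 1;   /* tournament-like setting */

    static void hash_store_global(const HashEntry *e) {
        /* stand-in for a remote store: microseconds per access */
        printf("global store, depthleft=%d\n", e->depthleft);
    }

    static void hash_store_local(const HashEntry *e) {
        /* stand-in for a local store: nanoseconds per access */
        printf("local store, depthleft=%d\n", e->depthleft);
    }

    void hash_store(const HashEntry *e) {
        if (e->depthleft >= global_depth_threshold)
            hash_store_global(e);
        else
            hash_store_local(e);
    }

The point of the threshold is that an entry near the root is probed by many
processors and is worth a network round-trip, while an entry near the leaves
usually is not.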

Please remember that I took a position which very quickly showed itself to be a
worst case, and the hashtables were cleared before starting to search it.
That's why I used it.

First, the normal Diep version:
        00:00      0 0 0 (460) 1 (0,0) 2.119 b4xc5
        00:00      0 0 0 (460) 2 (0,0) 2.119 b4xc5 Qc7xc5
        00:02      0 0 0 (460) 3 (0,0) 2.237 b4xc5 Ra8-e8 Nc3-e4 Bd6xc5
        00:02      0 0 0 (460) 4 (0,0) 2.237 b4xc5 Ra8-e8 Nc3-e4 Bd6xc5
        00:40      0 0 0 (460) 5 (0,0) 1.883 b4xc5 Rf8-e8 Nc3-e2 Qc7-a5 c2-c3
        00:45      0 0 0 (460) 6 (0,0) 1.846 b4xc5 Ra8-e8 Nc3-e4 Ne7-f5 Nf3-g5
Bd6xc5
        00:53      0 0 0 (460) 7 (0,0) 2.406 b4xc5 Rf8-e8 Nc3-e4 Ne7-f5 Nf3-g5
Qc7xc5 c2-c3
        00:56      0 0 0 (460) 8 (0,0) 1.739 b4xc5 Rf8-e8 Nc3-e4 Ne7-f5 Nf3-g5
Qc7xc5 Ra1-b1 Nf5-d4
        01:07      0 0 0 (460) 9 (0,0) 1.970 b4xc5 Rf8-e8 Nc3-e4 Ne7-f5 Nf3-g5
Qc7xc5 Qd1-d2 Nf5-d4 c2-c3
        01:34 563556 53024985 0 (460) 10 (15969,151439) 1.204 b4xc5 Rf8-e8
Nf3-g5 Ne7-f5 Nc3-e4 Bd6xc5 Qd1-g4 Nf5-d6 Ng5-e6 Re8xe6 d5xe6
depth=10 nps=607055 ttl=72846666 egtb=0 aborted=48692678
(afh=25543,splits=261134) time=120.00s

        03:08 758373 142589422 0 (460) 11 (47875,499297) 1.095 b4xc5 Rf8-e8
Nc3-e2 Qc7xc5 c2-c4 Ba6xc4 Ba2xc4 Qc5xc4 Qd1-d4 Qc4xd5 O-O
depth=11 nps=813654 ttl=182877066 egtb=0 aborted=112295422
(afh=67685,splits=693612) time=224.76s

        03:58 908254 216881978 0 (460) 12 (78112,841881) 1.050 b4xc5 Rf8-e8
Nc3-e2 Qc7xc5 c2-c4 Ba6xc4 Ba2xc4 Qc5xc4 Qd1-d4 Qc4-c2 Qd4-c3 Qc2-e4
depth=12 nps=957532 ttl=386182356 egtb=0 aborted=241106908
(afh=127633,splits=1400567) time=403.31s

        09:36 1298189 748172751 0 (460) 13 (245417,2398027) 0.444 b4xc5 Rf8-e8
Nc3-e2 Qc7xc5 c2-c4 Ne7xd5 Qd1-d4 Qc5xd4 Nf3xd4 Nd5-c3 Ke1-f2 Bd6-c5 Kf2-f3
depth=13 nps=3168138 ttl=3095778486 egtb=0 aborted=1759560355
(afh=1258955,splits=8177463) time=977.16s

        19:20 3364631 3906067861 0 (460) 14 (1603537,10670458) 0.117 b4xc5
Rf8-e8 Nc3-e2 Qc7xc5 c2-c4 Ne7xd5 Qd1-d4 Qc5xd4 Nf3xd4 Nd5-c3 a3-a4 f4-f3
        50:54 5317603 16243842977 0 (460) 14 (5403791,33550882) 0.212 b4-b5
Ba6-b7 Qd1-d2 Ne7-f5 O-O Nf5-e3 Rf1-f2 Rf8-e8 Qd2-d4 Ra8-c8 Bc1xe3 f4xe3 Rf2-e2
Nc5-d3
depth=14 nps=5015287 ttl=17096414590 egtb=0 aborted=9662055619
(afh=5723154,splits=36131262) time=3408.86s

        01:00:31 5064710 18392394781 0 (460) 15 (6077444,39451488) 0.264 b4-b5
Ba6-b7 Qd1-d2 Ne7-f5 O-O Nf5-e3 Rf1-f2 f7-f5 Qd2-d4 Ra8-e8 Ba2-c4 Nc5-e4 Bc1xe3
f4xe3 a3-a4
depth=15 nps=4294115 ttl=20628201297 egtb=0 aborted=11380590805
(afh=6773788,splits=45841476) time=4803.83s

        01:31:30 4466087 24521993032 0 (460) 16 (7723311,52786168) 0.160 b4-b5
Ba6-b7 Qd1-d2 Ne7-f5 O-O Nf5-e3 Rf1-f2 f7-f5 Nf3-g5 Qc7-b6 Qd2-d4 h7-h6 Ng5-e6
Nc5xe6 Qd4xb6 a7xb6
        01:37:35 4518356 26456741081 0 (460) 16 (8419398,57816187) 0.165 b4xc5
Rf8-e8 Nc3-e2 Qc7xc5 c2-c4 Ne7xd5 Qd1-d4 Qc5xd4 Nf3xd4 Nd5-c3 Ke1-f1 Nc3xa2
Ra1xa2 Ba6xc4 Nd4-f5 Bc4xa2
depth=16 nps=4454773 ttl=32181820283 egtb=0 aborted=16950269941
(afh=10504865,splits=74699573) time=7224.12s

        02:22:43 5110928 43766052604 0 (460) 17 (13940238,91131639) 0.609 b4xc5
Rf8-e8 Nc3-e2 Bd6xc5 d5-d6 Bc5xd6 O-O Ne7-f5 Ne2-d4 Qc7-c5 c2-c4 Ra8-d8 Kg1-h1
Bd6-e5 Nf3xe5 Qc5xe5 Qd1-g4
depth=17 nps=4860131 ttl=49287275195 egtb=0 aborted=25516239384
(afh=15928214,splits=109235230) time=10141.14s

        03:22:12 5265541 63882491771 0 (460) 18 (19968586,133598028) 0.466 b4xc5
Rf8-e8 Nc3-e2 Qc7xc5 c2-c4 Ne7xd5 Qd1-d4 Qc5xd4 Nf3xd4 Nd5-c3 Ke1-f1 Nc3xa2
Ra1xa2 f4-f3 Nd4xf3 Ba6xc4 Ra2-c2 Bc4-d3
depth=18 nps=5272397 ttl=83276304020 egtb=0 aborted=42072953528
(afh=26561999,splits=181805307) time=15794.77s

        05:47:57 5993449 125127338136 0 (460) 19 (37271002,235161437) 0.293
b4xc5 Rf8-e8 Nc3-e2 Bd6xc5 d5-d6 Bc5xd6 O-O Ne7-f5 Ne2-d4 Nf5xd4 Nf3xd4 Ba6xf1
Kg1xf1 Ra8-d8 Bc1-b2 Bd6xa3 Ba2xf7 Kg8-h8 Ra1xa3
depth=19 nps=6361727 ttl=200869193368 egtb=0 aborted=98884061155
(afh=62657776,splits=376446069) time=31574.63s

        11:05:57 6941105 277350209487 0 (460) 20 (81637568,463819222) 0.228
b4xc5 Rf8-e8 Nc3-e2 Bd6xc5 d5-d6 Bc5xd6 O-O Ne7-f5 Ne2-d4 Nf5xd4 Nf3xd4 Ba6xf1
Kg1xf1 Ra8-d8 Ra1-b1 Bd6-c5 c2-c3 Bc5-d6 Qd1-f3 Qc7-e7

In the above output we see that 20 plies are reached after 11 hours. That is
not so important, though. Reaching 10 plies takes 94 seconds. For the first 9
plies the node counts and similar statistics are not collected globally; they
are kept locally. They do get counted at 10 plies, so the numbers from 10 plies
onward are correct. Usually the first 9 plies come out of the hashtable, or go
so fast that there is no need to bother 460 processors with the data
collection, which takes several milliseconds per ply.

Now we change the parameter: the last 10 plies get stored only in the local
hashtable (including the qsearch, of course), and depths above 10 plies get
stored in the global hashtable.
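In terms of the hypothetical sketch above, this run corresponds to raising the
threshold so that only entries with more than 10 plies of remaining depth are
shared:

    /* assumed semantics: last 10 plies (and qsearch) stay local,
       depthleft > 10 goes to the global table */
    global_depth_threshold = 11;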

Note that we only change the hashtable behaviour; splitting and so on still
gets done in the same manner. Diep doesn't use hashtable properties to split
with this many processors (it only does so at a small number of CPUs), so in
reality the results one would get back are a lot worse, because of the extra
time it takes to split the search tree over the different processors. Yet even
then the next result completely shocked me:

        00:00      0 0 0 (460) 1 (0,0) 2.119 b4xc5 Qc7xc5
        00:00      0 0 0 (460) 2 (0,0) 2.119 b4xc5 Qc7xc5
        00:02      0 0 0 (460) 3 (0,0) 2.237 b4xc5 Ra8-e8 Nc3-e4 Bd6xc5
        00:02      0 0 0 (460) 4 (0,0) 2.237 b4xc5 Ra8-e8
        00:14      0 0 0 (460) 5 (0,0) 1.883 b4xc5 Rf8-e8 Nc3-e2
        00:27      0 0 0 (460) 6 (0,0) 1.846 b4xc5 Ra8-e8
        00:36      0 0 0 (460) 7 (0,0) 2.406 b4xc5 Ne7-f5
        01:04      0 0 0 (460) 8 (0,0) 1.824 b4xc5 Bd6xc5
        03:25      0 0 0 (460) 9 (0,0) 1.970 b4xc5 Rf8-e8 Nc3-e4
        08:38 1222567 633607672 0 (460) 10 (148568,1326106) 0.632 b4xc5 Ra8-e8
Nc3-e4
depth=10 nps=1248100 ttl=725720560 egtb=0 aborted=348576875
(afh=194021,splits=1732467) time=581.46s

        58:45 1540704 5431214025 0 (460) 11 (889614,7134097) 1.110 b4xc5 Qc7xc5
depth=11 nps=1552287 ttl=5519793255 egtb=0 aborted=2338418994
(afh=940275,splits=7579001) time=3555.91s

        01:14:21 1567045 6991421375 0 (460) 12 (1415222,10794611) 1.065 b4xc5
depth=12 nps=1592326 ttl=7402982085 egtb=0 aborted=3454686069
(afh=1595556,splits=12242496) time=4649.16s

        02:51:20 1859415 19115107780 0 (460) 13 (2874460,21811050) 0.444 b4xc5
depth=13 nps=2204845 ttl=24139966284 egtb=0 aborted=10848686727
(afh=4834769,splits=32419200) time=10948.60s

Without a global hashtable you're simply dead! A difference of 5 plies in
search depth. When processors get faster and latencies do not keep up, you're
simply dead!

Even during the tournament, where a lot of interesting things happened, when I
saw this result I was completely shocked. I realized that I had to abandon all
the distributed project plans I had at once. In fact I *did* have the plan to
make a distributed Diep version.

But the result is just *too* clear. It will never play a real-time game!
5 plies, with all respect: even with a branching factor of 3.0 that is

3^5 = 243. Starting at a handicap of a factor 243, you're simply history. We
are talking about an exponential effect here.
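To spell out the arithmetic (a back-of-the-envelope model, not a measurement):
if search time grows roughly exponentially in depth, a fixed depth deficit
translates into a fixed multiplicative time handicap:

    % time-to-depth model with branching factor b
    t(d) \approx c \, b^{d}
        \quad\Longrightarrow\quad
    \frac{t(d+5)}{t(d)} = b^{5} = 3.0^{5} = 243

So reaching the same depth as the global-hashtable run would cost roughly a
factor 243 more time, and the factor only grows with a larger branching factor.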

Now this triggered the next experiment I did. When using more modern processors
than the ones in this supercomputer, you already have a problem similar to the
above, but perhaps you can get away with simply not storing the last 2 plies
into the global hashtable.
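Here is a back-of-the-envelope in C of why remote traffic for shallow entries
is so costly. The 5.8 microsecond latency is the machine's figure mentioned
below, the 22k nps per processor is Diep's opening speed quoted further down,
and the one-blocking-probe-per-node model is my simplifying assumption:

    #include <stdio.h>

    int main(void) {
        /* assumptions: see the lead-in above */
        double latency_s       = 5.8e-6;   /* avg remote fetch of 8 bytes */
        double nps_per_cpu     = 22000.0;  /* Diep, 1 cpu, opening        */
        double probes_per_node = 1.0;      /* one blocking probe per node */

        double blocked = nps_per_cpu * probes_per_node * latency_s;
        printf("fraction of time blocked on remote probes: %.0f%%\n",
               blocked * 100.0);           /* prints ~13% */
        return 0;
    }

At 22k nps that is already around 13% of the time spent waiting on the network,
and the fraction grows linearly with processor speed if the latency stays flat.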

Note that with the latency this supercomputer has (5.8 microseconds on average
just to get 8 bytes from another processor), it is already dubious to store
everything into the global hashtable. So let's not store entries with
depthleft <= 2 and then search with 460 processors:

        00:00      0 0 0 (460) 1 (0,0) 2.119 b4xc5 Qc7xc5
        00:00      0 0 0 (460) 2 (0,0) 2.119 b4xc5 Qc7xc5
        00:02      0 0 0 (460) 3 (0,0) 2.237 b4xc5 Ra8-e8 Nc3-e4 Bd6xc5
        00:02      0 0 0 (460) 4 (0,0) 2.237 b4xc5
        00:20      0 0 0 (460) 5 (0,0) 1.883 b4xc5
        00:52      0 0 0 (460) 6 (0,0) 1.846 b4xc5
        01:16      0 0 0 (460) 7 (0,0) 2.406 b4xc5
        01:27      0 0 0 (460) 8 (0,0) 1.824 b4xc5
        05:17      0 0 0 (460) 9 (0,0) 1.970 b4xc5
        05:44 244602 84307066 0 (460) 10 (17774,187986) 1.204 b4xc5 Qc7xc5
depth=10 nps=295935 ttl=107235208 egtb=0 aborted=60058340
(afh=28985,splits=303215) time=362.36s

        06:51 382970 157779897 0 (460) 11 (48121,494174) 1.095 b4xc5
depth=11 nps=442290 ttl=193360776 egtb=0 aborted=114036575
(afh=69008,splits=680388) time=437.18s

        07:35 551816 251313751 0 (460) 12 (95107,914561) 1.050 b4xc5 Qc7xc5
depth=12 nps=749278 ttl=452773756 egtb=0 aborted=287457319
(afh=169753,splits=1603527) time=604.28s

        11:33 1344027 932002639 0 (460) 13 (351281,2939548) 0.444 b4xc5 Qc7xc5
depth=13 nps=3324907 ttl=3753055869 egtb=0 aborted=2206079184
(afh=1475329,splits=8898360) time=1128.77s

        21:06 3722247 4714375656 0 (460) 14 (1891633,11348602) 0.117 b4xc5
        43:53 5137463 13527660552 0 (460) 14 (5270814,30097544) 0.182 b4-b5
depth=14 nps=4905888 ttl=14334219942 egtb=0 aborted=8253355779
(afh=5601330,splits=32690541) time=2921.84s

        57:16 5159575 17732532620 0 (460) 15 (6859755,40484655) 0.351 b4-b5
depth=15 nps=4679243 ttl=19836809206 egtb=0 aborted=11352746534
(afh=7505339,splits=45855978) time=4239.32s

        01:22:45 4792391 23797436449 0 (460) 16 (8718381,52815992) 0.160 b4-b5
        01:29:42 5004482 26937627088 0 (460) 16 (9865863,59231902) 0.165 b4xc5
depth=16 nps=5022682 ttl=33629317406 egtb=0 aborted=18947935034
(afh=12566091,splits=78163568) time=6695.49s

        02:31:09 5371057 48713667924 0 (460) 17 (17430262,102209832) 0.615 b4xc5
depth=17 nps=5133183 ttl=54610551153 egtb=0 aborted=30461988599
(afh=19878164,splits=120558181) time=10638.73s

        03:38:29 5527223 72457973813 0 (460) 18 (25186378,148399656) 0.466 b4xc5
depth=18 nps=5556667 ttl=92296420576 egtb=0 aborted=50576134199
(afh=33264389,splits=201240456) time=16610.03s

        06:20:37 6535370 149251520232 0 (460) 19 (47705397,264107777) 0.293
b4xc5
depth=19 nps=6963743 ttl=247859477324 egtb=0 aborted=131087487407
(afh=83002586,splits=441367183) time=35592.85s

        12:16:54 7458129 329754401671 0 (460) 20 (104829805,537922298) 0.228
b4xc5

In the long run it scales amazingly well, just as well as, if not better than,
the original run. However, the problem is very easy to see: just look at the
time to 10 plies. It goes from 1:34 to 5:44.

The problem is therefore very simple: getting the CPUs to work quickly is a
major problem when you do not store the last few plies into the global
hashtable.

The search tree simply gets more unpredictable and more unbalanced, which is a
real major problem. So searching on big clusters with fast Opteron CPUs, for
example, will be really hard.

I've been working for years in the parallel field, and judging from how Diep
scales on this machine compared to the dinosaurs before me, I have already
achieved more in parallel search than anyone before.

But when the trees get even more unbalanced than they already are nowadays,
because of nullmove and a well-working global hashtable, then the problems
become really massive for future game tree researchers on massively parallel
systems.

Not so much the speedup after 10 hours or so is the problem; basically, getting
real-time scaling will be the real problem.

To me it looks like a very difficult task. The problem lies in the first so
many plies, and solutions are not trivial. Solutions will look like starting
the first CPUs at nearby CPUs, then CPUs that are at one router distance, and
so on; see the sketch below.
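A minimal sketch of that recruitment order, with hypothetical names (Cpu, hops,
recruit); the only point is to hand work to the nearest processors first while
the tree near the root is still small:

    #include <stdlib.h>

    typedef struct {
        int id;
        int hops;   /* router distance from the recruiting cpu */
    } Cpu;

    static int by_hops(const void *a, const void *b) {
        return ((const Cpu *)a)->hops - ((const Cpu *)b)->hops;
    }

    /* Pick at most 'want' helpers from 'pool', nearest first. */
    int recruit(Cpu *pool, int n, int want, Cpu *out) {
        int k = (want < n) ? want : n;
        qsort(pool, n, sizeof(Cpu), by_hops);
        for (int i = 0; i < k; i++)
            out[i] = pool[i];
        return k;
    }

As the iteration deepens and there is enough work near the root, 'want' can be
raised until all 460 processors are busy.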

Making a game tree searching program like that is *really* complicated.

Of course I'm speaking about 100+ CPUs here. Up to 64 CPUs or so I see no
problems so far.

So from my viewpoint the chessbrain project is doomed if it hopes to play games
with a better speedup than a quad Xeon/Opteron will deliver.

In fact the results have shocked me so much that I have also abandoned all
project plans to convert Diep to MPI. As it looks now, Diep will remain
shared-memory forever. Even though the conversion to MPI is pretty trivial from
the current data structures, I simply see no point in doing it!

In a few years, when CPUs are a lot faster than they are now, getting fast CPUs
to work across x-microsecond latencies is asking for trouble!

We do know for sure, however, that in a few years, when CPUs are X times faster
than they are now, the network latencies of the fastest network cards (Myrinet)
will still be around 5 us (right now 10 us) to get a single position.

Not to mention the even bigger problem: efficiently splitting all the CPUs and
getting them to work in such a way that your branching factor doesn't suffer.

The search of chess programs in 2003 simply has too good a branching factor for
a huge cluster to compensate by raw speed (all supercomputers are basically
clusters once they have a huge number of CPUs; even the latencies of machines
like the T3E get expressed in microseconds).

Additionally, those clusters/supercomputers carry processors that are x times
slower than the fastest PC processors. So you always start with yet another
disadvantage, one that some years ago simply wasn't there.

>On December 21, 2003 at 20:51:19, Bob Durrett wrote:
>
>I am writing a huge technical report for the government about Diep at the
>supercomputer. Because 512 processors are so much faster than 1, it takes a
>huge amount of time to calculate the speedups.
>
>Practically speaking, it was in the worst case roughly 2 plies deeper in the
>middlegame than a fast PC (dual Opteron) would have given me. Considering 1
>CPU was in practice already 10 times slower than a PC processor, that is
>pretty good.
>
>In some positions it performed really well. The peak was 9.9 million nps; for
>Diep this is very good. A single processor gives 22k nps in the opening, and
>sequentially it scaled about 40-60% at the supercomputer (so practically
>losing about 11k nps of that 22k nps).
>
>And at tournament time control, which nowadays is a lot quicker than some
>years ago, I do not consider that a bad thing.
>
>Zugzwang and similar programs like Cilkchess ran on similar machines and,
>despite having smaller evaluation functions, got a lot less nps on such
>machines.
>
>Zugzwang, at a lower-latency T3E with 512 Alphas at 500MHz (that's what memory
>tells me; all these guys are good at NOT reacting to the emails I ship them),
>peaked around 2 million nodes a second; I remember it very well.
>
>Diep's logfiles, just like the huge report I'm writing, will be publicly
>available. That means that everyone who asks for them will get them. The
>logfiles you can already get if you want to.
>
>The test results I had will be in the report, and the entire outputs are
>publicly available too.
>
>The thing missing is single-CPU runs to recalculate the test results which
>were obtained by overnight runs.
>
>One of my conclusions will surprise some researchers (not all of them), and
>the government will dislike it. So it will be carefully written down.
>
>Best regards,
>Vincent
>
>> Hi Vincent.
>>
>>Have you had a chance to analyze the GRAZ performance data for your
>>chess-playing program yet?
>>
>>What can you tell us about your technical findings?  Did you achieve the goals
>>you had set for the GRAZ event?
>>
>>Someone said they thought you did not have the opportunity to test your software
>>on the full-up computer hardware prior to the GRAZ event.  If so, did that have
>>an impact?
>>
>>Your findings should shed light on the technical aspects of running chess
>>programs on very large multiprocessor computers [>>8 processors].
>>
>>Bob D.


