Author: Robert Hyatt
Date: 13:53:30 09/18/01
On September 18, 2001 at 14:45:45, Vincent Diepeveen wrote:

>On September 18, 2001 at 12:46:23, Robert Hyatt wrote:
>
>>On September 18, 2001 at 11:54:45, Vincent Diepeveen wrote:
>>
>>>On September 18, 2001 at 11:07:27, Gian-Carlo Pascutto wrote:
>>>
>>>>On September 18, 2001 at 09:10:02, Vincent Diepeveen wrote:
>>>>
>>>>>Well, if no one here manages to do that, who am I to say that the
>>>>>remainder of this algorithm is worth trying?
>>>>
>>>>Perhaps just no one bothered?
>>>>
>>>>>Which means that APHID already says who has to search what before the
>>>>>relevance of a parallel split has been indicated. Considering that
>>>>>nowadays we use nullmove bigtime, this makes APHID completely outdated,
>>>>>because in short it doesn't wait at all!
>>>>
>>>>The Crafty they used had nullmove R=2. Nowadays most people (I know
>>>>you don't, but you aren't everybody, even if you continuously think so)
>>>>use R=2/R=3. I'm pretty sure that isn't going to make the difference
>>>>between useful and useless.
>>>
>>>All the test results they published are based upon 5 to 8 ply searches
>>>and upon a program called TheTurk. No mention of Crafty speedups,
>>>and if there are any, it's still 5 to 8 ply.
>>>
>>>Who cares about 5 to 8 ply speedups in the year 2001?
>>>
>>>My speedup is horrible at 7 ply, no kidding, but look how my speedup is
>>>at bigger depths!
>>
>>My speedup is no different at depth=7 than it is at depth=10. It is a bit
>>more efficient at deeper depths, but not 2x. Or even 1.5x.
>>
>>Here's some data, searched to depth=9. I can't really time depth=7
>>searches, as they take a few hundredths of a second and that
>>is not very accurate.
>>
>>1cpu: 2.80
>>4cpu: 1.02
>
>So that's about 2.75 for the Bxe4 trick.
>
>>1cpu: 3.92
>>4cpu: 1.39
>
>So for f6 it is 2.8.
>
>>Now for 12 plies:
>>
>>1cpu: 87.00 secs
>>4cpu: 21.08 seconds
>
>So now it is about a 4.0 speedup for the Bxe4 trick.
>
>This is a huge difference!
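All the speedup figures traded in this thread are plain wall-clock ratios: single-cpu time divided by parallel time. A minimal sketch that reproduces the Crafty numbers quoted above (the helper names are illustrative; the `mm:ss` parsing matches the time format of the search logs elsewhere in this thread):

```python
def seconds(t):
    """Parse a wall-clock time that is either plain seconds ('87.00')
    or minutes:seconds ('16:25'), as printed in the search logs."""
    if ":" in t:
        m, s = t.split(":")
        return int(m) * 60 + int(s)
    return float(t)

def speedup(serial, parallel):
    """Parallel speedup = serial wall time / parallel wall time."""
    return seconds(serial) / seconds(parallel)

# Crafty timings quoted above (1 cpu vs 4 cpus):
print(round(speedup("2.80", "1.02"), 2))    # depth 9, kopec22  -> 2.75
print(round(speedup("3.92", "1.39"), 2))    # depth 9, kopec23  -> 2.82
print(round(speedup("87.00", "21.08"), 2))  # depth 12, kopec22 -> 4.13
```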
>
>>1cpu: 100.00 secs
>>4cpu: 40.17 secs
>
>Now it is 2.5, but in neither case do we know your
>search output, so this is not very meaningful data. If your root move
>flipped one more time here, or if your score changed, this can have
>a devastating influence on the speedup!
>
>I did the same test for DIEP at the Bxe4 position. Note that I only have
>2 processors at home, so regrettably I can't try 4, but here is the
>output, first for a single cpu:
>
>00:00 0 0 13 (0) 1 -1.661 Bb7xe4 Bd3xe4 Nf6xe4 Qe2xe4 Qc7xc4 Qe4xc4 Rc8xc4
>00:00 0 0 26 (0) 1 0.269 Rf8-e8
>00:00 0 0 31 (0) 1 0.334 Nd7-c5 Na4xc5 Qc7xc5
>00:00 0 0 114 (0) 2 0.334 Nd7-c5 Na4xc5 Qc7xc5
>00:00 0 0 411 (0) 3 0.334 Nd7-c5 Na4xc5 Qc7xc5
>00:00 0 0 1187 (0) 4 0.013 Nd7-c5 Na4xc5 Qc7xc5 Bb2-d4
>++ f8-e8
>00:00 0 0 3227 (0) 4 0.167 Rf8-e8 Bb2-d4 Re8-d8 Ra1-c1
>++ f8-d8
>00:00 0 0 4483 (0) 4 0.286 Rf8-d8 Bb2-d4 Nd7-c5 Na4xc5 d6xc5
>00:00 0 0 7216 (0) 5 0.286 Rf8-d8 Bb2-d4 Nd7-c5 Na4xc5 d6xc5
>00:00 0 0 18186 (0) 6 0.216 Rf8-d8 Bb2-d4 Kg8-f8 g2-g3 Nd7-c5 Na4xc5 d6xc5
>00:02 0 0 95675 (0) 7 0.221 Rf8-d8 Bb2-d4 Kg8-f8 g2-g3 Kf8-g8 Ra1-b1 Nd7-c5 Na4xc5 d6xc5
>++ b7-e4
>00:02 0 0 132048 (0) 7 0.765 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Bb2xf6 Be7xf6 Be4-c6 Rc4xa4 Bc6xd7 Bf6xa1 Rd1xa1
>00:03 0 0 181641 (0) 8 1.006 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Bb2-d4
>00:05 0 0 334446 (0) 9 1.055 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Rd1-d4 Rc4xd4 Bb2xd4 Nf6xe4 Na4xb6
>00:09 0 0 595985 (0) 10 0.661 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Bb2-c3
>00:21 0 0 1479280 (0) 11 1.181 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 a3-a4 Rf8-c8
>00:36 0 0 2616558 (0) 12 0.886 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d2 Re4-a4 g2-g3
>01:32 0 0 6542438 (0) 13 1.072 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d4 Be7-f6 a3-a4 d6-d5
>02:54 0 0 12694811 (0) 14 0.724 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d4 Be7-f6 f2-f3 Re4-e3 a3-a4
>16:25 0 0 57592380 (0) 15 0.949 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d4 Be7-f6 f2-f3 Re4-e3 Kg1-f2 Bf6-g5
>
>Now the parallel version:
>
>00:00 0 0 13 (0) 1 -1.661 Bb7xe4 Bd3xe4 Nf6xe4 Qe2xe4 Qc7xc4 Qe4xc4 Rc8xc4
>00:00 0 0 26 (0) 1 0.269 Rf8-e8
>00:00 0 0 31 (0) 1 0.334 Nd7-c5 Na4xc5 Qc7xc5
>00:00 0 0 109 (0) 2 0.334 Nd7-c5 Na4xc5 Qc7xc5
>00:00 0 0 406 (0) 3 0.334 Nd7-c5 Na4xc5 Qc7xc5
>00:00 0 0 1182 (0) 4 0.013 Nd7-c5 Na4xc5 Qc7xc5 Bb2-d4
>++ f8-e8
>00:00 0 0 3122 (717) 4 0.167 Rf8-e8 Bb2-d4 Re8-d8 Ra1-c1
>++ f8-d8
>00:00 0 0 4412 (856) 4 0.286 Rf8-d8 Bb2-d4 Nd7-c5 Na4xc5 d6xc5
>00:00 0 0 7150 (856) 5 0.286 Rf8-d8 Bb2-d4 Nd7-c5 Na4xc5 d6xc5
>00:00 0 0 17603 (2303) 6 0.196 Rf8-d8 Bb2-d4 Bb7-a8 Ra1-b1 Nd7-c5 Na4xc5 d6xc5
>00:01 0 0 92406 (9177) 7 0.214 Rf8-d8 Bb2-d4 Bb7-a8 Qe2-e3 Nd7-c5 Na4xc5 d6xc5
>++ b7-e4
>00:02 0 0 155152 (14850) 7 0.765 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Bb2xf6 Be7xf6 Be4-c6 Rc4xa4 Bc6xd7 Bf6xa1 Rd1xa1
>00:03 0 0 201232 (15711) 8 1.006 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Bb2-d4
>00:05 0 0 345901 (17111) 9 1.055 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Rd1-d4 Rc4xd4 Bb2xd4 Nf6xe4 Na4xb6
>00:08 0 0 601497 (20381) 10 0.661 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Bb2-c3
>00:14 0 0 1407418 (20427) 11 1.181 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 a3-a4 Rf8-c8
>00:21 0 0 2484003 (20675) 12 0.886 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d2 Re4-a4 g2-g3
>00:46 0 0 5861681 (163344) 13 1.072 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d4 Be7-f6 a3-a4 d6-d5
>01:27 0 0 11748805 (387481) 14 0.724 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d4 Be7-f6 f2-f3 Re4-e3 a3-a4
>04:52 0 0 36402494 (1490825) 15 0.949 Bb7xe4 Bd3xe4 Qc7xc4 Qe2xc4 Rc8xc4 Na4xb6 Rc4xe4 Nb6xd7 Nf6xd7 Nf3-d4 Be7-f6 f2-f3 Re4-e3 Kg1-f2 Bf6-g5
>
>At 8 ply I do not notice any speedup.
>
>But the speedup clearly goes up slowly as the depth goes up.
>
>So the speedup slowly reaches 2.0, then exceeds 2.0 at depth==15, with no
>modifications in the PV and no changes in the root score.
>
>Speedup at 15 ply is actually 16:52 / 4:52 = 1012 / 292 = 3.47 speedup.
>
>Now this is an easy position to get a good speedup on, of course, so don't
>think that my *average* speedup is 3.47.

Neither do I. You have a serious bug if you can produce 3.47 using two
processors on different positions. It simply is not possible. I.e. if you can
go 4x faster with 2 processors, you could go 16x faster with 4 processors.
That I don't believe, as I have _never_ seen anyone else report such a number
and not later find serious bugs.

>
>Note that the above DIEP version I used is a version *without*
>dangerous extensions.
>
>>In each of the two positions tested above, I ran the thing with one
>>processor and then with 4 processors. Everything else
>>remained exactly the same. The first position is kopec22,
>>the second is kopec23. Chosen at random for no good reason.
>
>There is easy move ordering in the first position, and the score is not
>slowly going down, so I'm not so sure whether it's smart to use this one,
>but here the results from DIEP speak for themselves. Also, at one of the
>positions you tried, you get a speedup increase.
>
>Now I'll be the last to say that the worst-case speedup from Crafty at 4
>processors is very good. But there is definitely evidence for you to keep
>testing, too, though not with APHID :)
>
>>I can do perfectly fine with very short searches. Or very long searches.
>
>I find 2.75 at a short search versus 4.0 at a longer search very clearly
>not what you conclude.

Basically 3x vs 4x. I have some start-up costs for single-position searches
that make this look worse than it really is. I start the 3 extra threads
_after_ the search clock is started, so that I get "billed" for creating
three new processes.
Once they are created, they last for the whole game, but in this case the
whole game is one move. The cost hurts. I could have run a dummy problem
first to get them started, but it wasn't worth the trouble.

>
>>But remember that you and I are not splitting the tree in the same way,
>>because I can split at the root when it is reasonable to do so. That
>>helps the speedup quite a bit.
>
>It doesn't work for me, and I find it very dangerous. Here in this position,
>the ply after you have found Bxe4, it is not a big problem of course.

Splitting at the root is not dangerous. In fact, it is the _optimal_ place to
split after you have searched the first move. You get _zero_ search overhead
due to bad move ordering. Make that 0.00...

>
>>>I don't want to crack down on the results from the APHID team back in
>>>1996, but we must clearly realize that what worked back then is NOT
>>>going to work in 2001 anymore!
>>
>>Again, would you care to make a small wager? They got about 33% efficiency
>>on their hardware. I think that can be improved on with some more work.
>>They only wanted to get it to work, not to work very efficiently. And I
>>think that 50% efficiency is doable. And on my giganet hardware I think
>>that well beyond 50% is very doable.
>
>33% with 8 ply searches in 1996, with a program getting 2500 nodes a second
>on 16 expensive workstations: I find that very poor!
>
>Even if you tossed the ball over by hand, you could get a better speedup at
>8 ply at such a slow speed!
>
>By the end of 1996 I already got 5000-10000 nodes a second on a single
>PentiumPro 200, and I'm pretty sure that DIEP back in 1996 was not a very
>fast program either; they needed 16 very expensive workstations to get
>2500 nodes a second in total!

I haven't looked at their data. But I have worked with parallel applications
for a long time. And I claim to understand the ins and outs of alpha/beta
search and parallel issues pretty well.
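The root-splitting idea argued for above, search the first move alone and only then hand the siblings out, is Young Brothers Wait applied at the root. A toy sketch of that control flow (the move list, canned scores, and all names here are illustrative; this is not Crafty's code):

```python
from concurrent.futures import ThreadPoolExecutor

def search(move, alpha, beta, depth):
    """Stand-in for a full alpha-beta search of one root move.
    It returns a canned score here; a real engine searches the subtree."""
    canned = {"Bxe4": 95, "Rfd8": 29, "Nc5": 33, "f6": 12}
    return canned.get(move, 0)

def root_split(moves, depth):
    """Young Brothers Wait at the root: search the first (best-ordered) move
    serially to establish a bound, then search the remaining moves in
    parallel against that bound. With a decent bound the parallel searches
    are mostly cheap refutations, so bad move ordering costs nothing extra."""
    alpha = search(moves[0], -10000, 10000, depth)
    best = moves[0]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(
            lambda m: (search(m, alpha, alpha + 1, depth), m), moves[1:]))
    for score, move in results:
        if score > alpha:      # fail high: a real engine would re-search here
            alpha, best = score, move
    return best, alpha

print(root_split(["Rfd8", "Nc5", "Bxe4", "f6"], 9))  # -> ('Bxe4', 95)
```

The point of searching the first move serially is exactly the "zero overhead" claim: once the bound from the best-ordered move is in hand, the sibling searches do no work they would not also have done serially.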
I don't see a _thing_ that would suggest their speedup won't hold on today's
hardware. I have had a couple of students do parallel othello programs on a
distributed cluster. They did pretty well...

>
>If I slowed DIEP down by a factor of 100 or something, then I'm sure that I
>could even get a better speedup over the internet with it at 16 nodes than
>what these guys showed!

No, you wouldn't. They used whatever hardware they had at the time. If you
ramp up the CPU speed _and_ the network speed, things tend to scale uniformly
if the two speeds increase uniformly.

>
>>>>>This refers to the fact that YBW at each node needs a result for the
>>>>>first move before splitting off the other moves. Now this is of course
>>>>>very true, but it also directly shows that APHID's branching factor is
>>>>>by definition bigger than that of algorithms which use YBW.
>>>>
>>>>I don't think they make any claim that their branching factor is better
>>>>than YBW's. That would seem pretty silly to me. But then again, YBW
>>>>probably gets killed in different ways when having to deal with slow
>>>>interprocess comms on a cluster.
>>
>>YBW doesn't exclude clusters. In a tree search, the number of "split"
>>operations has to be constrained so that the overhead of splitting doesn't
>>dominate the work done in actually searching the tree. I control this in
>>Crafty by limiting how near the tips of the branches I can split the tree.
>
>Yes, that's true, but you might be underestimating something very important
>here. You use nullmove bigtime in Crafty. That means that missing
>transposition-table lookups from other processes in the last few plies means
>you definitely have a problem.

It isn't _that_ big of a problem. I ran that test for you a long while back,
about not hashing in the last N plies of the search. It didn't hurt me nearly
as badly as you said it would. And in some cases it probably makes sense, in
fact.
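The "limit how near the tips you may split" constraint mentioned above can be made concrete with a toy count of allowed split points in a uniform tree (the constant and names are made up for illustration, not Crafty's actual values):

```python
MIN_SPLIT_DEPTH = 4  # hypothetical tuning constant, not Crafty's real value

def can_split(remaining_depth, idle_processors):
    """Allow a split only where the subtree is big enough to amortize the
    cost of the split operation itself; near the tips, searching a subtree
    is cheaper than splitting it."""
    return remaining_depth >= MIN_SPLIT_DEPTH and idle_processors > 0

def split_points(depth, width=3):
    """Count the nodes of a uniform tree where a split would be allowed."""
    if depth == 0:
        return 0
    allowed = 1 if can_split(depth, 1) else 0
    return allowed + width * split_points(depth - 1, width)

print(split_points(6))  # -> 13: splits only in the top 3 plies of the tree
```

Even in this tiny width-3, depth-6 tree, the threshold confines splitting to 13 nodes out of over a thousand, which is what keeps split overhead from dominating the search.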
You already do that if you store "depth-preferred", as the last few plies
can't overwrite all those deeper entries.

>
>Also, another small thing is that each processor had a local hashtable;
>Rainer Feldmann said: "let's see all the RAM of each node as one big
>distributed transposition table".
>
>Now that means that his hashtable in his approach was n times bigger than
>might be the case in your current plans!

I plan on doing a shared hash too. At least at early plies in the tree. Near
the tips I might do local-only, I don't know.

>
>>For a distributed approach, I intend to add a second bound, which limits
>>distributed splits even more severely, because the overhead is going to be
>>higher due to network communication/latency.
>
>In that case you cannot even distribute the hashtable. Also, the branching
>factor is going to be huge, and the time needed to split a subtree is going
>to be seconds.

I can split a tree with one packet. It won't take seconds.

>
>To avoid loads of idle time, this would mean that a search is going to need
>tens of minutes; otherwise the relation between idling and searching is
>going to be bigtime in favor of the idling.

That is nonsense. If I can guarantee that a piece of work will take at least
one second, the overhead will be minimal, because I can send a big packet in
1/100th of a second using simple 100mbit ethernet.

>
>>I think the difference between myself and Vincent is that I am in the
>>business of trying to find out _how_ to accomplish something, no matter
>>how difficult it might seem at first glance. I don't assume everything is
>>impossible if I haven't done it myself. I assume everything is possible,
>>until I prove that it can't be done by trying every possible alternative
>>I can think of. I.e. I would rather drive a nail using a hammer. But I
>>have done it using pliers, a wrench, a rock, a screwdriver handle, and
>>even a C-clamp.
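The "depth-preferred" storing mentioned above means a new entry evicts an occupant of a different position only when it was searched at least as deeply, so shallow near-tip stores bounce off deep entries. A minimal single-probe sketch (class and method names are hypothetical; real engines pack entries into small fixed-size records and often pair this table with an always-replace one):

```python
from dataclasses import dataclass

@dataclass
class HashEntry:
    key: int     # full position key, to detect collisions within a slot
    depth: int   # draft (remaining depth) of the stored search
    score: int

class DepthPreferredTable:
    """Depth-preferred replacement: overwrite an entry for a different
    position only if the new search is at least as deep. This is why the
    last few plies "can't overwrite all those deeper entries"."""

    def __init__(self, size=1 << 16):
        self.size = size
        self.slots = [None] * size

    def store(self, key, depth, score):
        slot = key % self.size
        old = self.slots[slot]
        if old is None or old.key == key or depth >= old.depth:
            self.slots[slot] = HashEntry(key, depth, score)

    def probe(self, key):
        entry = self.slots[key % self.size]
        return entry if entry is not None and entry.key == key else None
```

For example, with a 16-slot table, a depth-2 store that collides with an existing depth-10 entry is simply dropped, while a depth-12 store for the same slot replaces it.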
>
>Well, I'm from Europe, and we're more careful here.
>I have already done huge calculations and many network
>experiments on whether it was possible
> a) to get DIEP working over the network (100mbits) with a very limited
>    number of nodes (n=2)
> b) to not have a completely shared hashtable
> c) to distribute over the net
>
>Of course, c) I didn't even start experimenting with, because the 100mbits
>already proved not enough for nowadays' search depths. Also, the big bummer
>for distributed DIEP was the big system time wasted on network
>communication.
>
>On paper it is little; in reality it slows a program down X times.
>
>If you don't slow down, the problem is branching factor, as I can clearly
>show here. My speedup is 2.0 or more at bigger depths.
>
>My speedup sucks at 5-8 ply. At 8 ply, in fact, the speedup here reads 0.00.
>
>Speedup at bigger depths is 2.0, though.

That is just a shortcoming of _your_ algorithm. I don't have that problem at
8 plies. And that is the danger of drawing conclusions about _all_ programs
when looking just at yours.

Just hold on a while... we will continue this once I have a distributed
version up. Then we can compare notes to see whether I get that 33% (or
higher) speedup or not.

>
>>It just takes a bit more
>>thought and work, but it isn't impossible. Now nails are being driven by
>>compressed air, EM impulse, blanks in a gun using the nail as the
>>projectile, etc. All probably thought "impossible" 40 years ago.
>>
>>>>
>>>>--
>>>>GCP
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.