Computer Chess Club Archives



Subject: Crap statement refuted about parallel speedup

Author: Vincent Diepeveen

Date: 14:19:54 09/18/01


On September 18, 2001 at 16:53:30, Robert Hyatt wrote:

>On September 18, 2001 at 14:45:45, Vincent Diepeveen wrote:

[snip]

>>At 8 ply i do not notice any speedup.
>>
>>But the speedup clearly slowly goes up when depth goes up.
>>
>>So the speedup goes slowly to 2.0 to > 2.0 at depth==15 and no modifications
>>in PV and no changes in root score.
>>
>>Speedup at 15 ply is actually 16:52 / 4:52 = 1012 / 292 = 3.47 speedup.
>>
>>Now this is an easy position to get a good speedup of course, so don't think
>>that my *average* speedup is 3.47.
>
>
>Neither do I.  You have a serious bug if you can produce 3.47 using two
>processors, on different positions.  It simply is not possible.

Now this is the biggest crap i ever heard. I have a good parallel
implementation, completely in contradiction with the rest of the world
(of course i most likely also put more hours into this project to
achieve that), and you are now claiming that this is not possible?

Let me list some things:
  a) your own move ordering sucks from a theoretical viewpoint
  b) mine sucks even more from a theoretical viewpoint
  c) DIEP needs fewer nodes than crafty at every ply
  d) DIEP is not recursive
  e) DIEP is a slower program than crafty, so parallel overhead
     shows up less in search time
  f) In diep i always print out the number of nodes for each search depth;
     in crafty you do not do this, so i can easily see that at bigger
     depths i need LESS nodes. In crafty you only see search times, so no
     one knows how much overhead crafty has in number of nodes.

How many nodes do you need to get to the same depths?

You know i only use nullmove; i do checks in qsearch; i do not use
futility or any other crap. No other forward pruning!

The chance that DIEP gets a cutoff at a position which is a cut node
is on average 0.85.

You measured crafty to have 0.90.

So both our move orderings suck.
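To make the 0.85 versus 0.90 figures concrete, here is a back-of-envelope toy model, entirely my own simplification and not code from DIEP or crafty: at a cut node the first move fails high with probability p; when it does not, assume on average half of the remaining moves get searched before the cutoff arrives.

```python
# Toy model of move-ordering quality at cut nodes.
# p = chance the first move produces the cutoff; w = assumed move count.
# The "half of the remaining moves" assumption is illustrative only.
def expected_moves_at_cut_node(p, w=35):
    return p * 1 + (1 - p) * (1 + (w - 1) / 2)

diep   = expected_moves_at_cut_node(0.85)  # 3.55 moves per cut node
crafty = expected_moves_at_cut_node(0.90)  # 2.70 moves per cut node
```

Even a 0.05 difference in first-move cutoff rate visibly changes the expected width of a cut node, and that compounds over the depth of the tree.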

Now imagine you search in parallel, and as you know my fliprate
is way less than 1%, whereas in crafty it is 5% or more.

So
  a) diep is, more than crafty, re-searching the same game space, which
     of course explains why i need less nodes for each ply

  b) that most likely means that if i split, the chance that DIEP does
     unnecessary work is smaller than it is for crafty. In fact it's a
     galaxy smaller!

  c) crafty has way more overhead because it is recursive

Now c) is very important here. Realize that where at a bigger iteration
each processor in DIEP is simply splitting less and less, in crafty
the splitting stays relatively bigger, as one processor needs to help
the other time after time with its search tree. So if we compare the
depths at which we on average split from the root, then we will see
very clearly an advantage for DIEP.

All this easily explains why my 0.85 improves to a better number when
i search deeper, and why with crafty this is less likely to be the case.

Just the *possibility* that this 0.85 improves is already the easy proof
that speedups of over 2.0 are what i get at analysis levels.
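The splitting discipline under discussion in this thread is the Young Brothers Wait criterion: search the eldest brother serially, and only offer the remaining brothers to idle processors if that first move did not already produce a cutoff. Here is a toy sequential sketch on a random tree that merely counts how many nodes would qualify as split points; it is my own illustration, not DIEP's or crafty's actual search:

```python
import random

# Toy Young-Brothers-Wait illustration (my own model, not engine code):
# a node becomes a split candidate only after its first move has been
# searched without producing a cutoff.
random.seed(1)
WIDTH = 5                                # assumed branching factor

nodes = 0
split_candidates = 0

def search(depth, alpha, beta):
    global nodes, split_candidates
    nodes += 1
    if depth == 0:
        return random.randint(-50, 50)   # fake leaf evaluation
    # eldest brother: always searched sequentially first
    best = -search(depth - 1, -beta, -alpha)
    if best >= beta:
        return best                      # first move cut off: never split here
    split_candidates += 1                # YBW satisfied: siblings could be split
    for _ in range(WIDTH - 1):
        score = -search(depth - 1, -beta, -max(alpha, best))
        if score > best:
            best = score
            if best >= beta:
                break
    return best

search(6, -1000, 1000)
```

Deeper iterations have proportionally more work hanging below each split candidate, which is one way to see why the relative splitting overhead shrinks as the search depth grows.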

To show it on the other position you tested: here i have a 2.8 speedup
at 2 processors:

00:01 0 0 81407 (0) 7 0.258 Bc8-f5 Qd4-d2 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 Ng3xe4 d5xe4
00:03 0 0 202108 (0) 8 0.174 Bc8-f5 Qd4-d2 Qd8-a5 Qd2-f2 O-O Be3-d4 Qa5-c7 Ra1-e1
00:19 0 0 1173384 (0) 9 0.317 Bc8-f5 Qd4-d1 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 f4-f5 d5-d4 Ng3xe4 d4xe3
00:28 0 0 1767448 (0) 10 0.155 Bc8-f5 Qd4-d1 O-O Nc3-e2 Qd8-d7 Ne2-d4 Bf5-e4 f4-f5 c6-c5 e5-e6 f7xe6 f5xe6
01:35 0 0 6079251 (0) 11 0.237 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 Ne2-g3 Rf8-e8 Qd1-d2 Qd8-c7 Ra1-d1 Re8-d8
03:26 0 0 13042725 (0) 12 0.070 Bc8-f5 Qd4-d1 Qd8-a5 Nc3-e2 Bf5-e4 Ne2-d4 O-O Qd1-h5 Qa5-b4 Rf1-b1 Be7-c5 c2-c3
++ b7-b6
24:22 0 0 84331794 (0) 12 0.079 b7-b6 Nc3-e2 Bc8-a6 Qd4-d2 O-O Rf1-d1 c6-c5 Be3-f2 Qd8-d7 Ne2-g3 d5-d4 c2-c3
48:18 0 0 169226934 (0) 13 0.131 b7-b6 Qd4-d3 a7-a5 Nc3-e2 Bc8-a6 Qd3-d1 O-O Rf1-e1 f7-f6 Ne2-d4 Qd8-c8 e5-e6 Be7-d6
++ c8-f5
50:54 0 0 178647431 (0) 13 0.194 Bc8-f5 Qd4-d1 O-O Nc3-e2 Qd8-d7 Qd1-d2 Bf5-e4 Ne2-g3 Rf8-d8 f4-f5 Qd7-c8 Be3-d4
01:06:07 0 0 232264385 (0) 14 0.075 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-g4 Qd1-d3 f7-f6 h2-h3 Bg4-d7 a4-a5 f6xe5 f4xe5 Qd8-c7 a5-a6

00:01 0 0 83514 (7267) 7 0.258 Bc8-f5 Qd4-d2 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 Ng3xe4 d5xe4
00:02 0 0 207403 (63479) 8 0.174 Bc8-f5 Qd4-d2 Qd8-a5 Qd2-f2 O-O Be3-d4 Qa5-c7 Ra1-e1
00:11 0 0 1019046 (236794) 9 0.317 Bc8-f5 Qd4-d1 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 f4-f5 d5-d4 Ng3xe4 d4xe3
00:17 0 0 1750547 (263775) 10 0.254 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 c2-c3 c6-c5 Ne2-g3 Qd8-b6 Qd1-e2
00:41 0 0 4840190 (582946) 11 0.237 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 Ne2-g3 Rf8-e8 Qd1-d2 Qd8-c7 Ra1-d1 Re8-d8
02:18 0 0 16340868 (3553884) 12 0.070 Bc8-f5 Qd4-d1 Qd8-a5 Nc3-e2 Bf5-e4 Ne2-d4 O-O Qd1-h5 Qa5-b4 Rf1-b1 Be7-c5 c2-c3
++ b7-b6
10:41 0 0 72051608 (11640625) 12 0.079 b7-b6 Qd4-d2 O-O Nc3-e2 Bc8-a6 Rf1-d1 c6-c5 Be3-f2 Qd8-d7 Ne2-g3 d5-d4 c2-c3
quit
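The timestamps in logs like the ones above can be compared with a few lines. A small helper (the function name is my own, just to check the arithmetic quoted earlier in the thread, 16:52 against 4:52):

```python
def to_seconds(stamp):
    """Convert an 'h:mm:ss' or 'mm:ss' log timestamp to seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total

# the 15-ply example quoted earlier: 16:52 / 4:52 = 1012 / 292
speedup = to_seconds("16:52") / to_seconds("4:52")   # ~3.47
```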


>IE if you can go 4x faster with 2 processors, you can go 16 x faster with
>4 processors.  That I don't believe as I have _never_ seen anyone else report
>such a number and not later find serious bugs.

I do not have parallel bugs, i am very sure of that. A good proof is
that i nearly always get the same score and mainline in both parallel
and single-cpu searches. How's crafty doing here?

I turned off the dangerous extensions in this test too, so that
dubious things which are alpha-beta dependent did not influence the test
and so that i would search more within the same search space. When i turn
them on, just like crafty, the two processors search less within the same
space, and my speedup hardly gets over 1.8 then.

Note that you use things like futility pruning bigtime in crafty.
It is very easy to show that this is dubious, but it obviously means
that you get deeper quicker initially. Perhaps at bigger depths it is
payback time!
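For readers unfamiliar with the term: textbook futility pruning skips a quiet move at a frontier node when the static evaluation plus a safety margin cannot reach alpha. A generic sketch follows; the margin value and exact conditions are my assumptions, and this is not crafty's actual code:

```python
FUTILITY_MARGIN = 100  # assumed margin in centipawns; engines differ

def is_futile(static_eval, alpha, depth, quiet_move):
    """True if a quiet move at a frontier node (depth == 1) can be
    skipped because even eval + margin cannot raise alpha."""
    return depth == 1 and quiet_move and static_eval + FUTILITY_MARGIN <= alpha
```

The trade-off described above is exactly this: such pruning buys depth early, at the price of discarding lines that may matter at bigger depths.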

The logic that i would then be 16 times faster on 4 processors is of
course not true.

At 4 processors my overhead is way, way bigger than at 2 processors
to start with.

Also, i limit the number of processors at each node to at most 2.
I want to cut off quicker without risking that 75% of my cpu's do
superfluous work. As a result of that limit of 2, it is not possible
to split as ideally at 4 processors as it is with 2.

Last but not least, a factor 2.8 or 3.5 at 2 processors is a very good
speedup, but that's nowhere near the magnitude of a speedup of 16 at 4
processors.

>>
>>Note that the above diep version i used is a version *without*
>>dangerous extensions.
>>
>>>In each of the two positions tested above, I ran the thing with one
>>>processor and then with 4 processors.  Everything else
>>>remained exactly the same.  The first position is kopec 22,
>>>the second is kopec23.  Chosen at random for no good reason.
>>
>>Easy move ordering in the first position and score is not
>>slowly going down so i'm not so sure whether
>>it's smart to use this own, but here the results from DIEP speak
>>for itself. Also at one of the positions you tried you get a speedup increase.
>>
>>Now i'll be the last to say that the worst case speedup from crafty at 4
>>processors is very good. But definitely there is evidence for you to
>>keep testing also but not with APHID :)
>>
>>>I can do perfectly fine with very short searches.  Or very long searches.
>>
>>I find 2.75 at short search versus 4.0 at longer search very clearly
>>not what you conclude.
>
>
>basically 3x vs 4x.  I have some start-up costs for single-position searches
>that makes this look worse than it really is.  I start the 3 extra threads
>_after_ the search clock is started, so that I get "billed" for creating three
>new processes.  Once they are created, they will last for the whole game, but
>in this case, the whole game is one move.  The cost hurts.  I could have run a
>dummy problem first to get them started, but it wasn't worth the trouble.
>
>
>
>
>
>>
>>>But remember that you and I are not splitting the tree in the same way,
>>>because I can split at the root when it is reasonable to do so.  that
>>>helps the speedup quite a bit.
>>
>>It doesn't work for me and i find it very dangerous. Here in this position
>>the ply after you have found Bxe4 this is not a big problem of course.
>
>Splitting at the root is not dangerous.  In fact, it is the _optimal_ place
>to split after you have searched the first move.  You get _zero_ search
>overhead due to bad move ordering.  Make that 0.00...
>
>>
>>>
>>>>
>>>>I don't want to crack down results from the APHID team back in 1996,
>>>>but we must clearly realize that what worked back then is NOT going to
>>>>work in 2001 anymore!
>>>
>>>Again, would you care to make a small wager?  They got about 33% efficiency
>>>on their hardware.  I think that can be improved on with some more work.  They
>>>only wanted to get it to work, not work very efficiently.  And I think that
>>>50% efficiency is doable.  And on my giganet hardware I think that well beyond
>>>50% is very doable.
>>
>>33% with 8 ply searches in 1996 with a program from 2500 nodes a second at
>>16 expensive workstations, i find that very poor!
>>
>>Even if you throw over the ball by hand you can get a better speedup at
>>8 ply with that a slow speed!
>>
>>I got single cpu end 1996 at a pentiumpro200 already 5000-10000 nodes
>>a second, and i'm pretty sure that DIEP back in 1996 also was not a very
>>fast program, they needed 16 very expensive
>>workstations to get 2500 nodes a second in total!
>
>I haven't looked at their data.  But I have worked with parallel applications
>for a long time.  And I claim to understand the ins and outs of alpha/beta
>search and parallel issues pretty well.  I don't see a _thing_ that would
>suggest that their speedup won't hold on today's hardware.  I have had a
>couple of students do parallel othello programs on a distributed cluster.  They
>did pretty well...
>
>
>
>>
>>If i would slow down diep factor 100 or something, then i'm sure that i
>>can even get a speedup over the internet with it at 16 nodes better than
>>what these guys showed!
>
>No it isn't.  They used whatever hardware they had at the time.  If you
>ramp up the CPU speed, _and_ the network speed, things tend to scale
>uniformly if the two speeds increase uniformly.
>
>
>
>
>>
>>>
>>>>
>>>>>>This refers to the fact that YBW for each node needs a result for the
>>>>>>first move first before splitting other moves. Now this is of course very
>>>>>>true, but this directly also shows that APHIDs branching factor is by
>>>>>>definition bigger than algorithms which use YBW.
>>>>>
>>>>>I don't think they make any claims that their branching factor is better
>>>>>than YBW. Would seem pretty silly to me. But then again, YBW probably
>>>>>gets killed in different ways when having to deal with slow interprocess
>>>>>comms on a cluster.
>>>
>>>YBW doesn't exclude clusters.  In a tree search, the number of "split"
>>>operations has to be constrained so that the overhead of splitting doesn't
>>>dominate the work done in actually searching the tree.  I control this in
>>>Crafty by limiting how near the tips of the branches I can split the tree.
>>
>>Yes that's true, but you might underestimate something very important
>>here. You use nullmove bigtime in crafty. That means that missing
>>transpositiontable lookups from other processes last few plies means that
>>you definitely have a problem.
>
>
>It isn't _that_ big of a problem. I ran that test for you a long while back
>about not hashing in the last N plies of the search.  It didn't kill me nearly
>as badly as you were saying it would.  And in some cases it probably makes
>sense, in fact.  You already do that if you store "depth-preferred" as the
>last few plies can't overwrite all those deeper plies.
>
>
>
>>
>>Also another small thing is that each processor has a local hashtable;
>>Rainer Feldmann said: "let's see all the RAM of each node as one big
>>distributed transpositiontable".
>>
>>Now that means that his hashtable was n times bigger in his approach
>>than might be the case in your current plans!
>
>I plan on doing shared hash too.  At least at early plies in the tree.
>Near the tips I might do local-only, I don't know.
>
>>
>>>For a distributed approach, I intend on a second bound, which
>>>limits distributed
>>>splits even more severely, because the overhead is going to be higher in net
>>>communication/latency.
>>
>>In that case you cannot distribute hashtable even. Also the branching factor
>>is going to be huge and the time needed to split a subtree is going to be
>>seconds.
>
>I can split a tree with one packet.  It won't take seconds.
>
>
>
>>
>>To avoid loads of idle time this would mean that a search is going to need
>>tens of minutes otherwise the relation between idling and searching is
>>going to be bigtime in the advantage of the idling.
>
>That is nonsense.  If I can guarantee that a piece of work will take at least
>one second, the overhead will be minimal.  Because I can send a big packet in
>1/100th of a second using simple 100mbit ethernet.
>
>
>
>>
>>>I think the difference between myself and Vincent is that I am in the business
>>>of trying to find out _how_ to accomplish something, no matter how difficult it
>>>might seem at first glance.  I don't assume everything is impossible if I
>>>haven't done it myself.  I assume everything is possible, until I prove that it
>>>can't be done by trying every possible alternative I can think of.  IE I would
>>>rather drive a nail using a hammer.  But I have done it using pliers, a wrench,
>>>a rock, a screwdriver handle, and even a C-clamp.
>>
>>Well i'm from europe and i we're more careful here,
>>i already have done huge calculations and many network
>>experiments on whether it was possible
>>  a) to get diep working over the network (100mbits) with a very limited
>>     number of nodes (n=2)
>>  b) possibility to not have a completely shared hashtable
>>  c) distribution over the net
>>
>>Of course c) i didn't even start experimenting with because the 100mbits
>>already proved not enough for nowadays search depths. Also the big bummers
>>for distributed DIEP was the big system time wasted to network communication.
>>
>>On paper it is little, in reality this slows down a program X times.
>>
>>If you don't slow down the problem is branching factor as i clearly
>>can show here. My speedup is 2.0 or more at bigger depths.
>>
>>My speedup sucks at 5-8 ply. At 8 ply in fact the speedup here reads 0.00
>>
>>Speedup at bigger depths 2.0 though.
>
>That is just a shortcoming of _your_ algorithm.  I don't have that problem
>at 8 plies.
>
>And that is a danger of drawing conclusions about _all_ programs when looking
>just at yours.  Just hold on a while... we will continue this once I have a
>distributed version up.  Then we can compare notes to see whether I get that
>33% (or higher) speed up or not.
>
>>
>>>It just takes a bit more
>>>thought and work, but it isn't impossible.  Now nails are being driven by
>>>compressed air, EM impulse, blanks in a gun using the nail as the projectile,
>>>etc.  All probably thought "impossible" 40 years ago.
>>>
>>>
>>>
>>>>>
>>>>>--
>>>>>GCP




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.