Computer Chess Club Archives



Subject: Re: Crap statement refuted about parallel speedup

Author: Vincent Diepeveen

Date: 17:36:17 09/18/01



On September 18, 2001 at 18:27:01, Robert Hyatt wrote:

>On September 18, 2001 at 17:19:54, Vincent Diepeveen wrote:
>
>>On September 18, 2001 at 16:53:30, Robert Hyatt wrote:
>>
>>>On September 18, 2001 at 14:45:45, Vincent Diepeveen wrote:
>>
>>[snip]
>>
>>>>At 8 ply i do not notice any speedup.
>>>>
>>>>But the speedup clearly slowly goes up when depth goes up.
>>>>
>>>>So the speedup goes slowly to 2.0 to > 2.0 at depth==15 and no modifications
>>>>in PV and no changes in root score.
>>>>
>>>>Speedup at 15 ply is actually 16:52 / 4:52 = 1012 / 292 = 3.47 speedup.
>>>>
>>>>Now this is an easy position to get a good speedup of course, so don't think
>>>>that my *average* speedup is 3.47.
>>>
>>>
>>>Neither do I.  You have a serious bug if you can produce 3.47 using two
>>>processors, on different positions.  It simply is not possible.
>>
Now this is the biggest crap I ever heard. I have a good
parallel implementation, completely in contradiction with the rest of
the world; most likely I also put more hours into this project
to achieve that. And you are now claiming that this is not possible?
>
>
>That is correct.  If you can produce a speedup of > 2x, using 2 cpus, you

You're not reading what I write, and your example sucks: the effect
only shows up at, say, 15 minutes a move. Running 2 processes on
1 CPU costs too much overhead, so that way I only get a speedup after
an hour, and I never get that much time per move in a game.

Of course I tested this bigtime!

If you read a bit more carefully what I mentioned, it's clear that
no program's move ordering is very good, in contradiction to some
crap statements in RGCC a few years ago, when hardly anyone except some
commercial guys had measured their cutoff rates and their flip rates.
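To be concrete about what a cutoff rate is: it is the fraction of fail-high nodes where the very first move tried already produces the beta cutoff. A toy instrumented alpha-beta on a random tree (just a sketch to show the measurement, not DIEP's or Crafty's actual code) looks like this:

```python
import random

random.seed(1)

cut_nodes = 0        # nodes that failed high (produced a beta cutoff)
first_move_cuts = 0  # fail-highs caused by the first move tried

def alphabeta(depth, alpha, beta):
    """Toy negamax alpha-beta on a random tree, counting cutoff statistics."""
    global cut_nodes, first_move_cuts
    if depth == 0:
        return random.randint(-100, 100)
    for i in range(4):                       # 4 moves per node
        score = -alphabeta(depth - 1, -beta, -alpha)
        if score > alpha:
            alpha = score
        if alpha >= beta:                    # fail high: beta cutoff
            cut_nodes += 1
            if i == 0:
                first_move_cuts += 1
            return alpha
    return alpha

alphabeta(6, -1000, 1000)
print("cutoff rate: %.2f" % (first_move_cuts / cut_nodes))
```

With no move ordering at all (as here) the rate is poor; the 0.85 and 0.90-0.92 numbers discussed below are what real engines measure with full ordering.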

>have a bug, or an anomaly position.  It is simply not possible.  If you
>are _really_ doing that, you should modify your single-cpu version as
>follows:

Basically you are also saying here that Fritz has a bug, because Fritz
behaves the same as DIEP here; the only difference is that Fritz's
overhead is so big that it takes something like half a day before a
parallel search gives more than a 2.0 speedup.

Ah DANG, more programs have it.

I bet you didn't know that!!

Everyone who has spent a lot of time parallelizing a program
in a proper way sees this effect.

You wrote your parallelism in Crafty within 2 or 3 months, plus of
course some bugfixing, and that probably during the busiest
part of your year.

Others had major problems with parallelism and simply didn't manage,
and I can imagine why. Getting parallelism to work well is a disaster!

Some lookalikes of the French algorithms are very popular now. Of course
2.0 won't happen with those algorithms. Not even 1.7 or 1.8.

Other programs which extend a lot (so the processors end up in different
search spaces) report a far smaller speedup than others using the same method!

It all comes down to a very simple conclusion which you don't seem to
want to draw: "because our move ordering is near perfect anyway,"
to quote a popular line of yours from some 4 years ago!

That assumption is however INCORRECT. Better move ordering can still give
A LOT smaller search trees!

Better eval ==> more move ordering needed
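To illustrate the claim with a toy example (a hypothetical sketch using an oracle for ordering, not anyone's engine code): searching the same random tree with perfect move ordering versus no ordering shows how much ordering shrinks an alpha-beta tree.

```python
import random

random.seed(2)
BRANCH, DEPTH = 4, 5

def make_tree(depth):
    # a fixed random game tree: leaves are scores, internal nodes are child lists
    if depth == 0:
        return random.randint(-100, 100)
    return [make_tree(depth - 1) for _ in range(BRANCH)]

def minimax(node):
    # true negamax value; used only as a perfect move-ordering oracle
    if isinstance(node, int):
        return node
    return max(-minimax(c) for c in node)

nodes = 0

def alphabeta(node, alpha, beta, ordered):
    global nodes
    nodes += 1
    if isinstance(node, int):
        return node
    # lowest child value first = best reply for the side to move first
    children = sorted(node, key=minimax) if ordered else node
    for c in children:
        alpha = max(alpha, -alphabeta(c, -beta, -alpha, ordered))
        if alpha >= beta:        # beta cutoff
            break
    return alpha

tree = make_tree(DEPTH)
results = {}
for ordered in (False, True):
    nodes = 0
    value = alphabeta(tree, -1000, 1000, ordered)
    results[ordered] = (value, nodes)
print("unordered:", results[False], "ordered:", results[True])
```

Both searches return the same root value, but with perfect ordering alpha-beta visits roughly b^(d/2) leaves instead of b^d, which is exactly why a cutoff rate of 0.85 versus 0.92 still leaves a lot of tree on the table.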

>search normally until you would choose to use a second process if you had one.
>From this point forward, time-slice between two search threads.  Search one
>node on one, one on the other.  That should run twice as fast, since it is
>_exactly_ how the program really does on the SMP platform.

As you can see, the effect only appears after, say, 15 minutes, and I
do not have 15 minutes per move.

Faster hardware will not reduce the problem either.

Note that in a game I usually run with the stupid extensions on too, and
then my speedup is around 1.8 where it is 2.0 or more now. With those
extensions I hardly get above 1.8, because the search spaces of the
two processors do not overlap: my speedup is 2.0 minus a slowdown of
0.2, which leaves a 1.8 speedup.
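Spelling out the arithmetic, using the 15-ply times quoted earlier in the thread (16:52 on one CPU against 4:52 on two):

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' search time to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

# 15-ply search from earlier in the thread: 16:52 single cpu vs 4:52 dual
print(round(to_seconds("16:52") / to_seconds("4:52"), 2))  # 1012 / 292 -> 3.47

# with the extensions on: ideal 2.0 minus the ~0.2 overlap loss
print(2.0 - 0.2)  # -> 1.8
```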

>And that will _not_ work unless you have either the world's worst sequential
>search algorithm, or the world's buggiest parallel search algorithm.

How on earth can both Fritz and DIEP have worse search algorithms than yours?

>Before you label a statement as ridiculous, find me _one_ person that will
>side with you and say that "yes, it is possible to get a > 2 speedup using
>only two cpus on anything but anomaly positions.  Find _one_ person that will
>agree...
>
>
>
>
>
>>
>>Let me list some things:
>>  a) your own move ordering sucks from theoretical viewpoint
>>  b) mine sucks even more from theoretical viewpoint
>>  c) DIEP needs every ply less nodes than crafty
>>  d) DIEP is not recursive
>
>
>All interesting observations.  But totally irrelevant for the argument
>at hand, namely producing consistent super-linear speedups.
>
>
>
>>  e) DIEP is a slower program than crafty so parallel overhead
>>     is less expressed in searchtime
>>  f) In diep i always print out the number of nodes for each searchdepth
>>     in crafty you do not do this, so i can easily see that at bigger
>>     depths i need LESS nodes. You only see search times. No one knows how
>>     much overhead is in number of nodes for crafty though.
>
>
>my NPS is constant, so displaying the search time or node count is
>providing identical information.
>
>
>
>>
>>How many nodes do you get to get to the same depths?
>
>
>_always_ more nodes using 2 cpus.  If you don't you have a bug, except that
>in rare positions you can get fewer.  Kopec 1 is a good example.  The checking
>move is a very short mate.  Other moves are not mates.  If you search the
>non-checks first, it takes more nodes.  If you search the check first, it takes
>less.  The parallel search might search the check sooner than the sequential
>search and find the mate (and adjust alpha/beta bounds) much quicker to shrink
>the tree.  But that is an _exception_.  For every one of those, there are 100
>that don't behave that way.
>
>
>
>>
>>You know i only use nulllmove, i do checks in qsearch, i do not use
>>futility or any other crap. No other forward pruning!
>
>
>so what, again?  You just told me that it takes you two hours to dig a
>hole.  And that if I give you two people, you can dig it in less than one
>hour.  That is an amazing thing that violates the conservation of energy in
>physics.
>
>Twice the power can only do twice the work, not 3x or 4x.  _period_.
>
>
>>
>>The chance that at a position which is a cut node, that DIEP
>>gets a cutoff there that chance is on average 0.85
>>
>>You  measured crafty to have 0.90
>>
>>So our move orderings suck.
>
>
>92 is my number.  that does not 'suck'.  Since 100% would be impossible to
>do.
>
>
>
>
>>
>>Now imagine you search parallel and as you know my fliprate
>>is way less than 1%, in crafty it is 5% or more.
>>
>>So
>>  a) diep is more than crafty researching the same gamespace, explaining
>>     of course why i need less nodes for each ply
>>
>>  b) that means most likely that if i split that the chance that if i split
>>     that the chance that DIEP does unnecessary work is smaller than
>>     crafty. In fact it's a galaxy smaller!
>>
>>  c) crafty has way more overhead because it is recursive
>
>
>What does this mean?  I have done searches _both_ ways.  Cray Blitz was
>iterative, crafty is recursive.  Overhead is the same.  The only problem is
>that in Crafty, I can split at a "current ply" or not at all.  In cray blitz
>I could split at a current ply or any ply before those.  The performance
>difference is not that different.  And Cray Blitz didn't produce speedups of
>greater than 4 with 4 cpus...
>
>
>
>
>
>>
>>Now c) is very important here to realize that where at a bigger iteration
>>each processor in DIEP is simply splitting less and less, in crafty
>>this is relatively bigger as it needs to help the other processor time
>>after time with its search tree. So if we compare depths at where
>>we on average split from the root, then we will see very clearly an
>>advantage for DIEP.
>>
>
>
>Nope, because I split at the root also.  I have seen searches to 15
>plies that did exactly 15 splits and no more until right at the end
>when they try to help each other finish the last few branches on the last
>move searched.
>
>
>
>>All this explains easily why my 0.85 improves to a better number when
>>i search deeper and why with crafty this is less likely to be the case.
>>
>>Just the *possibility* that this 0.85 improves already is the easy proof
>>that speedups of over 2.0 is what i get at analysis levels.
>
>
>This is simply nonsense.  speedups of > 2.0 are simply bugs that are waiting
>to be found.  If they are consistent, even worse.
>
>
>
>
>>
>>To show the other position you tested, here i have a 2.8 speedup at 2
>>processors:
>>
>
>
>that only shows that you have bugs galore, I am afraid.  Again, find anyone
>that will support your theory about > 2.00 speedups using only 2 cpus.  If
>that is the case, why can't you get > 1.00 speedup using _one_???
>
>
>
>
>
>>00:01 0 0 81407 (0) 7 0.258 Bc8-f5 Qd4-d2 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 Ng3xe4 d5xe4
>>00:03 0 0 202108 (0) 8 0.174 Bc8-f5 Qd4-d2 Qd8-a5 Qd2-f2 O-O Be3-d4 Qa5-c7 Ra1-e1
>>00:19 0 0 1173384 (0) 9 0.317 Bc8-f5 Qd4-d1 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 f4-f5 d5-d4 Ng3xe4 d4xe3
>>00:28 0 0 1767448 (0) 10 0.155 Bc8-f5 Qd4-d1 O-O Nc3-e2 Qd8-d7 Ne2-d4 Bf5-e4 f4-f5 c6-c5 e5-e6 f7xe6 f5xe6
>>01:35 0 0 6079251 (0) 11 0.237 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 Ne2-g3 Rf8-e8 Qd1-d2 Qd8-c7 Ra1-d1 Re8-d8
>>03:26 0 0 13042725 (0) 12 0.070 Bc8-f5 Qd4-d1 Qd8-a5 Nc3-e2 Bf5-e4 Ne2-d4 O-O Qd1-h5 Qa5-b4 Rf1-b1 Be7-c5 c2-c3
>>++ b7-b6
>>24:22 0 0 84331794 (0) 12 0.079 b7-b6 Nc3-e2 Bc8-a6 Qd4-d2 O-O Rf1-d1 c6-c5 Be3-f2 Qd8-d7 Ne2-g3 d5-d4 c2-c3
>>48:18 0 0 169226934 (0) 13 0.131 b7-b6 Qd4-d3 a7-a5 Nc3-e2 Bc8-a6 Qd3-d1 O-O Rf1-e1 f7-f6 Ne2-d4 Qd8-c8 e5-e6 Be7-d6
>>++ c8-f5
>>50:54 0 0 178647431 (0) 13 0.194 Bc8-f5 Qd4-d1 O-O Nc3-e2 Qd8-d7 Qd1-d2 Bf5-e4 Ne2-g3 Rf8-d8 f4-f5 Qd7-c8 Be3-d4
>>01:06:07 0 0 232264385 (0) 14 0.075 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-g4 Qd1-d3 f7-f6 h2-h3 Bg4-d7 a4-a5 f6xe5 f4xe5 Qd8-c7 a5-a6
>>
>>00:01 0 0 83514 (7267) 7 0.258 Bc8-f5 Qd4-d2 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 Ng3xe4 d5xe4
>>00:02 0 0 207403 (63479) 8 0.174 Bc8-f5 Qd4-d2 Qd8-a5 Qd2-f2 O-O Be3-d4 Qa5-c7 Ra1-e1
>>00:11 0 0 1019046 (236794) 9 0.317 Bc8-f5 Qd4-d1 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 f4-f5 d5-d4 Ng3xe4 d4xe3
>>00:17 0 0 1750547 (263775) 10 0.254 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 c2-c3 c6-c5 Ne2-g3 Qd8-b6 Qd1-e2
>>00:41 0 0 4840190 (582946) 11 0.237 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 Ne2-g3 Rf8-e8 Qd1-d2 Qd8-c7 Ra1-d1 Re8-d8
>>02:18 0 0 16340868 (3553884) 12 0.070 Bc8-f5 Qd4-d1 Qd8-a5 Nc3-e2 Bf5-e4 Ne2-d4 O-O Qd1-h5 Qa5-b4 Rf1-b1 Be7-c5 c2-c3
>>++ b7-b6
>>10:41 0 0 72051608 (11640625) 12 0.079 b7-b6 Qd4-d2 O-O Nc3-e2 Bc8-a6 Rf1-d1 c6-c5 Be3-f2 Qd8-d7 Ne2-g3 d5-d4 c2-c3
>>quit
>>
>>
>>>IE if you can go 4x faster with 2 processors, you can go 16 x faster with
>>>4 processors.  That I don't believe as I have _never_ seen anyone else report
>>>such a number and not later find serious bugs.
>>
>>I do not have parallel bugs i am very sure of that. A good proof is
>>that i nearly always get the same score and mainline in both parallel
>>and single cpu searches. How's crafty here?
>
>That doesn't say a thing about bugs.  Crafty reproduces most of the time,
>but not _all_ of the time.  Which theory predicts, by the way.
>
>
>
>
>>
>>I turned off the dangerous extensions in this test too, so that
>>dubious things which are alfabeta dependant did not influence the test
>>and that i would search more in the same search space. When i turn them
>>on just like crafty the both processors search less in the same space
>>and my speedup hardly gets over 1.8 then.
>>
>>Note that you use things like futility pruning bigtime in crafty.
>>This is very easily shown dubious, but it obviously means that you
>>get deeper quicker initially. Perhaps at bigger depths it is payback
>>time!
>
>
>I only do futility pruning in the q-search.  I did it exactly the same way
>in Cray Blitz also (tossing losing SEE captures).
>
>
>
>
>>
>>The logica that i would be 16 times faster on 4 processors is not true
>>of course.
>>
>>At 4 processors my overhead is way way bigger than at 2
>>processors to start with.
>>
>
>
>You can't have it both ways.  4.0 x with 2 has to translate to 16x with 4,
>otherwise we are not in base 10 math.
>
>
>
>
>
>>Also i limit the number of processors at each node
>>to at most to 2. I want to cutoff quicker without risking that 75% of
>>my cpu's do superfluous work.
>>So then as a result of that 2 limitation is that it is not possible
>>to ideally split at 4 processors like it is possible with 2.
>>
>>Last but least, a factor 2.8 or 3.5 at 2 processors is a very good
>>speedup, but that's not near the magnitude of speedup of 16 at 4
>>processors.
>
>3.5 at 2 is 14 at 4.  That is still an impossible number...
>
>You can wave your hands all you want, but 1+1 is _still_ <= 2, never
>greater.  To think otherwise simply defies logic and common sense.
>
>
>
>
>
>
>>
>>>>
>>>>Note that the above diep version i used is a version *without*
>>>>dangerous extensions.
>>>>
>>>>>In each of the two positions tested above, I ran the thing with one
>>>>>processor and then with 4 processors.  Everything else
>>>>>remained exactly the same.  The first position is kopec 22,
>>>>>the second is kopec23.  Chosen at random for no good reason.
>>>>
>>>>Easy move ordering in the first position and score is not
>>>>slowly going down so i'm not so sure whether
>>>>it's smart to use this own, but here the results from DIEP speak
>>>>for itself. Also at one of the positions you tried you get a speedup increase.
>>>>
>>>>Now i'll be the last to say that the worst case speedup from crafty at 4
>>>>processors is very good. But definitely there is evidence for you to
>>>>keep testing also but not with APHID :)
>>>>
>>>>>I can do perfectly fine with very short searches.  Or very long searches.
>>>>
>>>>I find 2.75 at short search versus 4.0 at longer search very clearly
>>>>not what you conclude.
>>>
>>>
>>>basically 3x vs 4x.  I have some start-up costs for single-position searches
>>>that makes this look worse than it really is.  I start the 3 extra threads
>>>_after_ the search clock is started, so that I get "billed" for creating three
>>>new processes.  Once they are created, they will last for the whole game, but
>>>in this case, the whole game is one move.  The cost hurts.  I could have run a
>>>dummy problem first to get them started, but it wasn't worth the trouble.
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>>But remember that you and I are not splitting the tree in the same way,
>>>>>because I can split at the root when it is reasonable to do so.  that
>>>>>helps the speedup quite a bit.
>>>>
>>>>It doesn't work for me and i find it very dangerous. Here in this position
>>>>the ply after you have found Bxe4 this is not a big problem of course.
>>>
>>>Splitting at the root is not dangerous.  In fact, it is the _optimal_ place
>>>to split after you have searched the first move.  You get _zero_ search
>>>overhead due to bad move ordering.  Make that 0.00...
>>>
>>>>
>>>>>
>>>>>>
>>>>>>I don't want to crack down results from the APHID team back in 1996,
>>>>>>but we must clearly realize that what worked back then is NOT going to
>>>>>>work in 2001 anymore!
>>>>>
>>>>>Again, would you care to make a small wager?  They got about 33% efficiency
>>>>>on their hardware.  I think that can be improved on with some more work.  They
>>>>>only wanted to get it to work, not work very efficiently.  And I think that
>>>>>50% efficiency is doable.  And on my giganet hardware I think that well beyond
>>>>>50% is very doable.
>>>>
>>>>33% with 8 ply searches in 1996 with a program from 2500 nodes a second at
>>>>16 expensive workstations, i find that very poor!
>>>>
>>>>Even if you throw over the ball by hand you can get a better speedup at
>>>>8 ply with that a slow speed!
>>>>
>>>>I got single cpu end 1996 at a pentiumpro200 already 5000-10000 nodes
>>>>a second, and i'm pretty sure that DIEP back in 1996 also was not a very
>>>>fast program, they needed 16 very expensive
>>>>workstations to get 2500 nodes a second in total!
>>>
>>>I haven't looked at their data.  But I have worked with parallel applications
>>>for a long time.  And I claim to understand the ins and outs of alpha/beta
>>>search and parallel issues pretty well.  I don't see a _thing_ that would
>>>suggest that their speedup won't hold on today's hardware.  I have had a
>>>couple of students do parallel othello programs on a distributed cluster.  They
>>>did pretty well...
>>>
>>>
>>>
>>>>
>>>>If i would slow down diep factor 100 or something, then i'm sure that i
>>>>can even get a speedup over the internet with it at 16 nodes better than
>>>>what these guys showed!
>>>
>>>No it isn't.  They used whatever hardware they had at the time.  If you
>>>ramp up the CPU speed, _and_ the network speed, things tend to scale
>>>uniformly if the two speeds increase uniformly.
>>>
>>>
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>>This refers to the fact that YBW for each node needs a result for the
>>>>>>>>first move first before splitting other moves. Now this is of course very
>>>>>>>>true, but this directly also shows that APHIDs branching factor is by
>>>>>>>>definition bigger than algorithms which use YBW.
>>>>>>>
>>>>>>>I don't think they make any claims that their branching factor is better
>>>>>>>than YBW. Would seem pretty silly to me. But then again, YBW probably
>>>>>>>gets killed in different ways when having to deal with slow interprocess
>>>>>>>comms on a cluster.
>>>>>
>>>>>YBW doesn't exclude clusters.  In a tree search, the number of "split"
>>>>>operations has to be constrained so that the overhead of splitting doesn't
>>>>>dominate the work done in actually searching the tree.  I control this in
>>>>>Crafty by limiting how near the tips of the branches I can split the tree.
>>>>
>>>>Yes that's true, but you might underestimate something very important
>>>>here. You use nullmove bigtime in crafty. That means that missing
>>>>transpositiontable lookups from other processes last few plies means that
>>>>you definitely have a problem.
>>>
>>>
>>>It isn't _that_ big of a problem. I ran that test for you a long while back
>>>about not hashing in the last N plies of the search.  It didn't kill me nearly
>>>as badly as you were saying it would.  And in some cases it probably makes
>>>sense, in fact.  You already do that if you store "depth-preferred" as the
>>>last few plies can't overwrite all those deeper plies.
>>>
>>>
>>>
>>>>
>>>>Also another small thing is that each processor has a local hashtable;
>>>>Rainer Feldmann said: "let's see all the RAM of each node as one big
>>>>distributed transpositiontable".
>>>>
>>>>Now that means that his hashtable was n times bigger in his approach
>>>>than might be the case in your current plans!
>>>
>>>I plan on doing shared hash too.  At least at early plies in the tree.
>>>Near the tips I might do local-only, I don't know.
>>>
>>>>
>>>>>For a distributed approach, I intend on a second bound, which
>>>>>limits distributed
>>>>>splits even more severely, because the overhead is going to be higher in net
>>>>>communication/latency.
>>>>
>>>>In that case you cannot distribute hashtable even. Also the branching factor
>>>>is going to be huge and the time needed to split a subtree is going to be
>>>>seconds.
>>>
>>>I can split a tree with one packet.  It won't take seconds.
>>>
>>>
>>>
>>>>
>>>>To avoid loads of idle time this would mean that a search is going to need
>>>>tens of minutes otherwise the relation between idling and searching is
>>>>going to be bigtime in the advantage of the idling.
>>>
>>>That is nonsense.  If I can guarantee that a piece of work will take at least
>>>one second, the overhead will be minimal.  Because I can send a big packet in
>>>1/100th of a second using simple 100mbit ethernet.
>>>
>>>
>>>
>>>>
>>>>>I think the difference between myself and Vincent is that I am in the business
>>>>>of trying to find out _how_ to accomplish something, no matter how difficult it
>>>>>might seem at first glance.  I don't assume everything is impossible if I
>>>>>haven't done it myself.  I assume everything is possible, until I prove that it
>>>>>can't be done by trying every possible alternative I can think of.  IE I would
>>>>>rather drive a nail using a hammer.  But I have done it using pliers, a wrench,
>>>>>a rock, a screwdriver handle, and even a C-clamp.
>>>>
>>>>Well i'm from europe and i we're more careful here,
>>>>i already have done huge calculations and many network
>>>>experiments on whether it was possible
>>>>  a) to get diep working over the network (100mbits) with a very limited
>>>>     number of nodes (n=2)
>>>>  b) possibility to not have a completely shared hashtable
>>>>  c) distribution over the net
>>>>
>>>>Of course c) i didn't even start experimenting with because the 100mbits
>>>>already proved not enough for nowadays search depths. Also the big bummers
>>>>for distributed DIEP was the big system time wasted to network communication.
>>>>
>>>>On paper it is little, in reality this slows down a program X times.
>>>>
>>>>If you don't slow down the problem is branching factor as i clearly
>>>>can show here. My speedup is 2.0 or more at bigger depths.
>>>>
>>>>My speedup sucks at 5-8 ply. At 8 ply in fact the speedup here reads 0.00
>>>>
>>>>Speedup at bigger depths 2.0 though.
>>>
>>>That is just a shortcoming of _your_ algorithm.  I don't have that problem
>>>at 8 plies.
>>>
>>>And that is a danger of drawing conclusions about _all_ programs when looking
>>>just at yours.  Just hold on a while... we will continue this once I have a
>>>distributed version up.  Then we can compare notes to see whether I get that
>>>33% (or higher) speed up or not.
>>>
>>>>
>>>>>It just takes a bit more
>>>>>thought and work, but it isn't impossible.  Now nails are being driven by
>>>>>compressed air, EM impulse, blanks in a gun using the nail as the projectile,
>>>>>etc.  All probably thought "impossible" 40 years ago.
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>--
>>>>>>>GCP





Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.