Computer Chess Club Archives


Subject: Re: Crap statement refuted about parallel speedup

Author: Vincent Diepeveen

Date: 17:53:37 09/21/01

On September 18, 2001 at 22:11:19, Robert Hyatt wrote:

>On September 18, 2001 at 20:36:17, Vincent Diepeveen wrote:
>
>>
>>You're not reading what I write, and your example sucks: the
>>effect only happens at, say, 15 minutes a move, and running 2 processes
>>on 1 cpu is too much overhead, so only after 1 hour do I get a speedup,
>>which I never get for any single move in a game.
>
>
>OK... wait for 3 years for hardware 10x faster.  then you can go more than
>1x faster with one cpu.  Which is going to be a _real_ trick to prove to
>the rest of the computer science world.
>
>
>>
>>Of course i tested this bigtime!
>>
>>If you read what I wrote a bit more carefully, then it's clear
>>that no program's move ordering is very good, in contradiction to some
>>crap statements a few years ago in RGCC, when hardly anyone except some
>>commercial guys had measured their cutoff rates and their flip rates.
>>
>>>have a bug, or an anomaly position.  It is simply not possible.  If you
>>>are _really_ doing that, you should modify your single-cpu version as
>>>follows:
>>
>>Basically you are also saying here that Fritz has a bug, because Fritz
>>behaves the same as DIEP here; the only difference is that the overhead
>>in Fritz is so big that it takes like half a day before a parallel
>>search gives more than a 2.0 speedup.
>
>
>I don't believe it for one second.  I haven't even seen Frans claim a
>speedup of 2.00 yet.  Last I heard he said "I don't get the same speedup
>as crafty does..."

There is some hard evidence here. Note that Fritz DOES print the
number of nodes it needs for each ply on the screen, unlike
Crafty.

Perhaps your parallel overhead is simply so huge that you hardly ever
see that sometimes you need fewer nodes, and I bet you never compare
single-cpu node outputs with dual-cpu node outputs.

Of course these effects are much harder to see on 4 CPUs, as I explained,
but it seems you're not listening to that!
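
To make that comparison concrete, here is a minimal sketch (my own
illustration, not code from DIEP, Fritz or Crafty): take the per-iteration
node counts and times that an engine prints for a single-cpu run and a
2-cpu run of the same position, and compute the node overhead and the
time-to-depth speedup. The numbers are copied from depths 7-10 of the DIEP
output quoted further down, on the assumption that the first block there is
the single-cpu run and the second the dual-cpu run.

  #include <stdio.h>

  int main(void) {
      /* per-iteration data for the same position: nodes and elapsed seconds,
         single-cpu run first, then the 2-cpu run (values taken from the
         DIEP output quoted later in this post)                              */
      int    depth[] = {     7,      8,       9,      10 };
      double n1[]    = { 81407, 202108, 1173384, 1767448 };
      double t1[]    = {     1,      3,      19,      28 };
      double n2[]    = { 83514, 207403, 1019046, 1750547 };
      double t2[]    = {     1,      2,      11,      17 };

      for (int i = 0; i < 4; i++) {
          double node_overhead = 100.0 * (n2[i] - n1[i]) / n1[i];
          double speedup       = t1[i] / t2[i];
          printf("depth %2d: node overhead %+6.1f%%  time-to-depth speedup %.2f\n",
                 depth[i], node_overhead, speedup);
      }
      return 0;
  }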

>
>
>>
>>Ah DANG, more programs have it.
>>
>>I bet you didn't know that!!

>Nor did you, I'll bet.

I was pretty amazed when I saw Deep Fritz showing this too.
Its parallel search overhead is so big, however, that it takes
a long time to see it!

>>
>>Everyone who has spent a lot of time
>>parallelizing a program in a proper way sees this effect.
>
>
>
>Vincent, that is simply _false_.  Want to bet I know more "everyones" than
>you do?  lets start.  Me.  Hsu.  Newborn.  Schaeffer.  Waycool group at caltech.

Your info is usually 20 years old, like the Cray Blitz info.

You're one of the pioneers of computer chess, especially parallel
computer chess, let's not forget that, but this is 2001 and others
are being creative now!

Cray Blitz from 15 years ago would be blown away by a single
P3-800MHz nowadays, for several reasons:
  a) progress in algorithms (R=3)
  b) that machine is much slower than even a single cpu is nowadays
  c) opening books
  d) better testing
  e) better tuning
  f) more knowledge has been tested into today's software
  g) less dubious forward pruning by most commercial programs

And the list goes on. You had excellent hardware for those days, but it
does not compare well with today's programs once everything is extrapolated
to today's hardware.

Another thing with a major impact on parallel behaviour is that you used R=1
with Cray Blitz in a very limited way, whereas nowadays R=3 is used, and near
the tips some use R=2 (I'm back to R=3 there too).
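
For readers who don't know the recipe, here is a minimal sketch of that
null-move scheme (generic code, not the actual DIEP, Crafty or Cray Blitz
source; Position, search(), make_null_move() and unmake_null_move() stand for
the host engine's own types and routines, the depth at which R drops to 2 is
an arbitrary choice here, and the usual preconditions such as not being in
check are omitted):

  typedef struct Position Position;   /* the engine's board representation   */
  int  search(Position *pos, int depth, int alpha, int beta);
  void make_null_move(Position *pos);
  void unmake_null_move(Position *pos);

  /* Try a null move before the normal move loop; returns 1 on a cutoff.     */
  int try_null_move(Position *pos, int depth, int beta, int *score)
  {
      int R = (depth >= 7) ? 3 : 2;    /* R=3 in the main tree, R=2 near tips */

      make_null_move(pos);             /* hand the opponent a free move       */
      *score = -search(pos, depth - 1 - R, -beta, -beta + 1);
      unmake_null_move(pos);

      return *score >= beta;           /* fail high: prune this subtree       */
  }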

>Warnock/Wendroff.  Spracklens.  Moreland.  Daily.  Several amateurs using SMP.
>I have
>heard _nobody_ claim to go more than 2x faster on 2 cpus.  It is a simply
>preposterous idea.  For the reason I gave.  You could use two threads on one

OK, let's start discussing them. The first two guys I have never heard of.
The Spracklens only programmed small CPUs AFAIK, and they *never* searched
very deep.

Moreland works on a 4-processor machine, using recursion and doing
all kinds of dubious stuff in the last few plies, and also using singular
extensions and all other kinds of extensions. Just turning on these
extensions already means the chance he gets a > 2x speedup is near zero.

Only if he throws out all the dubious extensions and runs on 2 CPUs
will he get near it.

Daily is using the much-praised Cilk language to parallelize, SO HE
DOESN'T CONTROL THE PARALLEL SPEEDUP AT ALL!

So Daily can forget about ever getting a 2x speedup.

More persons to discuss?

Aha, Rudolf Huber with SOS, Patzer SMP, Shredder.

Well, I'm not sure about Shredder, but the first two are certainly
using a completely different parallel strategy, where gains can *only*
be made through the hashtable somehow.

So getting 2.0 with that is very unlikely.

Which programs am I missing?

There are simply too few parallel programs, Bob; that is the key problem!

There is Crafty... ...and Crafty... ...and Crafty... ...oh yes, and DIEP.

And the rest hardly posts anything about their parallel implementation
and speedups anyway.

So 50% of the programs whose speedup results do get published are
getting a > 2.0 speedup!

Note that in some positions Crafty also goes
faster than 2x on 2 CPUs. Did you forget that?

Now imagine that this were the case in a few more positions; you're
missing the boat somewhere!!!!!!!

My advice, hop in and give it a try!

There is another thing which needs to be mentioned when talking
about 4 CPUs or more: your load balancing has a worst-case
speedup of 1.0, which is not a very promising idea. At 4 CPUs that
means you might use the machine at only 25% efficiency in the worst case.

At 2 CPUs that is 50%.

So 2 CPUs is the optimal number of CPUs to run in parallel on, IMHO:

 a) the worst case is better than that of any other number of processors
    other than n=1
 b) you can profit from parallel luck in cutoffs, move ordering improvements
    etcetera.

Why have I never heard *anyone* talk about the worst-case
speedup their parallel algorithm is going to give?

I mean, there are pretty easy load-balancing ideas. One of them is
the one I'm using myself: simply demanding that each split point
gets 50% of the processors.

So the expected worst case is 50% (there are theoretical possibilities
that it's worse than 50%, but the likelihood of that is far smaller than
with the way Crafty runs in parallel).

Whereas the expected worst case in Crafty is 25%.
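
A minimal sketch of how such a rule can look (this is my own reading of the
idea, not DIEP's actual source; idle_count() and split_point_search() stand
for whatever SMP plumbing the engine already has):

  typedef struct Position Position;
  int idle_count(void);                /* processors currently sitting idle   */
  int split_point_search(Position *pos, int depth, int alpha, int beta,
                         int nprocs);  /* run a parallel search at this node  */

  #define TOTAL_CPUS 2                 /* 2 here; the same rule works for 4   */

  /* Only open a split point if at least half of all processors can work on
     it, so even a worthless split still keeps 50% of the machine busy on
     one subtree.  Returns 0 when the caller should keep searching alone.    */
  int try_split(Position *pos, int depth, int alpha, int beta)
  {
      int here = idle_count() + 1;     /* helpers plus the splitting cpu      */

      if (2 * here < TOTAL_CPUS)       /* less than 50% available: no split   */
          return 0;

      return split_point_search(pos, depth, alpha, beta, here);
  }

With 2 CPUs the condition is always met; with 4 it forces at least 2
processors onto every split point, which is where the 50% versus 25%
worst-case figures above come from.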

>cpu and get the same effect.  And if that is true, it simply means your move
>ordering is so much worse than mine that it doesn't count.

>I am assuming a "decent program" when I say a chess program can't produce a
>speedup of > 2.00 using 2 cpus.
>
>
>>
>>You wrote your parallellism in crafty within 2 or 3 months and then of
>>course some bugfixing. In the meantime you had probably the most busiest
>>part of the year, etc.
>>
>>Others had major problems with parallellism, simply didn't manage, and
>>i can imagine. It is a disaster to get good working parallellism!
>>
>>Some french a look like algorithms are very popular now. Of course
>>2.0 won't happen with these algorithms. Not even 1.7 or 1.8.
>>
>>Other programs which extend a lot (so processors in different search spaces)
>>they report a way smaller speedup than others who use the same method!
>>
>>It all comes down to very simple conclusion which you don't seem to
>>want to make, "because our move ordering is near perfect anyway,"
>>to quote a popular line from some 4 years ago from you!
>
>
>92% is "near perfect".  At least in AI literature circles.  A "strongly
>ordered game tree" is one where the best move is searched first in 90% of
>the positions or higher.
>
>
>>
>>this assumption is however INCORRECT. Move ordering still can give
>>A LOT smaller search trees!
>
>And SMP is _not_ going to affect move ordering.
>
>
>
>>
>>Better eval ==> more move ordering needed
>>
>>>search normally until you would choose to use a second process if you had one.
>>>From this point forward, time-slice between two search threads.  Search one
>>>node on one, one on the other.  That should run twice as fast, since it is
>>>_exactly_ how the program really does on the SMP platform.
>>
>>As you can see the effect only happens after say 15 minutes and i do not
>>have 15 minutes of time.
>>
>>Faster hardware will not reduce the problem either.
>>
>>Note that usually i run with stupid extensions in a game also and then
>>my speedup is then around 1.8 when it is 2.0 or more now and with these
>>extensions i hardly get above 1.8 also because the searchspaces of
>>both processors do not overlap, so my speedup is 2.0 minus slowdown which
>>is 0.2 leads to 1.8 speedup.
>>
>>>And that will _not_ work unless you have either the world's worst sequential
>>>search algorithm, or the world's buggiest parallel search algorithm.
>>
>>How on earth can both Fritz, DIEP have worse search algorithms than you use?
>
>You answer that.  I only know the claim of >2 is nonsense.  From induction
>and logical reasoning.  Write a paper describing your algorithm.  Submit it
>to the ICCA or International Journal on Parallel processing or whatever,
>and see what the referees have to say about it.  If your explanation stands
>up to that scrutiny, I will give it more credibility.  But not at the
>moment...  a speedup of > 2 would make you a small fortune in computing
>circles...
>
>
>
>
>
>>
>>>Before you label a statement as ridiculous, find me _one_ person that will
>>>side with you and say that "yes, it is possible to get a > 2 speedup using
>>>only two cpus on anything but anomaly positions.  Find _one_ person that will
>>>agree...
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>Let me list some things:
>>>>  a) your own move ordering sucks from theoretical viewpoint
>>>>  b) mine sucks even more from theoretical viewpoint
>>>>  c) DIEP needs every ply less nodes than crafty
>>>>  d) DIEP is not recursive
>>>
>>>
>>>All interesting observations.  But totally irrelevant for the argument
>>>at hand, namely producing consistent super-linear speedups.
>>>
>>>
>>>
>>>>  e) DIEP is a slower program than crafty so parallel overhead
>>>>     is less expressed in searchtime
>>>>  f) In diep i always print out the number of nodes for each searchdepth
>>>>     in crafty you do not do this, so i can easily see that at bigger
>>>>     depths i need LESS nodes. You only see search times. No one knows how
>>>>     much overhead is in number of nodes for crafty though.
>>>
>>>
>>>my NPS is constant, so displaying the search time or node count is
>>>providing identical information.
>>>
>>>
>>>
>>>>
>>>>How many nodes do you get to get to the same depths?
>>>
>>>
>>>_always_ more nodes using 2 cpus.  If you don't you have a bug, except that
>>>in rare positions you can get fewer.  Kopec 1 is a good example.  The checking
>>>move is a very short mate.  Other moves are not mates.  If you search the
>>>non-checks first, it takes more nodes.  If you search the check first, it takes
>>>less.  The parallel search might search the check sooner than the sequential
>>>search and find the mate (and adjust alpha/beta bounds) much quicker to shrink
>>>the tree.  But that is an _exception_.  For every one of those, there are 100
>>>that don't behave that way.
>>>
>>>
>>>
>>>>
>>>>You know i only use nullmove, i do checks in qsearch, i do not use
>>>>futility or any other crap. No other forward pruning!
>>>
>>>
>>>so what, again?  You just told me that it takes you two hours to dig a
>>>hole.  And that if I give you two people, you can dig it in less than one
>>>hour.  That is an amazing thing that violates the conservation of energy in
>>>physics.
>>>
>>>Twice the power can only do twice the work, not 3x or 4x.  _period_.
>>>
>>>
>>>>
>>>>The chance that, at a position which is a cut node, DIEP
>>>>gets a cutoff there is on average 0.85
>>>>
>>>>You  measured crafty to have 0.90
>>>>
>>>>So our move orderings suck.
>>>
>>>
>>>92 is my number.  that does not 'suck'.  Since 100% would be impossible to
>>>do.
>>>
>>>
>>>
>>>
>>>>
>>>>Now imagine you search parallel and as you know my fliprate
>>>>is way less than 1%, in crafty it is 5% or more.
>>>>
>>>>So
>>>>  a) diep is more than crafty researching the same gamespace, explaining
>>>>     of course why i need less nodes for each ply
>>>>
>>>>  b) that means most likely that if i split, the chance that DIEP
>>>>     does unnecessary work is smaller than in crafty. In fact it's a
>>>>     galaxy smaller!
>>>>
>>>>  c) crafty has way more overhead because it is recursive
>>>
>>>
>>>What does this mean?  I have done searches _both_ ways.  Cray Blitz was
>>>iterative, crafty is recursive.  Overhead is the same.  The only problem is
>>>that in Crafty, I can split at a "current ply" or not at all.  In cray blitz
>>>I could split at a current ply or any ply before those.  The performance
>>>difference is not that different.  And Cray Blitz didn't produce speedups of
>>>> 4 with 4 cpus...
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>Now c) is very important here to realize that where at a bigger iteration
>>>>each processor in DIEP is simply splitting less and less, in crafty
>>>>this is relatively bigger as it needs to help the other processor time
>>>>after time with its search tree. So if we compare depths at where
>>>>we on average split from the root, then we will see very clearly an
>>>>advantage for DIEP.
>>>>
>>>
>>>
>>>Nope, because I split at the root also.  I have seen searches to 15
>>>plies that did exactly 15 splits and no more until right at the end
>>>when they try to help each other finish the last few branches on the last
>>>move searched.
>>>
>>>
>>>
>>>>All this explains easily why my 0.85 improves to a better number when
>>>>i search deeper and why with crafty this is less likely to be the case.
>>>>
>>>>Just the *possibility* that this 0.85 improves already is the easy proof
>>>>that speedups of over 2.0 is what i get at analysis levels.
>>>
>>>
>>>This is simply nonsense.  speedups of > 2.0 are simply bugs that are waiting
>>>to be found.  If they are consistent, even worse.
>>>
>>>
>>>
>>>
>>>>
>>>>To show the other position you tested, here i have a 2.8 speedup at 2
>>>>processors:
>>>>
>>>
>>>
>>>that only shows that you have bugs galore, I am afraid.  Again, find anyone
>>>that will support your theory about > 2.00 speedups using only 2 cpus.  If
>>>that is the case, why can't you get > 1.00 speedup using _one_???
>>>
>>>
>>>
>>>
>>>
>>>>00:01 0 0 81407 (0) 7 0.258 Bc8-f5 Qd4-d2 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 Ng3xe4 d5xe4
>>>>00:03 0 0 202108 (0) 8 0.174 Bc8-f5 Qd4-d2 Qd8-a5 Qd2-f2 O-O Be3-d4 Qa5-c7 Ra1-e1
>>>>00:19 0 0 1173384 (0) 9 0.317 Bc8-f5 Qd4-d1 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 f4-f5 d5-d4 Ng3xe4 d4xe3
>>>>00:28 0 0 1767448 (0) 10 0.155 Bc8-f5 Qd4-d1 O-O Nc3-e2 Qd8-d7 Ne2-d4 Bf5-e4 f4-f5 c6-c5 e5-e6 f7xe6 f5xe6
>>>>01:35 0 0 6079251 (0) 11 0.237 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 Ne2-g3 Rf8-e8 Qd1-d2 Qd8-c7 Ra1-d1 Re8-d8
>>>>03:26 0 0 13042725 (0) 12 0.070 Bc8-f5 Qd4-d1 Qd8-a5 Nc3-e2 Bf5-e4 Ne2-d4 O-O Qd1-h5 Qa5-b4 Rf1-b1 Be7-c5 c2-c3
>>>>++ b7-b6
>>>>24:22 0 0 84331794 (0) 12 0.079 b7-b6 Nc3-e2 Bc8-a6 Qd4-d2 O-O Rf1-d1 c6-c5 Be3-f2 Qd8-d7 Ne2-g3 d5-d4 c2-c3
>>>>48:18 0 0 169226934 (0) 13 0.131 b7-b6 Qd4-d3 a7-a5 Nc3-e2 Bc8-a6 Qd3-d1 O-O Rf1-e1 f7-f6 Ne2-d4 Qd8-c8 e5-e6 Be7-d6
>>>>++ c8-f5
>>>>50:54 0 0 178647431 (0) 13 0.194 Bc8-f5 Qd4-d1 O-O Nc3-e2 Qd8-d7 Qd1-d2 Bf5-e4 Ne2-g3 Rf8-d8 f4-f5 Qd7-c8 Be3-d4
>>>>01:06:07 0 0 232264385 (0) 14 0.075 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-g4 Qd1-d3 f7-f6 h2-h3 Bg4-d7 a4-a5 f6xe5 f4xe5 Qd8-c7 a5-a6
>>>>
>>>>00:01 0 0 83514 (7267) 7 0.258 Bc8-f5 Qd4-d2 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 Ng3xe4 d5xe4
>>>>00:02 0 0 207403 (63479) 8 0.174 Bc8-f5 Qd4-d2 Qd8-a5 Qd2-f2 O-O Be3-d4 Qa5-c7 Ra1-e1
>>>>00:11 0 0 1019046 (236794) 9 0.317 Bc8-f5 Qd4-d1 O-O Nc3-e2 c6-c5 Ne2-g3 Bf5-e4 f4-f5 d5-d4 Ng3xe4 d4xe3
>>>>00:17 0 0 1750547 (263775) 10 0.254 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 c2-c3 c6-c5 Ne2-g3 Qd8-b6 Qd1-e2
>>>>00:41 0 0 4840190 (582946) 11 0.237 Bc8-f5 Qd4-d1 O-O Nc3-e2 Bf5-e4 Ne2-g3 Rf8-e8 Qd1-d2 Qd8-c7 Ra1-d1 Re8-d8
>>>>02:18 0 0 16340868 (3553884) 12 0.070 Bc8-f5 Qd4-d1 Qd8-a5 Nc3-e2 Bf5-e4 Ne2-d4 O-O Qd1-h5 Qa5-b4 Rf1-b1 Be7-c5 c2-c3
>>>>++ b7-b6
>>>>10:41 0 0 72051608 (11640625) 12 0.079 b7-b6 Qd4-d2 O-O Nc3-e2 Bc8-a6 Rf1-d1 c6-c5 Be3-f2 Qd8-d7 Ne2-g3 d5-d4 c2-c3
>>>>quit
>>>>
>>>>
>>>>>IE if you can go 4x faster with 2 processors, you can go 16 x faster with
>>>>>4 processors.  That I don't believe as I have _never_ seen anyone else report
>>>>>such a number and not later find serious bugs.
>>>>
>>>>I do not have parallel bugs i am very sure of that. A good proof is
>>>>that i nearly always get the same score and mainline in both parallel
>>>>and single cpu searches. How's crafty here?
>>>
>>>That doesn't say a thing about bugs.  Crafty reproduces most of the time,
>>>but not _all_ of the time.  Which theory predicts, by the way.
>>>
>>>
>>>
>>>
>>>>
>>>>I turned off the dangerous extensions in this test too, so that
>>>>dubious things which are alfabeta dependant did not influence the test
>>>>and that i would search more in the same search space. When i turn them
>>>>on just like crafty the both processors search less in the same space
>>>>and my speedup hardly gets over 1.8 then.
>>>>
>>>>Note that you use things like futility pruning bigtime in crafty.
>>>>This is very easily shown dubious, but it obviously means that you
>>>>get deeper quicker initially. Perhaps at bigger depths it is payback
>>>>time!
>>>
>>>
>>>I only do futility pruning in the q-search.  I did it exactly the same way
>>>in Cray Blitz also (tossing losing SEE captures).
>>>
>>>
>>>
>>>
>>>>
>>>>The logic that i would be 16 times faster on 4 processors is not true
>>>>of course.
>>>>
>>>>At 4 processors my overhead is way way bigger than at 2
>>>>processors to start with.
>>>>
>>>
>>>
>>>You can't have it both ways.  4.0 x with 2 has to translate to 16x with 4,
>>>otherwise we are not in base 10 math.
>>>
>>>
>>>
>>>
>>>
>>>>Also i limit the number of processors at each node
>>>>to at most 2. I want to cut off quicker without risking that 75% of
>>>>my cpu's do superfluous work.
>>>>So then as a result of that 2 limitation is that it is not possible
>>>>to ideally split at 4 processors like it is possible with 2.
>>>>
>>>>Last but not least, a factor 2.8 or 3.5 at 2 processors is a very good
>>>>speedup, but that's not near the magnitude of speedup of 16 at 4
>>>>processors.
>>>
>>>3.5 at 2 is 14 at 4.  That is still an impossible number...
>>>
>>>You can wave your hands all you want, but 1+1 is _still_ <= 2, never
>>>greater.  To think otherwise simply defies logic and common sense.
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>>>
>>>>>>Note that the above diep version i used is a version *without*
>>>>>>dangerous extensions.
>>>>>>
>>>>>>>In each of the two positions tested above, I ran the thing with one
>>>>>>>processor and then with 4 processors.  Everything else
>>>>>>>remained exactly the same.  The first position is kopec 22,
>>>>>>>the second is kopec23.  Chosen at random for no good reason.
>>>>>>
>>>>>>Easy move ordering in the first position, and the score is not
>>>>>>slowly going down, so i'm not so sure whether
>>>>>>it's smart to use this one, but here the results from DIEP speak
>>>>>>for themselves. Also at one of the positions you tried you get a speedup increase.
>>>>>>
>>>>>>Now i'll be the last to say that the worst case speedup from crafty at 4
>>>>>>processors is very good. But definitely there is evidence for you to
>>>>>>keep testing also but not with APHID :)
>>>>>>
>>>>>>>I can do perfectly fine with very short searches.  Or very long searches.
>>>>>>
>>>>>>I find 2.75 at short search versus 4.0 at longer search very clearly
>>>>>>not what you conclude.
>>>>>
>>>>>
>>>>>basically 3x vs 4x.  I have some start-up costs for single-position searches
>>>>>that makes this look worse than it really is.  I start the 3 extra threads
>>>>>_after_ the search clock is started, so that I get "billed" for creating three
>>>>>new processes.  Once they are created, they will last for the whole game, but
>>>>>in this case, the whole game is one move.  The cost hurts.  I could have run a
>>>>>dummy problem first to get them started, but it wasn't worth the trouble.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>But remember that you and I are not splitting the tree in the same way,
>>>>>>>because I can split at the root when it is reasonable to do so.  that
>>>>>>>helps the speedup quite a bit.
>>>>>>
>>>>>>It doesn't work for me and i find it very dangerous. Here in this position
>>>>>>the ply after you have found Bxe4 this is not a big problem of course.
>>>>>
>>>>>Splitting at the root is not dangerous.  In fact, it is the _optimal_ place
>>>>>to split after you have searched the first move.  You get _zero_ search
>>>>>overhead due to bad move ordering.  Make that 0.00...
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>I don't want to crack down results from the APHID team back in 1996,
>>>>>>>>but we must clearly realize that what worked back then is NOT going to
>>>>>>>>work in 2001 anymore!
>>>>>>>
>>>>>>>Again, would you care to make a small wager?  They got about 33% efficiency
>>>>>>>on their hardware.  I think that can be improved on with some more work.  They
>>>>>>>only wanted to get it to work, not work very efficiently.  And I think that
>>>>>>>50% efficiency is doable.  And on my giganet hardware I think that well beyond
>>>>>>>50% is very doable.
>>>>>>
>>>>>>33% with 8 ply searches in 1996 with a program from 2500 nodes a second at
>>>>>>16 expensive workstations, i find that very poor!
>>>>>>
>>>>>>Even if you throw over the ball by hand you can get a better speedup at
>>>>>>8 ply with that a slow speed!
>>>>>>
>>>>>>I got single cpu end 1996 at a pentiumpro200 already 5000-10000 nodes
>>>>>>a second, and i'm pretty sure that DIEP back in 1996 also was not a very
>>>>>>fast program, they needed 16 very expensive
>>>>>>workstations to get 2500 nodes a second in total!
>>>>>
>>>>>I haven't looked at their data.  But I have worked with parallel applications
>>>>>for a long time.  And I claim to understand the ins and outs of alpha/beta
>>>>>search and parallel issues pretty well.  I don't see a _thing_ that would
>>>>>suggest that their speedup won't hold on today's hardware.  I have had a
>>>>>couple of students do parallel othello programs on a distributed cluster.  They
>>>>>did pretty well...
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>If i would slow down diep factor 100 or something, then i'm sure that i
>>>>>>can even get a speedup over the internet with it at 16 nodes better than
>>>>>>what these guys showed!
>>>>>
>>>>>No it isn't.  They used whatever hardware they had at the time.  If you
>>>>>ramp up the CPU speed, _and_ the network speed, things tend to scale
>>>>>uniformly if the two speeds increase uniformly.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>This refers to the fact that YBW for each node needs a result for the
>>>>>>>>>>first move first before splitting other moves. Now this is of course very
>>>>>>>>>>true, but this directly also shows that APHIDs branching factor is by
>>>>>>>>>>definition bigger than algorithms which use YBW.
>>>>>>>>>
>>>>>>>>>I don't think they make any claims that their branching factor is better
>>>>>>>>>than YBW. Would seem pretty silly to me. But then again, YBW probably
>>>>>>>>>gets killed in different ways when having to deal with slow interprocess
>>>>>>>>>comms on a cluster.
>>>>>>>
>>>>>>>YBW doesn't exclude clusters.  In a tree search, the number of "split"
>>>>>>>operations has to be constrained so that the overhead of splitting doesn't
>>>>>>>dominate the work done in actually searching the tree.  I control this in
>>>>>>>Crafty by limiting how near the tips of the branches I can split the tree.
>>>>>>
>>>>>>Yes that's true, but you might underestimate something very important
>>>>>>here. You use nullmove bigtime in crafty. That means that missing
>>>>>>transpositiontable lookups from other processes last few plies means that
>>>>>>you definitely have a problem.
>>>>>
>>>>>
>>>>>It isn't _that_ big of a problem. I ran that test for you a long while back
>>>>>about not hashing in the last N plies of the search.  It didn't kill me nearly
>>>>>as badly as you were saying it would.  And in some cases it probably makes
>>>>>sense, in fact.  You already do that if you store "depth-preferred" as the
>>>>>last few plies can't overwrite all those deeper plies.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>Also another small thing is that each processor has a local hashtable;
>>>>>>Rainer Feldmann said: "let's see all the RAM of each node as one big
>>>>>>distributed transpositiontable".
>>>>>>
>>>>>>Now that means that his hashtable was n times bigger in his approach
>>>>>>than might be the case in your current plans!
>>>>>
>>>>>I plan on doing shared hash too.  At least at early plies in the tree.
>>>>>Near the tips I might do local-only, I don't know.
>>>>>
>>>>>>
>>>>>>>For a distributed approach, I intend on a second bound, which
>>>>>>>limits distributed
>>>>>>>splits even more severely, because the overhead is going to be higher in net
>>>>>>>communication/latency.
>>>>>>
>>>>>>In that case you cannot distribute hashtable even. Also the branching factor
>>>>>>is going to be huge and the time needed to split a subtree is going to be
>>>>>>seconds.
>>>>>
>>>>>I can split a tree with one packet.  It won't take seconds.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>To avoid loads of idle time this would mean that a search is going to need
>>>>>>tens of minutes otherwise the relation between idling and searching is
>>>>>>going to be bigtime in the advantage of the idling.
>>>>>
>>>>>That is nonsense.  If I can guarantee that a piece of work will take at least
>>>>>one second, the overhead will be minimal.  Because I can send a big packet in
>>>>>1/100th of a second using simple 100mbit ethernet.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>I think the difference between myself and Vincent is that I am in the business
>>>>>>>of trying to find out _how_ to accomplish something, no matter how difficult it
>>>>>>>might seem at first glance.  I don't assume everything is impossible if I
>>>>>>>haven't done it myself.  I assume everything is possible, until I prove that it
>>>>>>>can't be done by trying every possible alternative I can think of.  IE I would
>>>>>>>rather drive a nail using a hammer.  But I have done it using pliers, a wrench,
>>>>>>>a rock, a screwdriver handle, and even a C-clamp.
>>>>>>
>>>>>>Well i'm from europe and we're more careful here,
>>>>>>i already have done huge calculations and many network
>>>>>>experiments on whether it was possible
>>>>>>  a) to get diep working over the network (100mbits) with a very limited
>>>>>>     number of nodes (n=2)
>>>>>>  b) possibility to not have a completely shared hashtable
>>>>>>  c) distribution over the net
>>>>>>
>>>>>>Of course c) i didn't even start experimenting with because the 100mbits
>>>>>>already proved not enough for nowadays search depths. Also the big bummers
>>>>>>for distributed DIEP was the big system time wasted to network communication.
>>>>>>
>>>>>>On paper it is little, in reality this slows down a program X times.
>>>>>>
>>>>>>If you don't slow down the problem is branching factor as i clearly
>>>>>>can show here. My speedup is 2.0 or more at bigger depths.
>>>>>>
>>>>>>My speedup sucks at 5-8 ply. At 8 ply in fact the speedup here reads 0.00
>>>>>>
>>>>>>Speedup at bigger depths 2.0 though.
>>>>>
>>>>>That is just a shortcoming of _your_ algorithm.  I don't have that problem
>>>>>at 8 plies.
>>>>>
>>>>>And that is a danger of drawing conclusions about _all_ programs when looking
>>>>>just at yours.  Just hold on a while... we will continue this once I have a
>>>>>distributed version up.  Then we can compare notes to see whether I get that
>>>>>33% (or higher) speed up or not.
>>>>>
>>>>>>
>>>>>>>It just takes a bit more
>>>>>>>thought and work, but it isn't impossible.  Now nails are being driven by
>>>>>>>compressed air, EM impulse, blanks in a gun using the nail as the projectile,
>>>>>>>etc.  All probably thought "impossible" 40 years ago.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>--
>>>>>>>>>GCP


