Author: Robert Hyatt
Date: 12:11:49 09/05/02
Go up one level in this thread
On September 05, 2002 at 14:37:37, Eugene Nalimov wrote: >Here is (probably oversimplified) explanation. > >Alpha-beta is a very sequental algorithm. Here are 2 reasons which made it very >hard to parallelize it well: > >(1) You are using information obtained during the search of the first move when >you are searching the next ones to search them faster. As a result if you'll >start search them before you finished first one, you'll not be able to use that >information, and you'll search more nodes. I.e. on a dual machine you can search >1.9x more nodes, but part of them are the extra nodes that would not be searched >by the sequental program, so actual speedup would be even less. > >(2) In a lot of positions (including root) you are spending 50% (or even more) >of time searching the first [best] move. If you'll not start using 2nd CPU till >you searched the first move, than your speedup on the dual system would be not >great. You'll still use 50% of total time searching the first move, and even >assuming that you can search remaining moves 2 times faster, on dual system you >will use 75% of the one-CPU search, i.e. you would get 1.33x speedup in the best >case. > >So, when implementing parallel search, you should start using 2nd CPU early >enough, otherwise your speedup would be bad because first CPU will spend large >amount of time while 2nd CPU is idle. OTOH if you'll start to use it too early, >its' search will be less efficient (compared to the sequental version), it will >search extra nodes, and your speedup still will be bad. > >And don't forget that while instructing the 2nd CPU what to search, 1st CPU is >spending its time, instead of doing more useful work (i.e. instead of >searching). That's the infamous 3k that are regularly mentioned in that thread >-- when staring the search using another CPU (splitting the tree) Crafty is >copying ~3k of data. If you'll do it too often that will kill your performance, >as data is going through slow bus to memory and another CPUs. > >To complicate the things further, there is another reason of slowdown: several >CPUs share system bus, memory, etc. I.e. sometimes one CPU have to wait for >resources that other CPU is currently using, even if they are performing >independent tasks. That's why "raw nps" would increase less than 2x on the dual >system, regardless of the efficiency of algorithm used in the chess program. >Exact amount of that that slowdown depend on the type of CPU used, chipset, etc. >Here server CPUs and chipsets are the best, as they were designed with >scalability in mind. For example, I saw 1.98x speedup on an expensive 64-bit >system. And here AMD dual systems are not very good: for them typical "raw nps" >speedup is much less than for Intel systems. In theory, if that slowdown is bad >enough, you can see that dual system would be always slower than single-CPU: due >to inefficiency of the parallel search it would search more nodes, and each >noder would be searched slower... > >Thanks, >Eugene I should add that my idea about "root splitting" does not contradict the above. IE if I do a 10 ply search, I first split at 10, then at 9, eventually back at 8, and so forth. Some stop at 2, but I don't and I can split at one if I so choose. And since by that time I have searched the presumably best move, I have the alpha bound established and searching the rest of the root moves in parallel produces _zero_ search overhead, which is a key point... > >On September 05, 2002 at 11:14:17, Rolf Tueschen wrote: > >>On September 05, 2002 at 10:53:22, Vincent Diepeveen wrote: >> >>>On September 04, 2002 at 14:29:05, Robert Hyatt wrote: >>> >>>you get even with fast memory on your quad only a 2.8 speedup >>>on average at 30 positions. >> >>1. May I politely suggest to you that you stop this tradition of writing your >>new remarks below the top line of the quotes "someone wrote..."? >> >>2. Here is a basic question from my ignorance and for all the young readers here >>who wouldn't ask such questions which prove that someone hasn't understood >>something. >> >>Could either Bob or Vincent explain why sometimes the duel is about below 2. and >>then here above even 2.8 should be nothing special following Vincent? >> >>What is this number standing for is it about the advantage of a n+1 processor >>number "over" n? And second, what is the point of significant advantage? Greater >>than 2.? And all what is below 2, this is weak progress? >> >>3. Is 2. or 1.97 a factor of time, NPS or what? >> >>4. This split at the root thing, such a refinement I know from forest sciences, >>but here, what does it mean, and most of all, how many chapters of CC I'm away >>before I could understand such a term? And how does this topic influence the >>question of magnitude of advantage through processor number? You see how >>confusing that is... >> >> >>I'm the real ignorant here and I still have the guts to admit it and to ask >>these questions. Simply because otherwise I could never hope to improve. >> >> >>Thank you! >> >>Rolf Tueschen >> >> >> >>> >>>that's a lot more than a run on 1 position. >>> >>>>On September 04, 2002 at 14:02:54, Vincent Diepeveen wrote: >>>> >>>>>On September 04, 2002 at 12:43:37, Robert Hyatt wrote: >>>>> >>>>>Please run crafty at 16 processors. Fine with me. >>>>>Even though it's a different program. I have no problems >>>>>with it. >>>> >>>>And what would be the point? I might give you some 16 processor >>>>numbers on a NUMA machine before long. I _might_. >>>> >>>> >>>>> >>>>>But rewrite also the article then that it's not a DTS thing, >>>>>but a smp_lock thing that doesn't scale above 8 cpu's. >>>> >>>> >>>>Vincent, the smp_lock thing doesn't hurt me thru 16 cpus as I already >>>>know. I don't understand why you don't follow this, but in a typical >>>>3 minute search, I see numbers like this: >>>> >>>> time=3:29 cpu=399% mat=0 n=303284136 fh=89% nps=1450k >>>> ext-> chk=4663926 cap=1175890 pp=230533 1rep=74539 mate=3299 >>>> predicted=2 nodes=303284136 evals=99342268 >>>> endgame tablebase-> probes done=0 successful=0 >>>> SMP-> split=774 stop=133 data=14/64 cpu=13:55 elap=3:29 >>>> >>>>That is from a real game played on ICC. >>>> >>>>Note it only did 774 splits. that is 774 smp_locks. Do you _really_ think >>>>that hurts performance? _really_? >>>> >>>>If so, I have this bridge I need to get rid of... >>>> >>>>You can say smp_lock is a problem all you want. You can say that it killed >>>>you on a NUMA machine all you want. But that doesn't mean it kills _me_ >>>>on 8 or 16 processors... >>>> >>>> >>>>BTW I would hate to publish 16 cpu crafty numbers, because that would probably >>>>give you _another_ problem to overcome with your "sponsors". :)
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.