Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Table statement ** Please a beginner's question for a change

Author: Robert Hyatt

Date: 12:11:49 09/05/02

Go up one level in this thread


On September 05, 2002 at 14:37:37, Eugene Nalimov wrote:

>Here is (probably oversimplified) explanation.
>
>Alpha-beta is a very sequental algorithm. Here are 2 reasons which made it very
>hard to parallelize it well:
>
>(1) You are using information obtained during the search of the first move when
>you are searching the next ones to search them faster. As a result if you'll
>start search them before you finished first one, you'll not be able to use that
>information, and you'll search more nodes. I.e. on a dual machine you can search
>1.9x more nodes, but part of them are the extra nodes that would not be searched
>by the sequental program, so actual speedup would be even less.
>
>(2) In a lot of positions (including root) you are spending 50% (or even more)
>of time searching the first [best] move. If you'll not start using 2nd CPU till
>you searched the first move, than your speedup on the dual system would be not
>great. You'll still use 50% of total time searching the first move, and even
>assuming that you can search remaining moves 2 times faster, on dual system you
>will use 75% of the one-CPU search, i.e. you would get 1.33x speedup in the best
>case.
>
>So, when implementing parallel search, you should start using 2nd CPU early
>enough, otherwise your speedup would be bad because first CPU will spend large
>amount of time while 2nd CPU is idle. OTOH if you'll start to use it too early,
>its' search will be less efficient (compared to the sequental version), it will
>search extra nodes, and your speedup still will be bad.
>
>And don't forget that while instructing the 2nd CPU what to search, 1st CPU is
>spending its time, instead of doing more useful work (i.e. instead of
>searching). That's the infamous 3k that are regularly mentioned in that thread
>-- when staring the search using another CPU (splitting the tree) Crafty is
>copying ~3k of data. If you'll do it too often that will kill your performance,
>as data is going through slow bus to memory and another CPUs.
>
>To complicate the things further, there is another reason of slowdown: several
>CPUs share system bus, memory, etc. I.e. sometimes one CPU have to wait for
>resources that other CPU is currently using, even if they are performing
>independent tasks. That's why "raw nps" would increase less than 2x on the dual
>system, regardless of the efficiency of algorithm used in the chess program.
>Exact amount of that that slowdown depend on the type of CPU used, chipset, etc.
>Here server CPUs and chipsets are the best, as they were designed with
>scalability in mind. For example, I saw 1.98x speedup on an expensive 64-bit
>system. And here AMD dual systems are not very good: for them typical "raw nps"
>speedup is much less than for Intel systems. In theory, if that slowdown is bad
>enough, you can see that dual system would be always slower than single-CPU: due
>to inefficiency of the parallel search it would search more nodes, and each
>noder would be searched slower...
>
>Thanks,
>Eugene


I should add that my idea about "root splitting" does not contradict the
above.  IE if I do a 10 ply search, I first split at 10, then at 9,
eventually back at 8, and so forth.  Some stop at 2, but I don't and I
can split at one if I so choose.  And since by that time I have searched
the presumably best move, I have the alpha bound established and searching
the rest of the root moves in parallel produces _zero_ search overhead, which
is a key point...







>
>On September 05, 2002 at 11:14:17, Rolf Tueschen wrote:
>
>>On September 05, 2002 at 10:53:22, Vincent Diepeveen wrote:
>>
>>>On September 04, 2002 at 14:29:05, Robert Hyatt wrote:
>>>
>>>you get even with fast memory on your quad only a 2.8 speedup
>>>on average at 30 positions.
>>
>>1. May I politely suggest to you that you stop this tradition of writing your
>>new remarks below the top line of the quotes "someone wrote..."?
>>
>>2. Here is a basic question from my ignorance and for all the young readers here
>>who wouldn't ask such questions which prove that someone hasn't understood
>>something.
>>
>>Could either Bob or Vincent explain why sometimes the duel is about below 2. and
>>then here above even 2.8 should be nothing special following Vincent?
>>
>>What is this number standing for is it about the advantage of a n+1 processor
>>number "over" n? And second, what is the point of significant advantage? Greater
>>than 2.? And all what is below 2, this is weak progress?
>>
>>3. Is 2. or 1.97 a factor of time, NPS or what?
>>
>>4. This split at the root thing, such a refinement I know from forest sciences,
>>but here, what does it mean, and most of all, how many chapters of CC I'm away
>>before I could understand such a term? And how does this topic influence the
>>question of magnitude of advantage through processor number? You see how
>>confusing that is...
>>
>>
>>I'm the real ignorant here and I still have the guts to admit it and to ask
>>these questions. Simply because otherwise I could never hope to improve.
>>
>>
>>Thank you!
>>
>>Rolf Tueschen
>>
>>
>>
>>>
>>>that's a lot more than a run on 1 position.
>>>
>>>>On September 04, 2002 at 14:02:54, Vincent Diepeveen wrote:
>>>>
>>>>>On September 04, 2002 at 12:43:37, Robert Hyatt wrote:
>>>>>
>>>>>Please run crafty at 16 processors. Fine with me.
>>>>>Even though it's a different program. I have no problems
>>>>>with it.
>>>>
>>>>And what would be the point?  I might give you some 16 processor
>>>>numbers on a NUMA machine before long.  I _might_.
>>>>
>>>>
>>>>>
>>>>>But rewrite also the article then that it's not a DTS thing,
>>>>>but a smp_lock thing that doesn't scale above 8 cpu's.
>>>>
>>>>
>>>>Vincent, the smp_lock thing doesn't hurt me thru 16 cpus as I already
>>>>know.  I don't understand why you don't follow this, but in a typical
>>>>3 minute search, I see numbers like this:
>>>>
>>>>              time=3:29  cpu=399%  mat=0  n=303284136  fh=89%  nps=1450k
>>>>              ext-> chk=4663926 cap=1175890 pp=230533 1rep=74539 mate=3299
>>>>              predicted=2  nodes=303284136  evals=99342268
>>>>              endgame tablebase-> probes done=0  successful=0
>>>>              SMP->  split=774  stop=133  data=14/64  cpu=13:55  elap=3:29
>>>>
>>>>That is from a real game played on ICC.
>>>>
>>>>Note it only did 774 splits.  that is 774 smp_locks.  Do you _really_ think
>>>>that hurts performance?  _really_?
>>>>
>>>>If so, I have this bridge I need to get rid of...
>>>>
>>>>You can say smp_lock is a problem all you want.  You can say that it killed
>>>>you on a NUMA machine all you want.  But that doesn't mean it kills _me_
>>>>on 8 or 16 processors...
>>>>
>>>>
>>>>BTW I would hate to publish 16 cpu crafty numbers, because that would probably
>>>>give you _another_ problem to overcome with your "sponsors".  :)



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.