Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Table statement ** Please a beginner's question for a change

Author: Eugene Nalimov
Date: 11:37:37 09/05/02
Here is (probably oversimplified) explanation.

Alpha-beta is a very sequental algorithm. Here are 2 reasons which made it very
hard to parallelize it well:

(1) You are using information obtained during the search of the first move when
you are searching the next ones to search them faster. As a result if you'll
start search them before you finished first one, you'll not be able to use that
information, and you'll search more nodes. I.e. on a dual machine you can search
1.9x more nodes, but part of them are the extra nodes that would not be searched
by the sequental program, so actual speedup would be even less.

(2) In a lot of positions (including root) you are spending 50% (or even more)
of time searching the first [best] move. If you'll not start using 2nd CPU till
you searched the first move, than your speedup on the dual system would be not
great. You'll still use 50% of total time searching the first move, and even
assuming that you can search remaining moves 2 times faster, on dual system you
will use 75% of the one-CPU search, i.e. you would get 1.33x speedup in the best
case.

So, when implementing parallel search, you should start using 2nd CPU early
enough, otherwise your speedup would be bad because first CPU will spend large
amount of time while 2nd CPU is idle. OTOH if you'll start to use it too early,
its' search will be less efficient (compared to the sequental version), it will
search extra nodes, and your speedup still will be bad.

And don't forget that while instructing the 2nd CPU what to search, 1st CPU is
spending its time, instead of doing more useful work (i.e. instead of
searching). That's the infamous 3k that are regularly mentioned in that thread
-- when staring the search using another CPU (splitting the tree) Crafty is
copying ~3k of data. If you'll do it too often that will kill your performance,
as data is going through slow bus to memory and another CPUs.

To complicate the things further, there is another reason of slowdown: several
CPUs share system bus, memory, etc. I.e. sometimes one CPU have to wait for
resources that other CPU is currently using, even if they are performing
independent tasks. That's why "raw nps" would increase less than 2x on the dual
system, regardless of the efficiency of algorithm used in the chess program.
Exact amount of that that slowdown depend on the type of CPU used, chipset, etc.
Here server CPUs and chipsets are the best, as they were designed with
scalability in mind. For example, I saw 1.98x speedup on an expensive 64-bit
system. And here AMD dual systems are not very good: for them typical "raw nps"
speedup is much less than for Intel systems. In theory, if that slowdown is bad
enough, you can see that dual system would be always slower than single-CPU: due
to inefficiency of the parallel search it would search more nodes, and each
noder would be searched slower...

Thanks,
Eugene

On September 05, 2002 at 11:14:17, Rolf Tueschen wrote:

>On September 05, 2002 at 10:53:22, Vincent Diepeveen wrote:
>
>>On September 04, 2002 at 14:29:05, Robert Hyatt wrote:
>>
>>you get even with fast memory on your quad only a 2.8 speedup
>>on average at 30 positions.
>
>1. May I politely suggest to you that you stop this tradition of writing your
>new remarks below the top line of the quotes "someone wrote..."?
>
>2. Here is a basic question from my ignorance and for all the young readers here
>who wouldn't ask such questions which prove that someone hasn't understood
>something.
>
>Could either Bob or Vincent explain why sometimes the duel is about below 2. and
>then here above even 2.8 should be nothing special following Vincent?
>
>What is this number standing for is it about the advantage of a n+1 processor
>number "over" n? And second, what is the point of significant advantage? Greater
>than 2.? And all what is below 2, this is weak progress?
>
>3. Is 2. or 1.97 a factor of time, NPS or what?
>
>4. This split at the root thing, such a refinement I know from forest sciences,
>but here, what does it mean, and most of all, how many chapters of CC I'm away
>before I could understand such a term? And how does this topic influence the
>question of magnitude of advantage through processor number? You see how
>confusing that is...
>
>
>I'm the real ignorant here and I still have the guts to admit it and to ask
>these questions. Simply because otherwise I could never hope to improve.
>
>
>Thank you!
>
>Rolf Tueschen
>
>
>
>>
>>that's a lot more than a run on 1 position.
>>
>>>On September 04, 2002 at 14:02:54, Vincent Diepeveen wrote:
>>>
>>>>On September 04, 2002 at 12:43:37, Robert Hyatt wrote:
>>>>
>>>>Please run crafty at 16 processors. Fine with me.
>>>>Even though it's a different program. I have no problems
>>>>with it.
>>>
>>>And what would be the point?  I might give you some 16 processor
>>>numbers on a NUMA machine before long.  I _might_.
>>>
>>>
>>>>
>>>>But rewrite also the article then that it's not a DTS thing,
>>>>but a smp_lock thing that doesn't scale above 8 cpu's.
>>>
>>>
>>>Vincent, the smp_lock thing doesn't hurt me thru 16 cpus as I already
>>>know.  I don't understand why you don't follow this, but in a typical
>>>3 minute search, I see numbers like this:
>>>
>>>              time=3:29  cpu=399%  mat=0  n=303284136  fh=89%  nps=1450k
>>>              ext-> chk=4663926 cap=1175890 pp=230533 1rep=74539 mate=3299
>>>              predicted=2  nodes=303284136  evals=99342268
>>>              endgame tablebase-> probes done=0  successful=0
>>>              SMP->  split=774  stop=133  data=14/64  cpu=13:55  elap=3:29
>>>
>>>That is from a real game played on ICC.
>>>
>>>Note it only did 774 splits.  that is 774 smp_locks.  Do you _really_ think
>>>that hurts performance?  _really_?
>>>
>>>If so, I have this bridge I need to get rid of...
>>>
>>>You can say smp_lock is a problem all you want.  You can say that it killed
>>>you on a NUMA machine all you want.  But that doesn't mean it kills _me_
>>>on 8 or 16 processors...
>>>
>>>
>>>BTW I would hate to publish 16 cpu crafty numbers, because that would probably
>>>give you _another_ problem to overcome with your "sponsors".  :)
Re: Table statement ** Please a beginner's question for a change Rolf Tueschen 09:40:41 09/06/02
Re: Table statement ** Please a beginner's question for a change Robert Hyatt 12:11:49 09/05/02
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.