Computer Chess Club Archives


Subject: Re: new questions (scott gasch!)

Author: Scott Gasch

Date: 15:10:38 11/19/04


On November 19, 2004 at 17:07:15, Robert Hyatt wrote:

>On November 19, 2004 at 16:09:44, Scott Gasch wrote:
>
>>On November 19, 2004 at 15:50:23, Anthony Cozzie wrote:
>>
>>>On November 19, 2004 at 15:34:32, Scott Gasch wrote:
>>>
>>>>On November 19, 2004 at 15:06:27, Anthony Cozzie wrote:
>>>>
>>>>>On November 19, 2004 at 14:34:59, Scott Gasch wrote:
>>>>>
>>>>>>On November 19, 2004 at 11:28:34, Anthony Cozzie wrote:
>>>>>>
>>>>>>>Well, I also did a pretty complicated algorithm, and I designed my own.  If you
>>>>>>>borrowed someone else's, things might go easier for you, although I think that
>>>>>>>takes a lot of the fun out of it. It is also possible you are simply more
>>>>>>>talented than I am at this.  However, I wouldn't say that it works until you
>>>>>>>have run it on a dual for a week or so with no problems.  I *thought* I had mine
>>>>>>>working after about two weeks of implementation :)  I think maybe now I'm down
>>>>>>>to only one bug, which seems to occur only on ICC.
>>>>>>>
>>>>>>>The next task for you is to measure the speedup you get.  BTW, I do have to take
>>>>>>>back my comment about ABDADA.  After talking with some people, it seems like
>>>>>>>ABDADA gives a _really_ poor speedup (<1.3 or so, which is barely worth doing).
>>>>>>>Of course, Scott's algorithm probably works better.  Scott, what kind of speedup
>>>>>>>do you get?
>>>>>>>
>>>>>>>anthony
>>>>>>
>>>>>>Well the way I did it was to make the engine able to search in parallel and then
>>>>>>pick a likely all-node to split the tree.  I do not split at the root and I do
>>>>>>not split when there is not enough depth left in the tree to justify it.  Other
>>>>>>than that, anytime I'm on a PV node and have searched the first move and it
>>>>>>failed low, or anytime I'm on a non-PV node and the first N moves fail low (N for
>>>>>>me = 3, but this is obviously adjustable), I split... assuming there is a split
>>>>>>node available and an idle thread to help search.  This works pretty well, I'd
>>>>>>say on average about 5-10% of my split nodes end up failing high.
>>>>>>
>>>>>>I've only tested on my dual proc 1.5GHz Athlon but I see a very good speedup
>>>>>>most of the time... When I was measuring and coding this I was seeing 1.8x to
>>>>>>1.9x on a dual proc.  My initial speedups were more on the order of about 1.4x
>>>>>>or 1.5x but this was due to hash table contention and pawn hash contention (see
>>>>>>below).  One of these days I will borrow a bigger machine at work and benchmark
>>>>>>there.  If I do that I'll post results here.
>>>>>>
>>>>>>I have heard of some people not running into contention in the hash table when
>>>>>>going multi-threaded.  For example one friend of mine did a MT search and ended
>>>>>>up just locking the whole table when one thread was accessing it... and not
>>>>>>seeing any slowdown.  This was not my experience at all, so I ended up locking
>>>>>>1/64th of the hash table at a time.  As the number of CPUs searching
>>>>>>scales up, I suspect this would have to become 1/128th or 1/256th.  Also
>>>>>>I have a multi-probe algorithm in the pawn hash table that I suspect would not
>>>>>>scale out very well with N CPUs.  One solution to this I think would be to have
>>>>>>a separate pawn hash table in each thread.
>>>>>>
>>>>>>Scott
>>>>>
>>>>>So something fairly similar to Crafty?  Do you have a recursive or iterative
>>>>>search?  Do you require the thread that initiated the split to return, or can
>>>>>any thread return from a splitpoint?  How does it synchronize after aborts?
>>>>>
>>>>>anthony
>>>>
>>>>I don't know how Crafty works.  I know Bob was kind enough to send me a paper
>>>>about DTS once, though... so I kinda assumed that was how Crafty worked.
>>>>
>>>>Search is recursive.  Yes, the thread that initiated the split must be the last
>>>>out of the split.  The way I did aborts is that every thread searching under a
>>>>split has a pointer in its "thread context" to the split node it is working
>>>>under.  When one thread fails high it toggles a flag in the split node.  Every N
>>>>nodes all threads check to see if the move timer has expired... and if they are
>>>>under a split they also see if their split has been aborted (i.e. some other
>>>>thread failed high).  In either case the search just unrolls.
>>>>
>>>>Because the thread that initiated the split has to be the last out I double
>>>>reference the split with that thread.  Before it returns to the main search
>>>>frame that called into SearchInParallel that thread spins waiting on the split
>>>>to become unreferenced by helper threads.  Because of this we want to limit the
>>>>number of aborts and limit the amount of time between when one thread realizes
>>>>it has failed high and when the rest of the helpers unroll.
>>>>
>>>>I suppose this could be optimized but with my dual proc it seems to work fine.
>>>>
>>>>Scott
>>>
>>>OK, this sounds very very similar to the Crafty algorithm, which gives a pretty
>>>reasonable speedup.  It is worth noting, however, that Crafty does not use DTS.
>>>I hope you'll forgive me for not talking more about Zappa's parallel algorithm,
>>>but I spent quite some time on it, and it requires an iterative search, so most
>>>people wouldn't want to implement it anyway.
>>>
>>>anthony
>>
>>I just implemented the most straightforward thing I could think of, I'm sure
>>there are improvements.  Anthony alluded to one weakness of this model: there
>>is a definite "main thread" and pool of "helpers".  The helpers spend their
>>lives in these functions: IdleLoop, HelpSearch, Search, QSearch, and Eval.
>>Whereas the main thread's job stays the same.  With more than 2 cpus a helper
>>thread could, in theory, re-split the tree below a split... I did this on
>>purpose but I have really never tested it b/c I have never run on more than 2
>>cpus.  I suspect there might be some issues lurking.  But the weakness is that
>>the thread that initialized the split has to wait for all the rest of the helpers
>>to see an abort and get out of the split.  I suspect there is a way around this
>>but it's non-trivial.
>
>There is a way around it.  :)
>
>Send your original thread into the same function you park the other threads in,
>i.e. in Crafty it is called "ThreadWait()".  Then the master can finish at that
>node first, go back to ThreadWait() and the other thread can ask it to help
>deeper in the tree.  Works fine in Crafty...
>
>

Ah I see now.  So instead of the thread that initiated the split spinning while
waiting for the rest of the threads to finish up it can jump into IdleLoop and
help out at a split further down the tree.  This is clever and something I will
have to think more about when/if I try to tackle a 4-CPU machine.  Thanks, Bob.

All I need now is a dual Opteron with hyperthreading (late 2005?) or a quad Xeon
or something.  And of course the funds to afford such a thing...

Scott

>>
>>A couple of other notes for people who care enough to read out to the end of
>>this thread: I only handle user input from the main thread so that I do not have
>>to lock anything in the input parsing routines.  I actually implemented a
>>version where every thread could handle input but it was needlessly complicated
>>and really served no purpose.
>>
>>I use the timestamp counter (RDTSC) to count cycles and compute a busyness
>>number for every helper thread.  You want this at 95% or above -- if a thread is
>>spending more than 5% of its time in the IdleLoop there is something wrong.
>>
>>Before you start working on parallel search make sure you understand
>>multithreaded programming.  One thing that will also help is googling for "x86
>>memory model".  In fact I have this text pasted at the top of my split.c file:
>>
>>    "The most familiar model is X86.  It's a relatively strong model.
>>     Stores are never reordered with respect to other stores.  But, in
>>     the absence of data dependence, loads can be reordered with
>>     respect to other loads and stores.  Many X86 developers don't
>>     realize that this reordering is possible, though it can lead to
>>     some nasty failures under stress on big MP machines.
>>
>>     In terms of the above, the memory model for X86 can be described as:
>>
>>        1. All stores are actually store.release (upward memory fence).
>>           This means that normal loads can be moved above the
>>           store.release but nothing can be moved below it.
>>        2. All loads are normal loads.
>>        3. Any use of the LOCK prefix (e.g. LOCK CMPXCHG or LOCK INC)
>>           creates a full memory fence."
>>
>>I don't know what that's from; some web page I read sometime.  But if you
>>understand what it says, it will help you figure out when you can get away with
>>not locking something, as long as you are willing to limit yourself to one
>>processor family (x86).
>>
>>Good luck to any prospective parallel searchers!
>>
>>Scott



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.