Computer Chess Club Archives



Subject: Re: To Vasik - What is the progress of MP Rybka ?

Author: Vasik Rajlich

Date: 07:54:23 01/21/06



On January 20, 2006 at 11:35:53, Mridul Muralidharan wrote:

>Hi Vas,
>
>  Responses at the bottom , inline.
>
>Regards,
>Mridul
>
>On January 20, 2006 at 09:14:26, Vasik Rajlich wrote:
>
>>On January 19, 2006 at 22:10:05, Mridul Muralidharan wrote:
>>
>>>On January 19, 2006 at 16:45:37, Tord Romstad wrote:
>>>
>>>>On January 19, 2006 at 15:02:26, Dann Corbit wrote:
>>>>
>>>>>On January 19, 2006 at 03:48:34, Tord Romstad wrote:
>>>>>
>>>>>>Why?  Wouldn't a chess engine using multiple processes and
>>>>>>shared memory be just as fast as a similar program with
>>>>>>multiple threads on a computer with several CPUs (or CPUs
>>>>>>with multiple cores)?  I thought the engine using processes
>>>>>>would just consume a bit more memory, and not have any
>>>>>>significant disadvantages apart from that.
>>>>>>
>>>>>>I'm not saying that you are wrong, of course.  I am totally
>>>>>>ignorant about this subject, and I ask in order to learn more.
>>>>>
>>>>>Process startup time, versus thread startup time, can be significant.
>>>>
>>>>Yes, of course, but I don't see why you would want to start new
>>>>processes except during program initialisation.  It seems to me
>>>>that the obvious thing to do would be to start all processes at
>>>>once during program initialisation, and keep them running
>>>>throughout the game.  When there is no work to do, a process
>>>>would simply return to some sort of idle loop.
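>>>>
>>>>Very roughly, I imagine something like this (just a sketch - the flag would
>>>>live in shared memory, and the helper names are made up):
>>>>
>>>>#include <unistd.h>
>>>>
>>>>struct worker_slot {
>>>>    volatile int have_work;   /* set by the master when work is assigned */
>>>>    volatile int quit;        /* set by the master at program exit       */
>>>>};
>>>>
>>>>void worker_loop(struct worker_slot *slot)
>>>>{
>>>>    while (!slot->quit) {
>>>>        if (!slot->have_work) {
>>>>            usleep(1000);          /* idle: don't burn the CPU           */
>>>>            continue;
>>>>        }
>>>>        /* search_assigned_subtree(slot);  <- the engine-specific part   */
>>>>        slot->have_work = 0;       /* report back to the master          */
>>>>    }
>>>>}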
>>>>
>>>>>An executable consumes more computer resources than a thread.
>>>>>
>>>>>It's not a stupendous difference.  If you look at the efficiency of Deep Sjeng
>>>>>(which uses processes) it's really not that bad.  The place where it would hurt
>>>>>the most is in the extremes (e.g. 8 CPUs with 4 cores each).
>>>>
>>>>I still haven't decided what I will use.  Threads don't really
>>>>look that much harder to me, but perhaps I'm missing something.
>>>>I'm sure there are lots of difficulties I won't understand before
>>>>I try.
>>>>
>>>>Tord
>>>
>>>
>>>Hi Tord, all,
>>>
>>>  First off, a potentially boring blurb ... in which case, please ignore :)
>>>
>>>  In general for chess, my personal preference is to use processes rather than
>>>threads, from a KISS perspective. I am going to ignore the contentious debates
>>>about performance, cache locality issues, etc. here and just advocate based
>>>on this one point ;-)
>>>
>>>  From my experience, having very well defined and explicit points of
>>>interaction between the processes, through the shared memory and mutexes,
>>>really helps.
>>>It is more like having clearly defined interfaces (through data, though) - there
>>>are already enough possible race conditions (especially when you are starting
>>>out) without introducing additional ones if you are trying out DTS, etc. :-)
>>>
>>>It could be argued that parallelisation can be done efficiently with threads,
>>>and it can be - but you do incur a more stack-based approach of shuffling
>>>intermediate variables around, and you need to be very careful.
>>>
>>>Example:
>>>In eval, I typically have an attack_table[2][64] initialised at the beginning
>>>(now I have an incremental attack table - but the idea is the same, I would
>>>guess) and use it everywhere else; ditto for other forms of info like
>>>attack_patterns, king_safety, pawn_defects, etc.
>>>Similarly, my pawn tables, eval hash tables, etc. also need to be either
>>>lockless or - assuming more complex/larger info storage in the tables - gated.
>>>
>>>If you are using threads, either you use a stack-based approach and pass these
>>>around, or you put them in a 'this_proc' structure and use them from there (so
>>>an extra indirection for nearly everything: this_proc->attack_table[], etc.).
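>>>
>>>Roughly, the difference looks like this (just a sketch - the names are only
>>>illustrative):
>>>
>>>/* Per-process approach: plain globals; each process gets its own copy.  */
>>>int attack_table[2][64];
>>>
>>>int eval_with_processes(void)
>>>{
>>>    return attack_table[0][12];              /* direct access everywhere */
>>>}
>>>
>>>/* Per-thread approach: the same data hangs off a per-thread struct.     */
>>>typedef struct {
>>>    int attack_table[2][64];
>>>    /* pawn tables, king_safety scratch space, ...                       */
>>>} search_state;
>>>
>>>int eval_with_threads(search_state *this_proc)
>>>{
>>>    return this_proc->attack_table[0][12];   /* extra indirection everywhere */
>>>}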
>>>
>>>One possible downside is that EGTB probing will incur a higher penalty (from
>>>what I remember - I could be wrong here), since you will not get the benefit of
>>>the caching done by Eugene Nalimov's code across the various processes ... the
>>>cache will be per process.
>>>Maybe someday this cache will be in shared memory and that penalty will go away!
>>>
>>>Of course, at the end of the day, it is just how you think about the problem
>>>and how you tackle it - there are no HUGE drawbacks to either approach unless
>>>they are specific to some platform.
>>>And the problems you solve are, IMO, pretty similar in both approaches.
>>>And don't forget - we should have fun solving these problems :) so use the
>>>approach which suits you.
>>>
>>>It would be great to see a parallel Gothmog or Glaurung version out there!
>>>And don't worry - there is an unjust perception that writing a parallel engine
>>>with a good speedup is tough; it is not really that tough, in my experience :)
>>>
>>>Regards,
>>>Mridul
>>>
>>>PS:
>>>Simplest way to kickstart a parallel engine?
>>>1) Get/borrow a parallel box - at least a dual core.
>>>
>>>2) Design a parallel search on paper which can scale up to, say, 16 or 32
>>>processors (forget that you might never run on that many anytime soon) - it is
>>>better to have something which can potentially scale higher than to have to
>>>rework the code after a couple of years. (Zappa, Diep, Sjeng, Crafty, etc.)
>>>
>>>3) The way I started three weeks ago was to write basic make/unmake move,
>>>movegen, etc. and a parallel perft, and make this work. Then build on top of it.
>>>This can be easily validated for correctness and yet should give you nearly
>>>perfect linear 'time to solution' scaling. (A rough sketch follows at the end
>>>of this post.)
>>>
>>>4) Lots of assertions - I have an '#ifdef MP_DEBUG' build which validates that
>>>all my locks are acquired/freed in most of the relevant places, checks the
>>>shared memory state, etc.: this was a huge help in validating that my search
>>>was not messing up.
>>>
>>>5) Have lots of coffee - especially when debugging at night :)
>>>
>>>Now my average speedup in time to depth is 1.89x for the positions that Vincent
>>>posted (thanks Daniel!), with the worst case being 1.3x and hopefully improving
>>>(on my dual core Athlon box).
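>>>
>>>Here is roughly what I mean by a parallel perft in step 3 - only a sketch, for
>>>the Unix side (fork plus an anonymous shared mapping), with error handling
>>>omitted; it assumes you already have your own position/move types, movegen and
>>>make/unmake (the names below are made up):
>>>
>>>#include <sys/mman.h>
>>>#include <sys/wait.h>
>>>#include <unistd.h>
>>>
>>>typedef struct position position_t;    /* assumed: your position struct     */
>>>typedef int move_t;                    /* assumed: however you encode moves */
>>>
>>>extern int  generate_moves(position_t *pos, move_t *moves);       /* assumed */
>>>extern void make_move(position_t *pos, move_t m);                 /* assumed */
>>>extern void unmake_move(position_t *pos, move_t m);               /* assumed */
>>>extern unsigned long long perft(position_t *pos, int depth);      /* assumed */
>>>
>>>unsigned long long parallel_perft(position_t *pos, int depth, int num_proc)
>>>{
>>>    move_t moves[256];
>>>    int nmoves = generate_moves(pos, moves);
>>>
>>>    /* one shared result slot per root move, visible to parent and children */
>>>    unsigned long long *counts =
>>>        mmap(NULL, nmoves * sizeof *counts, PROT_READ | PROT_WRITE,
>>>             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>
>>>    for (int p = 0; p < num_proc; p++) {
>>>        if (fork() == 0) {                       /* child process           */
>>>            for (int i = p; i < nmoves; i += num_proc) {
>>>                make_move(pos, moves[i]);        /* our own copy of pos     */
>>>                counts[i] = perft(pos, depth - 1);
>>>                unmake_move(pos, moves[i]);
>>>            }
>>>            _exit(0);
>>>        }
>>>    }
>>>
>>>    unsigned long long total = 0;
>>>    for (int p = 0; p < num_proc; p++)
>>>        wait(NULL);                              /* join all the children   */
>>>    for (int i = 0; i < nmoves; i++)
>>>        total += counts[i];
>>>    munmap(counts, nmoves * sizeof *counts);
>>>    return total;
>>>}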
>>
>>Mridul,
>>
>>this 1.89x speedup - did you get this by splitting off new processes (rather
>>than threads)?
>
>Yep, I spawn (num_proc - 1) child processes initially and they keep running
>until the end of the program.
>Currently, I just 'brutally kill' them when the main process exits (I should
>move to some flag or signal mechanism which allows cleanup ... hmm, nothing to
>clean up for now!).
>
>Nor do I detect the death of a child process (which can happen only through an
>explicit kill, since there are no major parallel bugs which can dump core, I
>guess) - more so since I am not sure how to do this on Windows (SIGCHLD should
>help on Unix): does anyone have an idea how to solve this, btw? Thanks.
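>
>(Something like this for the Unix side, I suppose - only async-signal-safe work
>in the handler; the Windows side is the part I am not sure about.)
>
>#include <signal.h>
>#include <sys/wait.h>
>
>static volatile sig_atomic_t child_died = 0;
>
>static void on_sigchld(int sig)
>{
>    (void)sig;
>    while (waitpid(-1, NULL, WNOHANG) > 0)   /* reap every exited child */
>        child_died = 1;
>}
>
>/* parent init:       signal(SIGCHLD, on_sigchld);                      */
>/* main search loop:  check child_died and abort the search cleanly     */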
>
>>
>>I would be interested to hear more about this, i.e. when the process is started
>>and (very roughly) what you pass to it.
>
>Currently (as I explained to Tord), I am trying to unify the Windows and Unix
>codepaths.
>So essentially, I use argv[] to pass on:
>* a string indicating that it is a child process
>* an opaque string (randomly generated in the parent) which is used to derive
>the various shm ids.
>
>Currently, the shm blocks are laid out like this:
>
>* A 'root_search_position' is created in the parent and all children attach to
>it.
>This is used to hold the initial conditions of the search, the search state, etc.
>* Each child creates a 'search_block' which holds the state of that child's
>search and some related info (like proc_id, parent_id, etc.).
>Once created, every other child/parent attaches to it.
>So in all processes (child and parent), I have:
>
>extern search_block **child_shm_blocks;
>extern search_block *this_proc;
>extern root_search_position *root_search_data;
>
>All communication between the various processes happens through these shm
>blocks.
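>
>On the Unix side, creating/attaching one of these named blocks from the opaque
>string looks roughly like this (a sketch only - the attach_block() helper and
>the name scheme are simplifications, error handling is minimal; on Windows the
>equivalent would be CreateFileMapping()/MapViewOfFile()):
>
>#include <fcntl.h>
>#include <stdio.h>
>#include <sys/mman.h>
>#include <unistd.h>
>
>static void *attach_block(const char *opaque_id, const char *tag,
>                          size_t size, int create)
>{
>    char name[64];
>    snprintf(name, sizeof name, "/eng-%s-%s", opaque_id, tag);
>
>    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
>    if (fd < 0)
>        return NULL;
>    if (create && ftruncate(fd, (off_t)size) < 0) {  /* size it once, in the creator */
>        close(fd);
>        return NULL;
>    }
>
>    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>    close(fd);                                       /* the mapping stays valid */
>    return mem == MAP_FAILED ? NULL : mem;
>}
>
>/* parent:   root_search_data = attach_block(id, "root", sizeof(root_search_position), 1); */
>/* children: root_search_data = attach_block(id, "root", sizeof(root_search_position), 0); */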
>
>>
>>A "full threaded parallel search" like in Crafty seems like a good chunk of
>>work, not to mention a pain in the rear end to debug.
>>
>>Vas
>
>Well, it took me three-plus weeks to get the above mentioned speedup ... I have
>not tried it on a quad for lack of an available machine.
>But then, I think Zappa and Diep get much better numbers, and maybe so does
>Sjeng from what I have heard: and they are known to scale to a much higher
>number of processors!
>So there is still scope for a lot of improvement for me :-D
>
>I think there are differences between what Crafty does and what I do - but the
>basic ideas would be the same, I guess: except that subtrees can move between
>processors in my case and not in Crafty (I hope I am correct?). From what Hyatt
>mentioned, I think Cray Blitz did allow this, though.
>
>Please note, the single-processor version of my program is not tuned or tested
>or bug-free like most programs out here.
>I hack up a version just to test a specific idea - so effective use of a single
>processor would be enough to beat my parallel program ;-)
>
>Mridul

Hi Mridul,

thanks for the description. I will probably bug you with some questions in a few
weeks when I start looking in more detail at this stuff. You've already got me
thinking now, which is always a bad sign :)

Vas


