Computer Chess Club Archives



Subject: Re: To Vasik - What is the progress of MP Rybka ?

Author: Mridul Muralidharan

Date: 08:35:53 01/20/06



Hi Vas,

  Responses at the bottom, inline.

Regards,
Mridul

On January 20, 2006 at 09:14:26, Vasik Rajlich wrote:

>On January 19, 2006 at 22:10:05, Mridul Muralidharan wrote:
>
>>On January 19, 2006 at 16:45:37, Tord Romstad wrote:
>>
>>>On January 19, 2006 at 15:02:26, Dann Corbit wrote:
>>>
>>>>On January 19, 2006 at 03:48:34, Tord Romstad wrote:
>>>>
>>>>>Why?  Wouldn't a chess engine using multiple processes and
>>>>>shared memory be just as fast as a similar program with
>>>>>multiple threads on a computer with several CPUs (or CPUs
>>>>>with multiple cores)?  I thought the engine using processes
>>>>>would just consume a bit more memory, and not have any
>>>>>significant disadvantages apart from that.
>>>>>
>>>>>I'm not saying that you are wrong, of course.  I am totally
>>>>>ignorant about this subject, and I ask in order to learn more.
>>>>
>>>>Process startup time versus thread startup time can be significant.
>>>
>>>Yes, of course, but I don't see why you would want to start new
>>>processes except during program initialisation.  It seems to me
>>>that the obvious thing to do would be to start all processes at
>>>once during program initialisation, and keep them running
>>>throughout the game.  When there is no work to do, a process
>>>would simply return to some sort of idle loop.
>>>
>>>>An executable consumes more computer resources than a thread.
>>>>
>>>>It's not a stupendous difference.  If you look at the efficiency of Deep Sjeng
>>>>(which uses processes) it's really not that bad.  The place where it would hurt
>>>>the most is in the extremes (e.g. 8 CPUs with 4 cores each).
>>>
>>>I still haven't decided what I will use.  Threads don't really
>>>look that much harder to me, but perhaps I'm missing something.
>>>I'm sure there are lots of difficulties I won't understand before
>>>I try.
>>>
>>>Tord
>>
>>
>>Hi Tord, all,
>>
>>  First off, a potentially boring blurb ... in which case, please ignore :)
>>
>>  In general for chess, my personal preference is to use processes rather than
>>threads, from a KISS perspective. I am going to ignore the contentious debates
>>about performance, cache locality issues, etc. here and just advocate based on
>>this one point ;-)
>>
>>  From my experience, having very well defined and explicit points of
>>interaction between the processes through the shared memory and mutexes really,
>>really helps.
>>It is more like having clearly defined interfaces (through data, though) - there
>>are already enough possible race conditions (especially initially, when you
>>start off) without introducing additional possibilities if you are trying out
>>DTS, etc. :-)
>>
>>It could be argued that parallelisation can be done efficiently with threads,
>>and it can be - but you do incur a more stack-based approach of shuffling
>>intermediate variables around, and you need to be very, very careful.
>>
>>Example:
>>In eval, typically I have an attack_table[2][64] initialised at the beginning
>>(now I have an incremental attack table - but the idea is the same, I would
>>guess) and use it everywhere else; ditto for other forms of info like
>>attack_patterns, king_safety, pawn_defects, etc.
>>Similarly, my pawn tables, eval hash tables, etc. also need to be either
>>lockless or - assuming more complex/larger info storage in the tables - gated.
>>
>>If you are using threads, either you use a stack-based approach and pass these
>>around, or you put them in a 'this_proc' structure and use them from there (so
>>an extra indirection for nearly everything: this_proc->attack_table[], etc.)
>>
>>One possible downside is that egtb probing will incur a higher penalty (from
>>what I remember - I could be wrong here), since you will not get the advantage
>>of the caching done by Eugene Nalimov's code across the various processes ...
>>this will be per process.
>>Maybe someday this will be in shared memory and remove that penalty!
>>
>>Of course, at the end of the day, it is just how you think about the problem
>>and your solutions to tackle it - there are no HUGE drawbacks to either
>>approach unless they are specific to some platform.
>>And the problems you solve are, IMO, pretty similar in both approaches.
>>And don't forget - we should have fun solving these problems :) so use the
>>approach which suits you.
>>
>>It would be great to see a parallel Gothmog or Glaurung version out there!
>>And don't worry, there is an unjust perception that writing a parallel engine
>>with a good speedup is tough - it is not really that tough, from my experience :)
>>
>>Regards,
>>Mridul
>>
>>PS:
>>Simplest way to kickstart a parallel engine?
>>1) Get/borrow a parallel box - at least a dual core.
>>
>>2) Design a parallel search on paper which can scale up to, say, 16 or 32
>>procs (forget that you might never run on that many anytime soon) - it is
>>better to have something which can potentially scale higher than to have to
>>rework the code after a couple of years. (zappa, diep, sjeng, crafty, etc.)
>>
>>3) The way I started 3 weeks ago was to write basic make/unmake move, movegen,
>>etc. and a parallel perft, and make that work. Then build on top of it.
>>This can be easily validated for correctness and yet should give you nearly
>>perfect linear 'time to solution' scaling.
>>
>>4) Lots of assertions - I have '#ifdef MP_DEBUG' which validates that all my
>>locks are freed/acquired in most of the relevant portions, the shared memory
>>state, etc.: this was a huge help in verifying that my search was not messing up.
>>
>>5) Have lots of coffee - especially when debugging at night :)
>>
>>Now my average speedup is 1.89x in time to depth for the positions that Vincent
>>posted (thanks Daniel!), with the worst case being 1.3 and hopefully improving
>>(on my dual core athlon box).
>
>Mridul,
>
>this 1.89 speedup, you got this by splitting off a new process (rather than
>thread)?

Yep, I spawn (num_proc - 1) child processes initially and they continue running
until the end of the program.
Currently, I just 'brutally kill' them when the main process exits (I should
move to some flag or signal mechanism which allows cleanup ... hmm, nothing to
clean up for now!).
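
Roughly, the 'brutal kill' is just something like the sketch below (names such
as MAX_PROCS and kill_children are only illustrative, not my actual code):

/* Kill all spawned children when the parent exits - no graceful shutdown. */
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

#define MAX_PROCS 64                    /* arbitrary cap for this sketch */

static pid_t child_pids[MAX_PROCS];
static int   num_children;

static void kill_children(void)
{
    int i;
    for (i = 0; i < num_children; i++)
        kill(child_pids[i], SIGKILL);   /* brutal: no cleanup in the child */
}

/* Registered once in the parent after spawning: atexit(kill_children); */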

Nor do I detect the death of a child process (which can only happen through an
explicit kill, since there are no major parallel bugs which can dump core, I
guess) - more so since I am not sure how to do this on Windows (SIGCHLD should
help on unix): does anyone have an idea how to solve this, btw? Thanks.
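
For the unix side, what I have in mind is roughly the sketch below (untested,
and the handler/flag names are only placeholders):

/* Reap dead children on SIGCHLD and raise a flag the search loop can check. */
#include <signal.h>
#include <string.h>
#include <sys/wait.h>

static volatile sig_atomic_t child_died = 0;

static void on_sigchld(int sig)
{
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)   /* non-blocking reap */
        child_died = 1;                      /* main loop aborts the search */
}

static void install_child_watch(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sa.sa_flags   = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);
}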

>
>I would be interested to hear more about this. Ie. when the process is started
>and (very roughly) what you pass to it.

Currently (like I explained to Tord), I am trying to unify the windows and unix
codepaths.
So essentially, I use argv[] to pass on:
* a string indicating that it is a child process
* an opaque string (randomly generated in the parent) which is used to arrive at
the various shm ids.
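
The spawning side is then roughly the sketch below (unix path only, and the
"--child" marker is just an illustrative name; the windows path would do the
same thing via CreateProcess):

/* Parent spawns (num_proc - 1) children, re-executing the engine binary in
   "child mode" and passing the opaque shm key on the command line. */
#include <sys/types.h>
#include <unistd.h>

void spawn_children(const char *engine_path, int num_proc, const char *opaque_key)
{
    int i;
    for (i = 0; i < num_proc - 1; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: exec ourselves with the child marker + opaque key. */
            execl(engine_path, engine_path, "--child", opaque_key, (char *)NULL);
            _exit(1);                 /* only reached if exec fails */
        }
        /* Parent: remember pid here so it can be killed at exit. */
    }
}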

Currently, the shms are set up like this:

* A 'root_search_position' is created in the parent and all children attach to
it. This is used to hold the initial conditions of the search, the search state, etc.
* Each child creates a 'search_block' which holds the search state of that
child and some related info (like proc_id, parent_id, etc.). Once created,
every other child and the parent attach to it.
So in all processes (child and parent), I have:

extern search_block **child_shm_blocks;
extern search_block *this_proc;
extern root_search_position *root_search_data;

All communication between the various processes happens through these shm
blocks.
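
One way to do the create/attach step, just as an illustration (POSIX shm shown;
my real code differs, and the windows path would use CreateFileMapping /
MapViewOfFile instead):

/* Create (in the owner) or attach (everywhere else) a named shared block
   whose name is derived from the opaque key passed in argv[]. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

void *attach_shm_block(const char *opaque_key, const char *suffix,
                       size_t size, int create)
{
    char name[128];
    snprintf(name, sizeof(name), "/%s_%s", opaque_key, suffix);

    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (create)
        ftruncate(fd, (off_t)size);           /* size it once, in the creator */

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : p;
}

So, for example, the parent would create the root block with something like
attach_shm_block(key, "root", sizeof(root_search_position), 1), each child
would create its own search_block under a per-child suffix, and every other
process would attach to it with create == 0.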

>
>A "full threaded parallel search" like in Crafty seems like a good chunk of
>work, not to mention a pain in the rear end to debug.
>
>Vas

Well, it took me three-plus weeks to get the above-mentioned speedup ... I have
not tried it on a quad because of the lack of availability of a machine.
But then, I think zappa and diep get much better numbers, and maybe so does
sjeng from what I have heard: and they are known to scale to a much higher
number of procs! So there is still scope for a lot of improvement for me :-D

I think there are differences between what crafty does and what I do - but the
basic ideas would be the same, I guess: except that subtrees can move between
procs in my case and not in crafty (I hope I am correct?). I think, from what
Hyatt mentioned, crayblitz did allow this though.

Please note, a single-proc version of my program is not tuned, tested, or
bug-free like most programs out here.
I hack up a version just to test a specific idea - so effective use of a single
proc would be enough to beat my parallel programs ;-)

Mridul


