Author: Vasik Rajlich
Date: 07:54:23 01/21/06
On January 20, 2006 at 11:35:53, Mridul Muralidharan wrote:

>Hi Vas,
>
>Responses at the bottom, inline.
>
>Regards,
>Mridul
>
>On January 20, 2006 at 09:14:26, Vasik Rajlich wrote:
>
>>On January 19, 2006 at 22:10:05, Mridul Muralidharan wrote:
>>
>>>On January 19, 2006 at 16:45:37, Tord Romstad wrote:
>>>
>>>>On January 19, 2006 at 15:02:26, Dann Corbit wrote:
>>>>
>>>>>On January 19, 2006 at 03:48:34, Tord Romstad wrote:
>>>>>
>>>>>>Why? Wouldn't a chess engine using multiple processes and shared
>>>>>>memory be just as fast as a similar program with multiple threads on
>>>>>>a computer with several CPUs (or CPUs with multiple cores)? I thought
>>>>>>the engine using processes would just consume a bit more memory, and
>>>>>>not have any significant disadvantages apart from that.
>>>>>>
>>>>>>I'm not saying that you are wrong, of course. I am totally ignorant
>>>>>>about this subject, and I ask in order to learn more.
>>>>>
>>>>>Process startup time versus thread startup time can be significant.
>>>>
>>>>Yes, of course, but I don't see why you would want to start new
>>>>processes except during program initialisation. It seems to me that the
>>>>obvious thing to do would be to start all processes at once during
>>>>program initialisation, and keep them running throughout the game. When
>>>>there is no work to do, a process would simply return to some sort of
>>>>idle loop.
>>>>
>>>>>An executable consumes more computer resources than a thread.
>>>>>
>>>>>It's not a stupendous difference. If you look at the efficiency of
>>>>>Deep Sjeng (which uses processes), it's really not that bad. The place
>>>>>where it would hurt the most is in the extremes (e.g. 8 CPUs with 4
>>>>>cores each).
>>>>
>>>>I still haven't decided what I will use. Threads don't really look that
>>>>much harder to me, but perhaps I'm missing something. I'm sure there
>>>>are lots of difficulties I won't understand before I try.
>>>>
>>>>Tord
>>>
>>>Hi Tord, all,
>>>
>>>First off, potential boring blurb ... in which case, please ignore :)
>>>
>>>In general for chess, my personal preference is to use processes rather
>>>than threads, from a KISS perspective. I am going to ignore the
>>>contentious debates here about performance, cache locality issues, etc.,
>>>and just advocate based on this one point ;-)
>>>
>>>From my experience, having very well defined and explicit points of
>>>interaction between the processes, through the shared memory and mutexes,
>>>really helps. It is more like having clearly defined interfaces (through
>>>data, though) - there are already plenty of possible race conditions
>>>(especially when you start off) without introducing additional
>>>possibilities if you are trying out DTS, etc. :-)
>>>
>>>It could be argued that parallelisation can be done efficiently with
>>>threads, and it can be - but you then incur a more stack-based approach
>>>of shuffling intermediate variables around, and you need to be very,
>>>very careful.
>>>
>>>Example:
>>>In eval, typically I have an attack_table[2][64] initialised at the
>>>beginning (now I have an incremental attack table - but the idea is the
>>>same, I would guess) and use it everywhere else; ditto for other forms
>>>of info like attack patterns, king safety, pawn defects, etc.
>>>Similarly, my pawn tables, eval hash tables, etc. also need to be
>>>lockless or - assuming more complex/larger info storage in the tables -
>>>gated.
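[Editor's note: the eval-table contrast being described here might look roughly like the following sketch. All names (`eval_king_safety`, `search_context`, the placeholder logic) are hypothetical illustrations, not Mridul's actual code.]

```c
#include <stdint.h>

/* Process-per-CPU design: each process gets a private copy of these
 * globals for free, because every process has its own address space.
 * No locking and no extra indirection is needed inside eval(). */
uint64_t attack_table[2][64];   /* per-side attack info, rebuilt in eval */

int eval_king_safety(int side) {
    /* direct global access - no context-pointer indirection */
    return (int)(attack_table[side][0] & 0xff);  /* placeholder logic */
}

/* Threaded design: the same scratch data must be per-thread, so nearly
 * every access goes through a context pointer (the 'this_proc->...'
 * indirection the post describes) or thread-local storage. */
typedef struct search_context {
    uint64_t attack_table[2][64];
    /* ... pawn hash, king safety scratch, etc. ... */
} search_context;

int eval_king_safety_mt(const search_context *this_proc, int side) {
    return (int)(this_proc->attack_table[side][0] & 0xff);
}
```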
>>>
>>>If you are using threads, either you would use a stack-based approach
>>>and pass these around, or you would put them in a 'this_proc' structure
>>>and use them from there (so an extra indirection for nearly everything:
>>>this_proc->attack_table[], etc.).
>>>
>>>One possible downside is that EGTB probing will incur a higher penalty
>>>(from what I remember - I could be wrong here), since you will not get
>>>the advantage of the caching done by Eugene Nalimov's code across the
>>>various processes ... this cache will be per process.
>>>Maybe someday it will live in shared memory and remove that penalty!
>>>
>>>Of course, at the end of the day, it is just how you think about the
>>>problem and your solutions to tackle it - there are no HUGE drawbacks to
>>>either approach unless they are specific to some platform.
>>>And the problems you solve are, IMO, pretty similar in both approaches.
>>>And don't forget - we should have fun solving these problems :) so use
>>>the approach which suits you.
>>>
>>>Would be great to see a parallel Gothmog or Glaurung version out there!
>>>And don't worry, there is an unjust perception that writing a parallel
>>>engine with good speedup is tough - it is not really that tough, from my
>>>experience :)
>>>
>>>Regards,
>>>Mridul
>>>
>>>PS:
>>>Simplest way to kickstart a parallel engine?
>>>1) Get/borrow a parallel box - at least a dual core.
>>>
>>>2) Design a parallel search on paper which can scale up to, say, 16 or
>>>32 processors (forget that you might never run on that many anytime
>>>soon) - it is better to have something which can potentially scale
>>>higher than to have to rework the code after a couple of years. (Zappa,
>>>Diep, Sjeng, Crafty, etc.)
>>>
>>>3) The way I started 3 weeks ago was to write basic make/unmake move,
>>>movegen, etc. and a parallel perft, and make this work. Then build on
>>>top of it. This can be easily validated for correctness, and yet should
>>>give you nearly linear 'time to solution' scaling.
>>>
>>>4) Lots of assertions - I have '#ifdef MP_DEBUG' code which validates
>>>that all my locks are acquired/freed in most of the relevant portions,
>>>that the shared memory state is sane, etc.: this was a huge help in
>>>validating that my search was not messing up.
>>>
>>>5) Have lots of coffee - especially when debugging at night :)
>>>
>>>Now my average speedup is 1.89x in time to depth for the positions that
>>>Vincent posted (thanks Daniel!), with the worst case being 1.3, and
>>>hopefully improving (on my dual core Athlon box).
>>
>>Mridul,
>>
>>this 1.89 speedup, you got this by splitting off a new process (rather
>>than a thread)?
>
>Yep, I spawn (num_proc - 1) child processes initially and they continue on
>till the end of the program.
>Currently, I just 'brutally kill' them when the main process exits (I
>should move to some flag or signal mechanism which allows cleanup ... hmm,
>nothing to clean up for now!).
>
>Neither do I detect the death of a child process (which can happen only
>through an explicit kill, since there are no major parallel bugs which can
>dump core, I guess) - more so since I am not sure how to do this on
>Windows (SIGCHLD should help on Unix): anyone have an idea how to solve
>this, btw? Thanks.
>
>>
>>I would be interested to hear more about this. I.e. when the process is
>>started and (very roughly) what you pass to it.
>
>Currently (like I explained to Tord), I am trying to unify both the
>Windows and the Unix codepaths.
>So essentially, I use argv[] to pass on:
>* a string indicating that it is a child process
>* an opaque string (randomly generated in the parent) which is used to
>arrive at the various shm ids.
>
>Currently, the shm's are like this:
>
>* A 'root_search_position' is created in the parent and all children
>attach to it. This is used to hold the initial conditions of the search,
>search state, etc.
>* Each child creates a 'search_block' which holds the state of that
>child's search and some related info (like proc_id, parent_id, etc.).
>Once created, every other child/parent will attach to it.
>So in all processes (child and parent), I have:
>
>extern search_block **child_shm_blocks;
>extern search_block *this_proc;
>extern root_search_position *root_search_data;
>
>All communication between the various processes happens through these shm
>blocks.
>
>>
>>A "full threaded parallel search" like in Crafty seems like a good chunk
>>of work, not to mention a pain in the rear end to debug.
>>
>>Vas
>
>Well, it took me three-plus weeks to get the above mentioned speedup ... I
>have not tried it on a quad because of the lack of availability of a
>machine.
>But then, I think Zappa and Diep get much better values, and maybe so does
>Sjeng from what I heard: and they are known to scale to much higher
>numbers of procs! So there is still scope for a lot of improvement on my
>side :-D
>
>I think there are differences between what Crafty does and what I do - but
>the basic ideas would be the same, I guess: except that subtrees can move
>between procs in my case and not in Crafty (I hope I am correct?). I
>think, from what Hyatt mentioned, Cray Blitz did allow this though.
>
>Please note, a single-proc version of my program is not tuned or tested or
>bug-free like most programs out here.
>I hack up a version just to test a specific idea - so effective use of a
>single proc would be enough to beat my parallel programs ;-)
>
>Mridul

Hi Mridul,

thanks for the description. I will probably bug you with some questions in
a few weeks when I start looking in more detail at this stuff.

You've already got me thinking now, which is always a bad sign :)

Vas
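[Editor's note: a minimal Unix-only sketch of the startup scheme described in this exchange. All names here (`root_search_position`, `create_root_block`, the demo key) are hypothetical illustrations, not Mridul's actual code. The parent creates a shared block under an opaque name via POSIX `shm_open`/`mmap`, children attach to the same name (in the real scheme the name travels to exec'd children via argv[]; for brevity this sketch uses plain fork()), and a SIGCHLD handler is the Unix mechanism mentioned for noticing a dead child.]

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared layout: root search state plus per-child blocks. */
typedef struct {
    int  proc_id;
    long nodes;              /* nodes searched by this child */
} search_block;

typedef struct {
    int          num_procs;
    search_block child[4];   /* fixed-size for the sketch */
} root_search_position;

static volatile sig_atomic_t dead_children = 0;

/* SIGCHLD fires in the parent when a child dies (Unix only, as noted). */
static void on_sigchld(int sig) { (void)sig; dead_children++; }

/* Parent-side creation of the shared block. In the real scheme 'key'
 * is randomly generated and passed to child processes in argv[]. */
static root_search_position *create_root_block(const char *key) {
    int fd = shm_open(key, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); exit(1); }
    if (ftruncate(fd, sizeof(root_search_position)) != 0) exit(1);
    void *p = mmap(NULL, sizeof(root_search_position),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    memset(p, 0, sizeof(root_search_position));
    return (root_search_position *)p;
}

static long run_demo(void) {
    const char *key = "/engine_demo_shm";   /* would be random in practice */
    root_search_position *root = create_root_block(key);
    signal(SIGCHLD, on_sigchld);

    pid_t pid = fork();                /* one child for the sketch */
    if (pid == 0) {                    /* child: pretend to search */
        root->child[0].proc_id = 1;
        root->child[0].nodes   = 12345;
        _exit(0);
    }
    waitpid(pid, NULL, 0);             /* parent collects the child */
    long nodes = root->child[0].nodes; /* visible through MAP_SHARED */
    shm_unlink(key);
    return nodes;
}
```

Because the mapping is `MAP_SHARED`, the child's writes are visible to the parent without any copying; real engine code would of course add mutexes or lock-free protocols on top, as the post discusses.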
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.