Author: Mridul Muralidharan
Date: 08:35:53 01/20/06
Hi Vas,

Responses at the bottom, inline.

Regards,
Mridul

On January 20, 2006 at 09:14:26, Vasik Rajlich wrote:

>On January 19, 2006 at 22:10:05, Mridul Muralidharan wrote:
>
>>On January 19, 2006 at 16:45:37, Tord Romstad wrote:
>>
>>>On January 19, 2006 at 15:02:26, Dann Corbit wrote:
>>>
>>>>On January 19, 2006 at 03:48:34, Tord Romstad wrote:
>>>>
>>>>>Why? Wouldn't a chess engine using multiple processes and
>>>>>shared memory be just as fast as a similar program with
>>>>>multiple threads on a computer with several CPUs (or CPUs
>>>>>with multiple cores)? I thought the engine using processes
>>>>>would just consume a bit more memory, and not have any
>>>>>significant disadvantages apart from that.
>>>>>
>>>>>I'm not saying that you are wrong, of course. I am totally
>>>>>ignorant about this subject, and I ask in order to learn more.
>>>>
>>>>Process startup time versus thread startup time can be significant.
>>>
>>>Yes, of course, but I don't see why you would want to start new
>>>processes except during program initialisation. It seems to me
>>>that the obvious thing to do would be to start all processes at
>>>once during program initialisation, and keep them running
>>>throughout the game. When there is no work to do, a process
>>>would simply return to some sort of idle loop.
>>>
>>>>An executable consumes more computer resources than a thread.
>>>>
>>>>It's not a stupendous difference. If you look at the efficiency of Deep Sjeng
>>>>(which uses processes), it's really not that bad. The place where it would hurt
>>>>the most is in the extremes (e.g. 8 CPUs with 4 cores each).
>>>
>>>I still haven't decided what I will use. Threads don't really
>>>look that much harder to me, but perhaps I'm missing something.
>>>I'm sure there are lots of difficulties I won't understand before
>>>I try.
>>>
>>>Tord
>>
>>
>>Hi Tord, all,
>>
>>First off, potential boring blurb ...
>>in which case, please ignore :)
>>
>>In general for chess, my personal preference is to use processes rather than threads,
>>from a KISS perspective. I am going to ignore the contentious debates about
>>performance, cache locality issues, etc. here, and just advocate based
>>on this one point ;-)
>>
>>From my experience, having very well defined and explicit points of
>>interaction between the processes, through the shared memory and mutexes, really
>>really helps.
>>It is more like having clearly defined interfaces (through data, though) - there are
>>already a sufficiently high number of possible race conditions (especially when
>>you start off) without introducing additional possibilities if you are trying
>>out DTS, etc. :-)
>>
>>It could be argued that parallelisation can be done efficiently with threads, and
>>it can be - but you do incur a more stack-based approach of shuffling
>>intermediate variables around, and you need to be very, very careful.
>>
>>Example:
>>In eval, I typically have an attack_table[2][64] initialised at the beginning (now I
>>have an incremental attack table - but the idea is the same, I would guess) and use it
>>everywhere else; ditto for other forms of info like attack_patterns,
>>king_safety, pawn_defects, etc.
>>Similarly, my pawn tables, eval hash tables, etc. would also need to be
>>lockless, or - assuming more complex/larger info storage in the tables - gated.
>>
>>If you are using threads, either you would use a stack-based approach and pass
>>these around, or you would put them in a 'this_proc' and use it from there (so
>>an extra indirection for nearly everything: this_proc->attack_table[], etc.)
>>
>>One possible downside is that EGTB probing will incur a higher penalty
>>(from what I remember - I could be wrong here), since you will not get the
>>advantage of the caching done by Eugene Nalimov's code across the various
>>processes ... this will be per process.
>>Maybe someday this will be shared memory and remove that penalty!
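[Editor's note: the indirection difference described above can be sketched in C. All names except attack_table and this_proc are invented for this illustration; the struct fields are placeholders, not the engine's real layout.]

```c
#include <stdint.h>

/* Process-per-searcher: tables are plain globals, private to each
 * process by default. Eval code accesses them directly. */
uint64_t attack_table[2][64];

int eval_process_style(int sq)
{
    return (int)(attack_table[0][sq] & 0xFF);  /* direct access */
}

/* Thread-per-searcher: each worker owns a context struct, and every
 * access pays one extra pointer hop through this_proc->... */
typedef struct thread_ctx {
    uint64_t attack_table[2][64];
    /* pawn hash, king safety scratch, etc. would also live here */
} thread_ctx;

int eval_thread_style(thread_ctx *this_proc, int sq)
{
    return (int)(this_proc->attack_table[0][sq] & 0xFF);
}
```

The compiler can often fold the extra indirection away, but the source-level clutter (and the discipline of threading the context pointer through every call) is the point being made.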
>>
>>Of course, at the end of the day, it is just how you think about the problem
>>and your solutions to tackle it - there are no HUGE drawbacks to either
>>approach unless they are specific to some platform.
>>And the problems you solve are, IMO, pretty similar in both approaches.
>>And don't forget - we should have fun solving these problems :) so use an
>>approach which suits you.
>>
>>It would be great to see a parallel Gothmog or Glaurung version out there!
>>And don't worry, there is an unjust perception that writing a parallel engine with
>>good speedup is tough - it is not really that tough, from my experience :)
>>
>>Regards,
>>Mridul
>>
>>PS:
>>Simplest way to kickstart a parallel engine?
>>1) Get/borrow a parallel box - at least a dual core.
>>
>>2) Design a parallel search on paper which can scale up to, say, 16 or 32 procs (forget
>>that you might never run on that many anytime soon) - it is better to have something
>>which can potentially scale higher than to have to rework the code after a couple of
>>years. (zappa, diep, sjeng, crafty, etc.)
>>
>>3) The way I started 3 weeks ago was to write basic make/unmake move, movegen,
>>etc. and a parallel perft, and make that work. Then build on top of it.
>>This can be easily validated for correctness, and yet should give you nearly
>>perfect linear 'time to solution' scaling.
>>
>>4) Lots of assertions - I have '#ifdef MP_DEBUG' which validates that all my locks
>>are freed/acquired in most of the relevant portions, the shared memory state, etc.:
>>this was a huge help in validating that my search was not messing up.
>>
>>5) Have lots of coffee - especially when debugging at night :)
>>
>>Now my average speedup is 1.89x in time to depth for the positions that Vincent
>>posted (Thanks Daniel!), with the worst case being 1.3x and hopefully improving (on
>>my dual core Athlon box).
>
>Mridul,
>
>this 1.89 speedup, you got this by splitting off a new process (rather than
>thread)?
Yep, I spawn (num_proc - 1) child processes initially, and they continue on till the end of the program.
Currently, I just 'brutally kill' them when the main process exits (I should move to some flag or signal mechanism which allows cleanup ... hmm, nothing to clean up for now!).
Neither do I detect the death of a child process (which can happen only through an explicit kill, since there are no major parallel bugs which could dump core, I guess) - more so since I am not sure how to do this on Windows (SIGCHLD should help on Unix): anyone have an idea how to solve this, btw? Thanks.

>
>I would be interested to hear more about this. Ie. when the process is started
>and (very roughly) what you pass to it.

Currently (like I explained to Tord), I am trying to unify both the Windows and the Unix codepaths.
So essentially, I use argv[] to pass on:
* a string indicating that it is a child process
* an opaque string (randomly generated in the parent) which is used to arrive at the various shm ids.

Currently, the shm's are laid out like this:
* A 'root_search_position' is created in the parent, and all children attach to it. This is used to hold the initial conditions of the search, search state, etc.
* Each child creates a 'search_block' which holds the state of the search for that child and some related info (like proc_id, parent_id, etc.). Once created, every other child/parent will attach to it.

So in all processes (child and parent), I have:

extern search_block **child_shm_blocks;
extern search_block *this_proc;
extern root_search_position *root_search_data;

All communication between the various processes happens through these shm blocks.

>
>A "full threaded parallel search" like in Crafty seems like a good chunk of
>work, not to mention a pain in the rear end to debug.
>
>Vas

Well, it took me three+ weeks to get the above-mentioned speedup ... I have not tried it on a quad because of the lack of availability of a machine.
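[Editor's note: a minimal sketch of the scheme described above, showing only the POSIX side (the unified Windows path would use CreateFileMapping/MapViewOfFile instead). The names search_block, root_search_position, and the opaque argv key come from the post; the shm name format, struct fields, and the map_shm helper are this sketch's assumptions, not the engine's actual code.]

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct { int proc_id; int parent_id; /* per-child search state ... */ } search_block;
typedef struct { char fen[128]; int depth;   /* initial search conditions */ } root_search_position;

search_block **child_shm_blocks;        /* one entry per child, all attached */
search_block *this_proc;                /* this process's own block */
root_search_position *root_search_data; /* created by the parent */

/* Map a shared block whose name is derived from the opaque key
 * passed via argv[]. The creator sizes it; everyone else attaches. */
void *map_shm(const char *key, const char *tag, size_t sz, int create)
{
    char name[64];
    snprintf(name, sizeof name, "/%s_%s", key, tag);
    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); exit(1); }
    if (create && ftruncate(fd, (off_t)sz) < 0) { perror("ftruncate"); exit(1); }
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    return p;
}
```

Usage would follow the post's description: the parent calls map_shm(key, "root", ..., 1) before forking, child N creates its own "childN" block, and every other process attaches to it with create=0.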
But then, I think zappa and diep get much better values, and maybe so does sjeng, from what I heard: and they are known to scale to a much higher number of procs! So there is still scope for a lot of improvements for me :-D

I think there are differences between what Crafty does and what I do - but the basic ideas would be the same, I guess: except that subtrees can move between procs in my case and not in Crafty (I hope I am correct?). From what Hyatt mentioned, I think Cray Blitz did allow this, though.

Please note, a single-proc version of my program is not tuned or tested or bug-free like most programs out here. I hack up a version just to test a specific idea - so effective use of a single proc would be enough to beat my parallel programs ;-)

Mridul
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.