Author: Robert Hyatt
Date: 08:12:40 09/04/03
On September 04, 2003 at 07:04:33, Vincent Diepeveen wrote:
>On September 03, 2003 at 16:44:10, Robert Hyatt wrote:
>
>>On September 03, 2003 at 15:35:01, Vincent Diepeveen wrote:
>>
>>>On September 03, 2003 at 14:10:53, Robert Hyatt wrote:
>>>
>>>>On September 03, 2003 at 13:33:19, Gian-Carlo Pascutto wrote:
>>>>
>>>>>On September 03, 2003 at 13:22:50, Robert Hyatt wrote:
>>>>>
>>>>>>It is a _TINY_ part of the total time spent. So tiny, it can be ignored.
>>>>>
>>>>>Que?
>>>>>
>>>>>Maybe so on an SMP quad (as I stated), but surely not on a large NUMA system.
>>>>>
>>>>>If this isn't the issue, I'd expect my thing to run like the blazes
>>>>>on a NUMA box, but I doubt I'm that lucky.
>>>>>
>>>>>--
>>>>>GCP
>>>>
>>>>
>>>>There are three things that have to be done by a thread:
>>>>
>>>>1. copy local data somewhere else for another thread to use (splitting in
>>>>crafty terminology). That happens once per "split". How many splits are done?
>>>>
>>>>Here is the data I provided in another thread here...
>>>>
>>>>  SMP-> split=6266 stop=875 data=19/64 cpu=10:00 elap=2:39
>>>>  SMP-> split=3511 stop=440 data=16/64 cpu=5:20 elap=1:27
>>>>  SMP-> split=3768 stop=524 data=17/64 cpu=5:45 elap=1:33
>>>>  SMP-> split=1724 stop=275 data=13/64 cpu=3:59 elap=1:04
>>>>  SMP-> split=4894 stop=671 data=15/64 cpu=3:55 elap=1:03
>>>>  SMP-> split=2666 stop=420 data=15/64 cpu=3:51 elap=1:02
>>>>  SMP-> split=3412 stop=683 data=17/64 cpu=3:46 elap=1:00
>>>>  SMP-> split=3447 stop=476 data=15/64 cpu=3:55 elap=1:03
>>>>  SMP-> split=2985 stop=345 data=19/64 cpu=1:13 elap=19.53
>>>>  SMP-> split=11657 stop=1620 data=23/64 cpu=3:32 elap=58.12
>>>>  SMP-> split=1928 stop=292 data=17/64 cpu=3:24 elap=57.08
>>>>  SMP-> split=53912 stop=6999 data=30/64 cpu=32:06 elap=8:42
>>>>  SMP-> split=9997 stop=1209 data=23/64 cpu=3:31 elap=56.69
>>>>  SMP-> split=2966 stop=527 data=19/64 cpu=3:28 elap=55.49
>>>>
>>>>Worst case was 54000 splits for a 9 minute long search. Using 4 processors.
>>>>More typical seems to be about 500 splits per minute of search. That is
>>>>not much time.
>>>
>>>05:11 <nps censored> 0 0 487383460 (130) 14 (85565,1592299) 0.001 d2-d4 Ng8-f6
>>>Ng1-f3 d7-d5 Bc1-f4 e7-e6 e2-e3 Bf8-d6 Bf1-e2 Nb8-c6 O-O Bd6xf4 e3xf4 O-O
>>>
>>>1.59MLN splits / 311 seconds = 5119 splits a second
>>>Or that's 39 splits a second a processor.
>>>
>>>Of course in crafty you limit the number of splits bigtime by the conditions
>>>used.
>>
>>Yes, but I can still drive 4 cpus to good utilization. If I limit splits,
>>that utilization goes down significantly. Some samples (Notice I _always_
>>give real data rather than waving my hands):
>>
>>split with N plies remaining    cpu utilization    elapsed time
>>   N      splits done
>>
>>   1         10203                   395%              30.3s
>>   2          5017                   386%              28.1s
>>   3          1982                   385%              28.2s
>>   4          1011                   385%              29.7s
>>   5           677                   377%              27.8s
>>   6           352                   364%              28.5s
>>
>>The fastest setting for this particular position seems to be smpmin=5,
>>where the default is 4. But over many tests, smpmin=4 seems to be the
>>right value for this version of crafty, this hardware.
>>
>>I hardly call that "limiting the number of splits done big-time". N=1
>>means I split at the last ply and call q-search in parallel. N=2 means I
>>split at the next to last ply. Ditto through N=6. This was a search to
>>depth=13 in a middlegame position, for reference.
>>
>>
>>
>>>
>>>But the more splits a second the better the speedup according to my
>>>measurements.
>>
>>Depends. Splits near the tip are not as good as move ordering at the
>>tips is worse than move ordering farther up into the tree.
>>
>>>
>>>When i split dual 10 times a second a cpu, then the speedup is like 1.7 like
>>>crafty.
>>
>>That's the first time I have seen you use 1.7 for Crafty. Usually it is
>>1.0 or 1.2. Finally coming back to the real world after testing some?
>
>it depends upon how you measure.
>
>you always have stuff that cripples its play but is good for speedup.
>remember that i tested crafty at a dual k7 with asymmetric king safety turned
>off.
>

SO? asymmetric king safety does _not_ make it run better or worse in parallel.
Perhaps for a position or two here and there, one or the other is much better.
But overall, no. I've _already_ run that test and posted the results.

>if you would print out the objective node counts at each main variation then we
>directly know where we talk about. Crafty search is too inefficient to take its
>parallel search serious.

Works better than yours however. _I_ don't have mysterious crashes and bugs
all over the place. That "inefficient search" sure seems to give you plenty
of problems on ICC.

>
>>
>>
>>
>>
>>>
>>>When i let diep split 30 or 40 times a second then speedup is 1.9 to 2.0
>>
>>
>>Or > 2, no doubt if you split 100 times a second... :)
>>
>>
>>
>>>
>>>Thank you,
>>>Vincent
>>>
>>>>2. Search. Here I only do local memory accesses, so there is just normal
>>>>tree search overhead, nothing related to NUMA.
>>>>
>>>>3. completion. Here I have to either copy a score/PV or just score back to
>>>>the parent thread data or set a "stop" flag to say my result is good enough,
>>>>no others are needed. Either of these is a trivial amount of non-local memory
>>>>traffic.
>>>>
>>>>If you do that right, NUMA should not hurt. The issue is going to become
>>>>how to use a large number of processors, which is much harder to do than
>>>>to use a small number as we are today.
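
For anyone who wants to check the split rates argued about above, here is a rough Python sketch that parses a few of the quoted "SMP->" lines and works out splits per second. It is not part of Crafty or Diep. The helper names (to_seconds, parse, splits_per_second, cpu_ratio) are made up for illustration, and two things are assumptions on my part: that cpu/elap is a fair stand-in for the number of busy processors, and that the "(130)" in the Diep output line is the processor count.

```python
import re

# Three of the "SMP->" lines quoted in the post, copied verbatim.
LOG = """\
SMP-> split=6266 stop=875 data=19/64 cpu=10:00 elap=2:39
SMP-> split=53912 stop=6999 data=30/64 cpu=32:06 elap=8:42
SMP-> split=2985 stop=345 data=19/64 cpu=1:13 elap=19.53
"""


def to_seconds(text):
    """Convert 'm:ss' or plain seconds (e.g. '19.53') to seconds.

    The two formats are inferred from the quoted log, not from any spec.
    """
    if ":" in text:
        minutes, seconds = text.split(":")
        return int(minutes) * 60 + float(seconds)
    return float(text)


def parse(line):
    """Pull the key=value fields out of one 'SMP->' line."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    return {
        "split": int(fields["split"]),
        "stop": int(fields["stop"]),
        "cpu": to_seconds(fields["cpu"]),
        "elap": to_seconds(fields["elap"]),
    }


def splits_per_second(rec):
    """Splits per second of elapsed (wall-clock) time."""
    return rec["split"] / rec["elap"]


def cpu_ratio(rec):
    """cpu time / elapsed time; assumed to approximate busy processors."""
    return rec["cpu"] / rec["elap"]


records = [parse(l) for l in LOG.splitlines() if l.startswith("SMP->")]
for rec in records:
    print(f"{rec['split']:6d} splits, "
          f"{splits_per_second(rec):7.1f} splits/s, "
          f"cpu/elap = {cpu_ratio(rec):.2f}")

# The Diep figures from the post: 1592299 splits in 311 seconds.
# ASSUMPTION: 130 is the number of processors.
diep_per_second = 1592299 / 311
diep_per_cpu = diep_per_second / 130
print(f"diep: {diep_per_second:.0f} splits/s, {diep_per_cpu:.1f} per processor")
```

The Diep numbers come out within rounding of the 5119 and 39 quoted above, which is at least consistent with reading 130 as the processor count.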
Last modified: Thu, 15 Apr 21 08:11:13 -0700