Author: Vincent Diepeveen
Date: 04:04:33 09/04/03
On September 03, 2003 at 16:44:10, Robert Hyatt wrote:

>On September 03, 2003 at 15:35:01, Vincent Diepeveen wrote:
>
>>On September 03, 2003 at 14:10:53, Robert Hyatt wrote:
>>
>>>On September 03, 2003 at 13:33:19, Gian-Carlo Pascutto wrote:
>>>
>>>>On September 03, 2003 at 13:22:50, Robert Hyatt wrote:
>>>>
>>>>>It is a _TINY_ part of the total time spent. So tiny, it can be ignored.
>>>>
>>>>Que?
>>>>
>>>>Maybe so on an SMP quad (as I stated), but surely not on a large NUMA system.
>>>>
>>>>If this isn't the issue, I'd expect my thing to run like the blazes
>>>>on a NUMA box, but I doubt I'm that lucky.
>>>>
>>>>--
>>>>GCP
>>>
>>>
>>>There are three things that have to be done by a thread:
>>>
>>>1. copy local data somewhere else for another thread to use (splitting in
>>>crafty terminology). That happens once per "split". How many splits are done?
>>>
>>>Here is the data I provided in another thread here...
>>>
>>> SMP-> split=6266 stop=875 data=19/64 cpu=10:00 elap=2:39
>>> SMP-> split=3511 stop=440 data=16/64 cpu=5:20 elap=1:27
>>> SMP-> split=3768 stop=524 data=17/64 cpu=5:45 elap=1:33
>>> SMP-> split=1724 stop=275 data=13/64 cpu=3:59 elap=1:04
>>> SMP-> split=4894 stop=671 data=15/64 cpu=3:55 elap=1:03
>>> SMP-> split=2666 stop=420 data=15/64 cpu=3:51 elap=1:02
>>> SMP-> split=3412 stop=683 data=17/64 cpu=3:46 elap=1:00
>>> SMP-> split=3447 stop=476 data=15/64 cpu=3:55 elap=1:03
>>> SMP-> split=2985 stop=345 data=19/64 cpu=1:13 elap=19.53
>>> SMP-> split=11657 stop=1620 data=23/64 cpu=3:32 elap=58.12
>>> SMP-> split=1928 stop=292 data=17/64 cpu=3:24 elap=57.08
>>> SMP-> split=53912 stop=6999 data=30/64 cpu=32:06 elap=8:42
>>> SMP-> split=9997 stop=1209 data=23/64 cpu=3:31 elap=56.69
>>> SMP-> split=2966 stop=527 data=19/64 cpu=3:28 elap=55.49
>>>
>>>Worst case was 54000 splits for a 9 minute long search. Using 4 processors.
>>>More typical seems to be about 500 splits per minute of search. That is
>>>not much time.
>>
>>05:11 <nps censored> 0 0 487383460 (130) 14 (85565,1592299) 0.001 d2-d4 Ng8-f6
>>Ng1-f3 d7-d5 Bc1-f4 e7-e6 e2-e3 Bf8-d6 Bf1-e2 Nb8-c6 O-O Bd6xf4 e3xf4 O-O
>>
>>1.59MLN splits / 311 seconds = 5119 splits a second
>>Or that's 39 splits a second a processor.
>>
>>Of course in crafty you limit the number of splits bigtime by the conditions
>>used.
>
>Yes, but I can still drive 4 cpus to good utilization. If I limit splits,
>that utilization goes down significantly. Some samples (Notice I _always_
>give real data rather than waving my hands):
>
>split with N plies remaining   cpu utilization   elapsed time
>     N      splits done
>
>     1         10203                395%            30.3s
>     2          5017                386%            28.1s
>     3          1982                385%            28.2s
>     4          1011                385%            29.7s
>     5           677                377%            27.8s
>     6           352                364%            28.5s
>
>The fastest setting for this particular position seems to be smpmin=5,
>where the default is 4. But over many tests, smpmin=4 seems to be the
>right value for this version of crafty, this hardware.
>
>I hardly call that "limiting the number of splits done big-time". N=1
>means I split at the last ply and call q-search in parallel. N=2 means I
>split at the next to last ply. Ditto through N=6. This was a search to
>depth=13 in a middlegame position, for reference.
>
>
>
>>
>>But the more splits a second the better the speedup according to my
>>measurements.
>
>Depends. Splits near the tip are not as good as move ordering at the
>tips is worse than move ordering farther up into the tree.
>
>>
>>When i split dual 10 times a second a cpu, then the speedup is like 1.7 like
>>crafty.
>
>That's the first time I have seen you use 1.7 for Crafty. Usually it is
>1.0 or 1.2. Finally coming back to the real world after testing some?

It depends on how you measure. You always have stuff that cripples its play but
is good for speedup. Remember that I tested Crafty on a dual K7 with asymmetric
king safety turned off. If you would print out the objective node counts at each
main variation, then we would know directly what we are talking about.
Crafty's search is too inefficient to take its parallel search seriously.

>
>
>
>
>>
>>When i let diep split 30 or 40 times a second then speedup is 1.9 to 2.0
>
>
>Or > 2, no doubt if you split 100 times a second... :)
>
>
>
>>
>>Thank you,
>>Vincent
>>
>>>2. Search. Here I only do local memory accesses, so there is just normal
>>>tree search overhead, nothing related to NUMA.
>>>
>>>3. completion. Here I have to either copy a score/PV or just score back to
>>>the parent thread data or set a "stop" flag to say my result is good enough,
>>>no others are needed. Either of these is a trivial amount of non-local memory
>>>traffic.
>>>
>>>If you do that right, NUMA should not hurt. The issue is going to become
>>>how to use a large number of processors, which is much harder to do than
>>>to use a small number as we are today.
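
For anyone who wants to check the split-rate arithmetic quoted in this thread, here is a minimal sketch in Python. It assumes the "(130)" in the Diep output line is the processor count (which matches the quoted 39 splits a second a processor) and that 05:11 is elapsed time. The Crafty figures come from the worst-case SMP row on 4 processors. The helper names parse_clock and splits_per_second are made up for illustration and come from neither program.

```python
def parse_clock(text):
    """Convert 'h:mm:ss', 'm:ss' or plain seconds such as '19.53' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def splits_per_second(splits, elapsed, processors=1):
    """Average number of splits per second of elapsed time, per processor."""
    return splits / parse_clock(elapsed) / processors


# Diep output line: 1592299 splits in 05:11 (311 seconds).
# Assumption: "(130)" is the number of processors.
diep_total = splits_per_second(1592299, "05:11")
diep_per_cpu = splits_per_second(1592299, "05:11", processors=130)

# Crafty worst case from the SMP table: split=53912, elap=8:42, 4 processors.
crafty_total = splits_per_second(53912, "8:42")
crafty_per_cpu = splits_per_second(53912, "8:42", processors=4)

print(f"Diep:   {diep_total:.1f} splits/s total, {diep_per_cpu:.1f} per processor")
print(f"Crafty: {crafty_total:.1f} splits/s total, {crafty_per_cpu:.1f} per processor")
```

Truncating the Diep results gives the 5119 and 39 quoted above. The same two helpers can be applied to any other row of the SMP table by passing its split count and elap value.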
Last modified: Thu, 15 Apr 21 08:11:13 -0700