Author: Vasik Rajlich
Date: 02:31:53 04/17/04
On April 16, 2004 at 10:08:34, Vincent Diepeveen wrote:

>On April 16, 2004 at 05:47:42, Vasik Rajlich wrote:
>
>>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>>
>>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>>
>>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>>
>>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>>
>>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>>
>>>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>>
>>>>>>>Point 1. HT now speeds Crafty up between 5 and 10% max. A year ago this was
>>>>>>>30%. What did I learn? Nothing new. Memory waits benefit HT. Eugene and I
>>>>>>>worked on removing several shared memory interactions which led to better cache
>>>>>>>utilization, less cache invalidates (very slow) and improved performance a good
>>>>>>>bit. But at the same time, now HT doesn't have the excessive memory waits it
>>>>>>>had before and so the speedup is not as good.
>>>>>>>
>>>>>>>Point 2. HT now actually slows things down due to SMP overhead. IE I lose 30%
>>>>>>>per CPU, roughly, due to SMP overhead. HT now only gives 5-10% back. This is a
>>>>>>>net loss. I am now running my dual with HT disabled...
>>>>>>>
>>>>>>>More as I get more data... Here is two data points however:
>>>>>>>
>>>>>>>pos1. cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>>>      cpus=4          NPS = 2.08M  time=28.76
>>>>>>>
>>>>>>>pos2. cpus=2          NPS = 1.87M  time=58.48
>>>>>>>      cpus=4          NPS = 2.01M  time=66.00
>>>>>>>
>>>>>>>First pos HT helps almost none in NPS, costs 10 seconds in search overhead.
>>>>>>>Ugly. Position 2 gives about 5% more nps, but again the SMP overhead washes
>>>>>>>that out and there is a net loss. I should run the speedup tests several times,
>>>>>>>but the NPS numbers don't change much, and the speedup could change. But this
>>>>>>>offers enough..
>>>>>>
>>>>>>
>>>>>>In a german Board someone posted figures for the Fritzmark of Fritz 8. Fritz
>>>>>>gains still 25% from HT (in this specific position)
>>>>>>
>>>>>>cpus=2 NPS = 2.35
>>>>>>cpus=4 NPS = 2,95
>>>>>>
>>>>>>I have unfortunately no information about search time.
>>>>>>
>>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>>
>>>>>>regards Joachim
>>>>>
>>>>>
>>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>>
>>>>Not at all.
>>>>
>>>>Fritz is p4 hand optimized assembly currently. I expect him to work hard on an
>>>>opteron hand optimized assembly version from fritz now (probably already 1 year
>>>>working at it by now).
>>>
>>>Sorry, but you should stick to topics you know something about. SMT works best
>>>in programs where there are memory reads/writes that stall a thread. As you
>>>work out those stalls, SMT pays off less gain. My current numbers clearly show
>>>this as opposed to the numbers I (and others) posted when I first got my SMT
>>>box...
>>>
>>>>
>>>>A possibility could be that last years Fritz evaluation function has become so
>>>>much slower than it was that it has most likely a need for an eval hashtable,
>>>>just like i use in DIEP already for many years.
>>>>
>>>>My guess is that it just uses more hashtables than crafty. Crafty isn't probing
>>>>in qsearch for example. DIEP is. Diep's doing a lot of more stuff in qsearch
>>>>than crafty. So using a transposition table there makes a lot more sense.
>>>
>>>That is possible. However, as I said, it is a trade-off. I took hash out of
>>>q-search and it was perfectly break-even. Tree grew a bit but the search got
>>>proportionally faster. No gain or loss. Yet it results in lower bandwidth and
>>>with the PIV long cache line, it is probably (at least for Crafty) better than a
>>>break-even deal today.
>>>
>>>>
>>>>All commercial programs that i know (junior's search is so different that i
>>>>would bet it is not the case with junior) are doing checks in qsearch.
>>>
>>>But he does not even hash probe in last ply of normal search..
>>>
>>>And it appears he has no q-search.
>>>
>>
>>Why do you say this?
>
>I drew that conclusion a few years ago. It doesn't need to be the case nowadays
>in junior.
>
>>I guess the only alternative to q-search is some sort of an SEE at depth == 0.
>>Or is there some other possibility?
>
>Suppose last few plies you just do a tactical verification search or whatever
>and that you rely upon piece square tables.
>
>You can do a slow 'makemove' of course and then evaluate.
>
>You can also throw away the entire qsearch and make a small list of attacked
>pieces for white and attacked pieces of black.
>
>Then return the evaluation + canwin(side);
>
>Keep the canwin function simple.
>
>That's very quick.
>
>By the way this is already in a book i read from Jaap v/d Herik. Written around
>1984 or so...

I think the right way to handle this should be sometimes canwin, sometimes
q-search, depending on the position.

If the question is canwin vs q-search, then in my testing q-search is the big
winner. I don't know if it's because q-search is tactically stronger, or because
q-search evaluates the actual positions you get.

Vas

>
>>>>
>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>
>>>>Evaltable would be my guess though. Doing a random lookup to a big hashtable at
>>>>a 400Mhz dual Xeon costs when it is not in the cache around 400ns.
>>>>
>>>
>>>
>>>
>>>Depends. Use big memory pages and it costs 150 ns. No TLB thrashing then.
>>>
>>
>>Based on some reading that I did on the k8, it seemed that a memory lookup was
>>around 150 cycles there. (And 250 cycles on k7.)
>>
>>Did I misunderstand? Or does this number change when you use multiple
>>processors?
>>
>>If so, then hashing should be done differently on multiple processors than on
>>single processors. For example, ETC would behave differently.
>>
>>Vas
>>
>>>
>>>
>>>>That's even at 3Ghz just 1200 cycles.
>>>>
>>>>1 node on average costs assuming 1 mln nps at 3Ghz : 3000 cycles.
>>>>
>>>>Vast majority of nodes do not get evaluated at all of course.
>>>>
>>>>That shows that Fritz' eval needs a multiple of that for evaluation nowadays.
>>>>
>>>>When not storing eval in transpositiontable but only in a special eval table,
>>>>that will give a >= 50% lookuprate at evaltable (more likely 60%).
>>>>
>>>>So it makes sense to use an eval table for Fritz.
>>>>
>>>>Something crafty doesn't need as its eval is smaller than tiny.
>>>
>>>small != bad, however.
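
To make the canwin vs q-search comparison in this post a bit more concrete, here is a rough Python sketch on a toy position tree. Everything in it is an assumption for illustration: the Pos class, the best_attacked field and the piece values are made up, and canwin_eval is only a guess at the kind of simple static bonus Vincent describes (he gives no details of his canwin). It is not code from any of the engines discussed.

```python
from dataclasses import dataclass, field
from typing import List

INF = 10**9

@dataclass
class Pos:
    # Hypothetical toy position, not a real board representation.
    static_eval: int                  # score from the side to move's point of view
    best_attacked: int = 0            # value of the best enemy piece we attack (assumed input)
    captures: List["Pos"] = field(default_factory=list)  # positions after each capture

def canwin_eval(pos: Pos) -> int:
    # Assumed form of "evaluation + canwin(side)": add the value of the
    # most valuable attacked piece without making any move.
    return pos.static_eval + pos.best_attacked

def qsearch(pos: Pos, alpha: int, beta: int) -> int:
    # Plain fail-soft quiescence search: stand pat, then try captures only.
    best = pos.static_eval
    if best >= beta:
        return best
    alpha = max(alpha, best)
    for child in pos.captures:
        score = -qsearch(child, -beta, -alpha)
        if score > best:
            best = score
        if score >= beta:
            break
        alpha = max(alpha, score)
    return best

# Rook (500) can take a defended queen (900); the rook is then recaptured.
after_recapture = Pos(static_eval=400)                       # we are up 900 - 500
after_capture = Pos(static_eval=-900, captures=[after_recapture])
root = Pos(static_eval=0, best_attacked=900, captures=[after_capture])

print(canwin_eval(root))         # static estimate
print(qsearch(root, -INF, INF))  # score of the position actually reached
```

In this toy case the static estimate credits the whole queen, while q-search scores the position after the recapture. That is only one contrived example of the "evaluates the actual positions you get" point, not evidence about which approach is stronger in a real engine.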
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.