Author: Vincent Diepeveen
Date: 07:13:16 04/16/04
On April 16, 2004 at 09:19:27, Anthony Cozzie wrote:

>On April 16, 2004 at 05:47:42, Vasik Rajlich wrote:
>
>>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>>
>>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>>
>>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>>
>>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>>
>>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>>
>>>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>>
>>>>>>>Point 1. HT now speeds Crafty up between 5 and 10% max. A year ago this was
>>>>>>>30%. What did I learn? Nothing new. Memory waits benefit HT. Eugene and I
>>>>>>>worked on removing several shared memory interactions which led to better cache
>>>>>>>utilization, less cache invalidates (very slow) and improved performance a good
>>>>>>>bit. But at the same time, now HT doesn't have the excessive memory waits it
>>>>>>>had before and so the speedup is not as good.
>>>>>>>
>>>>>>>Point 2. HT now actually slows things down due to SMP overhead. IE I lose 30%
>>>>>>>per CPU, roughly, due to SMP overhead. HT now only gives 5-10% back. This is a
>>>>>>>net loss. I am now running my dual with HT disabled...
>>>>>>>
>>>>>>>More as I get more data... Here is two data points however:
>>>>>>>
>>>>>>>pos1.  cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>>>       cpus=4          NPS = 2.08M  time=28.76
>>>>>>>
>>>>>>>pos2.  cpus=2          NPS = 1.87M  time=58.48
>>>>>>>       cpus=4          NPS = 2.01M  time=66.00
>>>>>>>
>>>>>>>First pos HT helps almost none in NPS, costs 10 seconds in search overhead.
>>>>>>>Ugly. Position 2 gives about 5% more nps, but again the SMP overhead washes
>>>>>>>that out and there is a net loss. I should run the speedup tests several times,
>>>>>>>but the NPS numbers don't change much, and the speedup could change. But this
>>>>>>>offers enough..
>>>>>>
>>>>>>
>>>>>>On a German board someone posted figures for the Fritzmark of Fritz 8. Fritz
>>>>>>still gains 25% from HT (in this specific position)
>>>>>>
>>>>>>cpus=2 NPS = 2.35
>>>>>>cpus=4 NPS = 2.95
>>>>>>
>>>>>>I have unfortunately no information about search time.
>>>>>>
>>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>>
>>>>>>regards Joachim
>>>>>
>>>>>
>>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>>
>>>>Not at all.
>>>>
>>>>Fritz is p4 hand optimized assembly currently. I expect him to work hard on an
>>>>opteron hand optimized assembly version of fritz now (probably already 1 year
>>>>working at it by now).
>>>
>>>Sorry, but you should stick to topics you know something about. SMT works best
>>>in programs where there are memory reads/writes that stall a thread. As you
>>>work out those stalls, SMT pays off less gain. My current numbers clearly show
>>>this as opposed to the numbers I (and others) posted when I first got my SMT
>>>box...
>>>
>>>>
>>>>A possibility could be that last year's Fritz evaluation function has become so
>>>>much slower than it was that it has most likely a need for an eval hashtable,
>>>>just like i use in DIEP already for many years.
>>>>
>>>>My guess is that it just uses more hashtables than crafty. Crafty isn't probing
>>>>in qsearch for example. DIEP is. Diep's doing a lot more stuff in qsearch
>>>>than crafty. So using a transposition table there makes a lot more sense.
>>>
>>>That is possible. However, as I said, it is a trade-off. I took hash out of
>>>q-search and it was perfectly break-even. Tree grew a bit but the search got
>>>proportionally faster. No gain or loss. Yet it results in lower bandwidth and
>>>with the PIV long cache line, it is probably (at least for Crafty) better than a
>>>break-even deal today.
>>>
>>>>
>>>>All commercial programs that i know (junior's search is so different that i
>>>>would bet it is not the case with junior) are doing checks in qsearch.
>>>
>>>But he does not even hash probe in last ply of normal search..
>>>
>>>And it appears he has no q-search.
>>>
>>
>>Why do you say this?
>>
>>I guess the only alternative to q-search is some sort of an SEE at depth == 0.
>>Or is there some other possibility?
>>
>>>>
>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>
>>>>Evaltable would be my guess though. Doing a random lookup to a big hashtable at
>>>>a 400Mhz dual Xeon costs when it is not in the cache around 400ns.
>>>>
>>>
>>>
>>>
>>>Depends. Use big memory pages and it costs 150 ns. No TLB thrashing then.
>>>
>>
>>Based on some reading that I did on the k8, it seemed that a memory lookup was
>>around 150 cycles there. (And 250 cycles on k7.)
>>
>>Did I misunderstand? Or does this number change when you use multiple
>>processors?
>>
>>If so, then hashing should be done differently on multiple processors than on
>>single processors. For example, ETC would behave differently.
>>
>>Vas
>
>Memory latency is something like 100-150 ns on Athlon XP. The problem is when
>you miss in the TLB: then you have to do two more memory lookups to find the
>stuff you wanted in the first place.

So in total the memory latency to get hashtable entries is 400 ns on dual K7 MP
mainboards with 133Mhz. Same for dual P4's. The single cpu P4's are around
280 ns, and opteron i didn't accurately measure yet. I only have an A64 here at
the office.

>Opteron latency varies depending if it is "close" memory or not.
>anthony
>
>>>
>>>
>>>>That's even at 3Ghz just 1200 cycles.
>>>>
>>>>1 node on average costs assuming 1 mln nps at 3Ghz : 3000 cycles.
>>>>
>>>>Vast majority of nodes do not get evaluated at all of course.
>>>>
>>>>That shows that Fritz' eval needs a multiple of that for evaluation nowadays.
>>>>
>>>>When not storing eval in transpositiontable but only in a special eval table,
>>>>that will give a >= 50% lookuprate at evaltable (more likely 60%).
>>>>
>>>>So it makes sense to use an eval table for Fritz.
>>>>
>>>>Something crafty doesn't need as its eval is smaller than tiny.
>>>
>>>small != bad, however.
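
For readers who have not seen one, here is a rough sketch in C of the kind of eval table discussed above: a small direct-mapped cache keyed by the position hash, probed before the expensive evaluation runs. Everything in it is an assumption made for illustration (the names eval_cache_probe, eval_cache_store and evaluate, the table size, the always-replace policy, and the slow_eval stand-in); it is not the code of DIEP, Fritz or Crafty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size: 2^16 entries. A real engine would tune this. */
#define EVAL_CACHE_BITS 16
#define EVAL_CACHE_SIZE ((size_t)1 << EVAL_CACHE_BITS)

typedef struct {
    uint64_t key;    /* full position hash, kept to verify the slot */
    int32_t  score;  /* cached static evaluation */
    bool     used;   /* false until the slot has been written once */
} EvalEntry;

static EvalEntry eval_cache[EVAL_CACHE_SIZE];

/* Direct-mapped: the low bits of the hash select the slot. */
static size_t eval_slot(uint64_t key)
{
    return (size_t)(key & (EVAL_CACHE_SIZE - 1));
}

/* Returns true and fills *score on a hit, false on a miss. */
bool eval_cache_probe(uint64_t key, int32_t *score)
{
    assert(score != NULL);
    const EvalEntry *e = &eval_cache[eval_slot(key)];
    if (e->used && e->key == key) {
        *score = e->score;
        return true;
    }
    return false;
}

/* Always-replace policy: the newest position overwrites the slot. */
void eval_cache_store(uint64_t key, int32_t score)
{
    EvalEntry *e = &eval_cache[eval_slot(key)];
    e->key = key;
    e->score = score;
    e->used = true;
}

/* Stand-in for an expensive evaluation; it only counts its calls. */
static long slow_eval_calls = 0;

static int32_t slow_eval(uint64_t key)
{
    slow_eval_calls++;
    return (int32_t)(key % 2001) - 1000;
}

/* Probe first; run the slow evaluation only on a miss. */
int32_t evaluate(uint64_t key)
{
    int32_t score;
    if (eval_cache_probe(key, &score))
        return score;
    score = slow_eval(key);
    eval_cache_store(key, score);
    return score;
}
```

Whether such a table pays off depends on the numbers argued over in the thread: a probe that misses the CPU cache costs one memory latency, so it only wins when the evaluation it replaces is clearly more expensive than that and the hit rate is high enough.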
Last modified: Thu, 15 Apr 21 08:11:13 -0700