Author: Anthony Cozzie
Date: 09:00:19 04/16/04
On April 16, 2004 at 10:05:04, Vincent Diepeveen wrote:

>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>
>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>
>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>
>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>
>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>
>>>>>>I just finished some HT-on / HT-off tests to see how things have changed in
>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>
>>>>>>Point 1. HT now speeds Crafty up between 5 and 10% at most. A year ago this
>>>>>>was 30%. What did I learn? Nothing new. Memory waits benefit HT. Eugene and
>>>>>>I worked on removing several shared-memory interactions, which led to better
>>>>>>cache utilization, fewer cache invalidates (very slow), and improved
>>>>>>performance a good bit. But at the same time, HT no longer has the excessive
>>>>>>memory waits it had before, so the speedup is not as good.
>>>>>>
>>>>>>Point 2. HT now actually slows things down due to SMP overhead. I.e., I lose
>>>>>>roughly 30% per CPU to SMP overhead, and HT now only gives 5-10% back. This
>>>>>>is a net loss. I am now running my dual with HT disabled...
>>>>>>
>>>>>>More as I get more data... Here are two data points, however:
>>>>>>
>>>>>>pos1. cpus=2 (no HT) NPS = 2.07M time=18.13
>>>>>>      cpus=4         NPS = 2.08M time=28.76
>>>>>>
>>>>>>pos2. cpus=2         NPS = 1.87M time=58.48
>>>>>>      cpus=4         NPS = 2.01M time=66.00
>>>>>>
>>>>>>In the first position HT helps almost none in NPS and costs 10 seconds in
>>>>>>search overhead. Ugly. Position 2 gives about 5% more NPS, but again the SMP
>>>>>>overhead washes that out and there is a net loss. I should run the speedup
>>>>>>tests several times; the NPS numbers don't change much, but the speedup
>>>>>>could. Still, this offers enough...
>>>>>
>>>>>On a German board someone posted figures for the Fritzmark of Fritz 8. Fritz
>>>>>still gains 25% from HT (in this specific position):
>>>>>
>>>>>cpus=2 NPS = 2.35
>>>>>cpus=4 NPS = 2.95
>>>>>
>>>>>Unfortunately, I have no information about search time.
>>>>>
>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>
>>>>>regards Joachim
>>>>
>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>
>>>Not at all.
>>>
>>>Fritz is currently P4 hand-optimized assembly. I expect him to be working hard
>>>on an Opteron hand-optimized assembly version of Fritz now (probably already a
>>>year into it by now).
>>
>>Sorry, but you should stick to topics you know something about. SMT works best
>
>I guess this is your way of saying: "sorry, I did not consider that it was a
>more efficient program than Crafty, and that the better SMT gain was caused by
>more hash lookups than I had taken into account could be profitable".
>
>>in programs where there are memory reads/writes that stall a thread. As you
>>work out those stalls, SMT pays off with less gain. My current numbers clearly
>>show this, as opposed to the numbers I (and others) posted when I first got my
>>SMT box...
>
>You do 1 lookup to RAM. He's doing perhaps 3 lookups.
>
>You should do your math better before commenting on Fritz being inefficiently
>programmed.
>
>It was a grammar-school conclusion you drew.
>
>>>A possibility could be that last year's Fritz evaluation function has become
>>>so much slower that it most likely needs an eval hashtable, just like the one
>>>I have used in DIEP for many years already.
>>>
>>>My guess is that it just uses more hashtables than Crafty. Crafty isn't
>>>probing in qsearch, for example. DIEP is. DIEP does a lot more work in
>>>qsearch than Crafty, so using a transposition table there makes a lot more
>>>sense.
>>
>>That is possible. However, as I said, it is a trade-off. I took hashing out of
>>q-search and it was perfectly break-even. The tree grew a bit but the search got
>
>Now start doing checks in qsearch and prune... ...note that you write the
>opposite here of the first sentence in your posting.
>
>>proportionally faster. No gain or loss. Yet it results in lower bandwidth, and
>>with the P4's long cache line it is probably (at least for Crafty) better than
>>a break-even deal today.
>
>DIEP effectively loses 20% speed to hashtable lookups in qsearch, but the net
>result is that it is about 20% faster than not doing them. So the savings must
>be around 40%, roughly.
>
>>>All commercial programs that I know of (Junior's search is so different that
>>>I would bet it is not the case with Junior) do checks in qsearch.
>>
>>But he does not even hash-probe in the last ply of normal search..
>>And it appears he has no q-search.
>
>Junior is a very different case indeed. My guess would be that they do without
>qsearch but use some dumb static exchange function.
>
>They never commented here upon it.
>
>I consider their search outdated.
>
>>>So a possible alternative to an eval table would be hashing in qsearch.
>>>
>>>An eval table would be my guess, though. Doing a random lookup into a big
>>>hashtable on a 400 MHz dual Xeon costs around 400 ns when it is not in the
>>>cache.
>>
>>Depends. Use big memory pages and it costs 150 ns. No TLB thrashing then.
>
>We all know that when things are in L2 cache it's faster. However, when using
>hashtables you are by definition busy with TLB thrashing.
>
>Use big hashtables, start doing lookups into your hashtable, and it costs on
>average 400 ns on a dual K7 and your dual Xeon. Saying that under conditions A
>and B, which hardly ever happen, it is 150 ns makes no sense. When it's in L2
>cache on an Opteron it costs just 13 cycles. Same type of comparison.
>
>The 400 ns is just for 8 bytes. It's slower when you fetch more bytes...

Bob is talking about Linux's hugetlbfs. The x86 spec includes 4 MB pages
(usually used for the OS). With the larger pages you can cover a lot more
memory in the TLB and therefore miss a lot less. Unfortunately, the Opteron
appears to have only 8 TLB entries available for the huge pages, but I think
the P4 can make any entry a 4 MB (as opposed to the usual 4 KB) page.

anthony

>>>That's even at 3 GHz just 1200 cycles.
>>>
>>>One node on average costs, assuming 1 million NPS at 3 GHz, 3000 cycles.
>>>
>>>The vast majority of nodes do not get evaluated at all, of course.
>>>
>>>That shows that Fritz's eval nowadays needs a multiple of that for
>>>evaluation.
>>>
>>>When not storing eval in the transposition table but only in a special eval
>>>table, that will give a >= 50% lookup rate at the eval table (more likely
>>>60%).
>>>
>>>So it makes sense to use an eval table for Fritz.
>>>
>>>Something Crafty doesn't need, as its eval is smaller than tiny.
>>
>>small != bad, however.
>
>And that from an American mouth :)
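[Editor's note] The hugetlbfs mechanism Anthony mentions can be sketched in C. This is a minimal sketch, not code from any engine: the mount point `/mnt/huge` and file name are assumptions (any hugetlbfs mount works), and it falls back to a normal allocation when huge pages are unavailable:

```c
/* Sketch: put a hash table in 4 MB huge pages via Linux's hugetlbfs.
 * Assumes hugetlbfs is mounted at /mnt/huge (an assumption, not from
 * the thread).  The request size should be a multiple of the huge
 * page size.  Falls back to malloc() (ordinary 4 KB pages) on error. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

static void *alloc_hash(size_t bytes)
{
    int fd = open("/mnt/huge/hashfile", O_CREAT | O_RDWR, 0600);
    if (fd >= 0) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        if (p != MAP_FAILED)
            return p;        /* backed by 4 MB pages: far fewer TLB misses */
    }
    return malloc(bytes);    /* fallback: normal small pages */
}

int main(void)
{
    /* 64 MB transposition table = sixteen 4 MB pages, so the whole
     * table fits in a handful of TLB entries. */
    void *hash = alloc_hash((size_t)64 << 20);
    printf("hash table at %p\n", hash);
    free(hash);              /* valid only for the malloc fallback path */
    return 0;
}
```

The point of the exercise is exactly the one in the quoted exchange: with 4 KB pages a random probe into a multi-megabyte table almost always misses the TLB, while sixteen 4 MB pages can stay resident.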
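[Editor's note] The eval-hashtable idea Vincent describes for DIEP — probe a small table keyed by the position's Zobrist hash before running the full, expensive evaluation — might look roughly like this. All names and sizes here are illustrative, not DIEP's actual code, and `eval_full` is a dummy stand-in for a real evaluation function:

```c
/* Sketch of a direct-mapped evaluation cache.  Illustrative only. */
#include <stdint.h>

#define EVAL_CACHE_BITS 16                     /* 64K entries for the sketch */
#define EVAL_CACHE_SIZE (1u << EVAL_CACHE_BITS)

typedef struct {
    uint64_t key;    /* Zobrist hash of the position */
    int      score;  /* cached static evaluation */
} eval_entry;

static eval_entry eval_cache[EVAL_CACHE_SIZE];

static int eval_full(uint64_t key)             /* stand-in for the slow eval */
{
    return (int)(key % 2001) - 1000;           /* deterministic dummy score */
}

int evaluate(uint64_t key)
{
    eval_entry *e = &eval_cache[key & (EVAL_CACHE_SIZE - 1)];
    if (e->key == key)
        return e->score;                       /* hit: skip the slow eval */
    e->key   = key;                            /* miss: compute and store */
    e->score = eval_full(key);
    return e->score;
    /* Caveat: key 0 collides with the zero-initialized empty entries;
     * a real engine reserves a sentinel or validates differently. */
}
```

If the eval costs a large multiple of the ~1200-cycle memory probe, as the thread argues for Fritz, a hit rate of 50-60% makes this lookup a clear win; for a tiny eval like Crafty's the probe costs more than it saves.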
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.