Computer Chess Club Archives

Subject: Re: Some new hyper-threading info.

Author: Vasik Rajlich
Date: 02:31:53 04/17/04
On April 16, 2004 at 10:08:34, Vincent Diepeveen wrote:

>On April 16, 2004 at 05:47:42, Vasik Rajlich wrote:
>
>>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>>
>>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>>
>>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>>
>>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>>
>>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>>
>>>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>>
>>>>>>>Point 1.  HT now speeds Crafty up between 5 and 10% max.  A year ago this was
>>>>>>>30%.  What did I learn?  Nothing new.  Memory waits benefit HT.  Eugene and I
>>>>>>>worked on removing several shared memory interactions which led to better cache
>>>>>>>utilization, fewer cache invalidates (very slow) and improved performance a good
>>>>>>>bit.  But at the same time, now HT doesn't have the excessive memory waits it
>>>>>>>had before and so the speedup is not as good.
>>>>>>>
>>>>>>>Point 2.  HT now actually slows things down due to SMP overhead.  IE I lose 30%
>>>>>>>per CPU, roughly, due to SMP overhead.  HT now only gives 5-10% back.  This is a
>>>>>>>net loss.  I am now running my dual with HT disabled...
>>>>>>>
>>>>>>>More as I get more data...  Here are two data points, however:
>>>>>>>
>>>>>>>pos1.  cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>>>       cpus=4          NPS = 2.08M  time=28.76
>>>>>>>
>>>>>>>pos2.  cpus=2          NPS = 1.87M  time=58.48
>>>>>>>       cpus=4          NPS = 2.01M  time=66.00
>>>>>>>
>>>>>>>First pos HT helps almost none in NPS, costs 10 seconds in search overhead.
>>>>>>>Ugly.  Position 2 gives about 5% more nps, but again the SMP overhead washes
>>>>>>>that out and there is a net loss.  I should run the speedup tests several times,
>>>>>>>but the NPS numbers don't change much, and the speedup could change.  But this
>>>>>>>offers enough..
>>>>>>
>>>>>>
>>>>>>On a German board someone posted figures for the Fritzmark of Fritz 8. Fritz
>>>>>>still gains 25% from HT (in this specific position)
>>>>>>
>>>>>>cpus=2    NPS = 2.35
>>>>>>cpus=4    NPS = 2.95
>>>>>>
>>>>>>Unfortunately I have no information about search time.
>>>>>>
>>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>>
>>>>>>regards Joachim
>>>>>
>>>>>
>>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>>
>>>>Not at all.
>>>>
>>>>Fritz is currently hand-optimized P4 assembly. I expect him to be working hard
>>>>on a hand-optimized Opteron assembly version of Fritz now (he has probably been
>>>>working on it for a year already).
>>>
>>>Sorry, but you should stick to topics you know something about.  SMT works best
>>>in programs where there are memory reads/writes that stall a thread.  As you
>>>work out those stalls, SMT pays off less.  My current numbers clearly show
>>>this as opposed to the numbers I (and others) posted when I first got my SMT
>>>box...
>>>
>>>>
>>>>A possibility could be that in the last years Fritz's evaluation function has
>>>>become so much slower than it was that it most likely needs an eval hashtable,
>>>>just like I have used in DIEP for many years already.
>>>>
>>>>My guess is that it just uses more hashtables than crafty. Crafty isn't probing
>>>>in qsearch for example. DIEP is. DIEP is doing a lot more stuff in qsearch
>>>>than crafty. So using a transposition table there makes a lot more sense.
>>>
>>>That is possible.  However, as I said, it is a trade-off.  I took hash out of
>>>q-search and it was perfectly break-even.  Tree grew a bit but the search got
>>>proportionally faster.  No gain or loss.  Yet it results in lower bandwidth and
>>>with the PIV long cache line, it is probably (at least for Crafty) better than a
>>>break-even deal today.
>>>
>>>>
>>>>All commercial programs that i know (junior's search is so different that i
>>>>would bet it is not the case with junior) are doing checks in qsearch.
>>>
>>>But he does not even hash probe in last ply of normal search..
>>>
>>>And it appears he has no q-search.
>>>
>>
>>Why do you say this?
>
>I drew that conclusion a few years ago. It doesn't need to be the case nowadays
>in junior.
>
>>I guess the only alternative to q-search is some sort of an SEE at depth == 0.
>>Or is there some other possibility?
>
>Suppose last few plies you just do a tactical verification search or whatever
>and that you rely upon piece square tables.
>
>You can do a slow 'makemove' of course and then evaluate.
>
>You can also throw away the entire qsearch and make a small list of attacked
>pieces for white and attacked pieces of black.
>
>Then return the evaluation + canwin(side);
>
>Keep the canwin function simple.
>
>That's very quick.
>
>By the way this is already in a book I read by Jaap v/d Herik. Written around
>1984 or so...

I think the right way to handle this should be sometimes canwin, sometimes
q-search, depending on the position.

If the question is canwin vs q-search, then in my testing q-search is the big
winner.

I don't know if it's because q-search is tactically stronger, or because
q-search evaluates the actual positions you get.

Vas
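
To make the canwin vs q-search comparison above concrete, here is a minimal sketch on a toy position tree. It is not code from any of the engines discussed. The Node type, the canwin_eval helper (static eval plus the most valuable attacked piece) and the centipawn values are all assumptions made for illustration; the canwin function described in the quoted text may work differently.

```python
from dataclasses import dataclass, field

INF = 10**9

@dataclass
class Node:
    """Hypothetical toy position, scored from the side to move's point of view."""
    static_eval: int                              # centipawns
    captures: list = field(default_factory=list)  # (victim_value, child Node) pairs

def canwin_eval(node):
    """Static shortcut (assumed form): eval plus the biggest attacked piece, no search."""
    best_victim = max((value for value, _ in node.captures), default=0)
    return node.static_eval + best_victim

def qsearch(node, alpha, beta):
    """Fail-hard negamax quiescence search: stand pat, then try captures only."""
    stand_pat = node.static_eval
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for _, child in node.captures:
        score = -qsearch(child, -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

# Queen can take a pawn, but the pawn is defended and the queen is recaptured.
after_recapture = Node(static_eval=-800)                   # down queen for pawn
after_capture = Node(static_eval=-100, captures=[(900, after_recapture)])
defended_pawn = Node(static_eval=0, captures=[(100, after_capture)])

# Same pawn, but undefended: no recapture is available.
hanging_pawn = Node(static_eval=0, captures=[(100, Node(static_eval=-100))])

print(canwin_eval(defended_pawn), qsearch(defended_pawn, -INF, INF))
print(canwin_eval(hanging_pawn), qsearch(hanging_pawn, -INF, INF))
```

In this toy case the static shortcut credits the pawn in both positions, while q-search declines the defended pawn because it scores the position actually reached after the recapture. That is one way to picture the second explanation above; it says nothing about which approach is stronger in a real engine.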

>
>>>>
>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>
>>>>Evaltable would be my guess though. Doing a random lookup in a big hashtable on
>>>>a 400Mhz dual Xeon costs around 400ns when it is not in the cache.
>>>>
>>>
>>>
>>>
>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>>>
>>
>>Based on some reading that I did on the k8, it seemed that a memory lookup was
>>around 150 cycles there. (And 250 cycles on k7.)
>>
>>Did I misunderstand? Or does this number change when you use multiple
>>processors?
>>
>>If so, then hashing should be done differently on multiple processors than on
>>single processors. For example, ETC would behave differently.
>>
>>Vas
>>
>>>
>>>
>>>>That's even at 3Ghz just 1200 cycles.
>>>>
>>>>Assuming 1 mln nps at 3Ghz, 1 node costs 3000 cycles on average.
>>>>
>>>>Vast majority of nodes do not get evaluated at all of course.
>>>>
>>>>That shows that Fritz' eval needs a multiple of that for evaluation nowadays.
>>>>
>>>>When not storing eval in transpositiontable but only in a special eval table,
>>>>that will give a >= 50% lookuprate at evaltable (more likely 60%).
>>>>
>>>>So it makes sense to use an eval table for Fritz.
>>>>
>>>>Something crafty doesn't need as its eval is smaller than tiny.
>>>
>>>small != bad, however.
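
As a rough check of the arithmetic quoted in this thread, the sketch below converts the quoted lookup latencies (400 ns, or 150 ns with big memory pages) into cycles at 3 GHz and compares them with the average per-node budget at 1 mln nps. The function names are made up for illustration; the input figures are the ones posted above, not new measurements.

```python
NS_PER_SECOND = 1_000_000_000

def ns_to_cycles(latency_ns, clock_hz):
    """Convert a latency in nanoseconds to clock cycles at the given frequency."""
    return latency_ns * clock_hz // NS_PER_SECOND

def cycles_per_node(clock_hz, nodes_per_second):
    """Average clock cycles available per searched node."""
    return clock_hz // nodes_per_second

CLOCK_HZ = 3_000_000_000   # 3 GHz, as quoted
NPS = 1_000_000            # 1 mln nps, as quoted

budget = cycles_per_node(CLOCK_HZ, NPS)
for label, latency_ns in (("uncached lookup", 400), ("with big pages", 150)):
    cycles = ns_to_cycles(latency_ns, CLOCK_HZ)
    share = 100 * cycles // budget
    print(f"{label}: {cycles} cycles, {share}% of a {budget}-cycle node")
```

Under these figures a single uncached lookup is about 40% of the average node budget, and about 15% with the 150 ns figure.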
Last modified: Thu, 15 Apr 21 08:11:13 -0700