Computer Chess Club Archives


Subject: Re: Some new hyper-threading info.

Author: Anthony Cozzie

Date: 06:19:27 04/16/04


On April 16, 2004 at 05:47:42, Vasik Rajlich wrote:

>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>
>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>
>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>
>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>
>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>
>>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>
>>>>>>Point 1.  HT now speeds Crafty up between 5 and 10% max.  A year ago this was
>>>>>>30%.  What did I learn?  Nothing new.  Memory waits benefit HT.  Eugene and I
>>>>>>worked on removing several shared-memory interactions, which led to better cache
>>>>>>utilization, fewer cache invalidations (very slow) and improved performance a
>>>>>>good bit.  But at the same time, HT no longer has the excessive memory waits it
>>>>>>had before, and so the speedup is not as good.
>>>>>>
>>>>>>Point 2.  HT now actually slows things down due to SMP overhead.  I.e., I lose
>>>>>>roughly 30% per CPU to SMP overhead, and HT now only gives 5-10% back.  That is
>>>>>>a net loss.  I am now running my dual with HT disabled...
>>>>>>
>>>>>>More as I get more data...  Here are two data points, however:
>>>>>>
>>>>>>pos1.  cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>>       cpus=4          NPS = 2.08M  time=28.76
>>>>>>
>>>>>>pos2.  cpus=2          NPS = 1.87M  time=58.48
>>>>>>       cpus=4          NPS = 2.01M  time=66.00
>>>>>>
>>>>>>In the first position HT helps NPS almost none and costs 10 seconds in search
>>>>>>overhead.  Ugly.  Position 2 gives about 7% more NPS, but again the SMP overhead
>>>>>>washes that out and there is a net loss.  I should run the speedup tests several
>>>>>>times; the NPS numbers don't change much, though the speedup could.  But this
>>>>>>offers enough...
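
(To make the net effect explicit from the two data points above: pos1 runs at
essentially the same NPS with HT, 2.08M vs 2.07M, yet takes 28.76s instead of
18.13s, roughly 59% more wall time; pos2 gains about 7% NPS, 2.01M vs 1.87M,
yet still takes 66.00s instead of 58.48s, about 13% slower.  Both are clear net
losses.)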
>>>>>
>>>>>
>>>>>On a German board someone posted Fritzmark figures for Fritz 8.  Fritz still
>>>>>gains 25% from HT (in this specific position):
>>>>>
>>>>>cpus=2    NPS = 2.35
>>>>>cpus=4    NPS = 2.95
>>>>>
>>>>>Unfortunately I have no information about the search time.
>>>>>
>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>
>>>>>regards Joachim
>>>>
>>>>
>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>
>>>Not at all.
>>>
>>>Fritz is currently hand-optimized P4 assembly. I expect him to be working hard
>>>on a hand-optimized Opteron assembly version of Fritz now (probably already a
>>>year into it by now).
>>
>>Sorry, but you should stick to topics you know something about.  SMT works best
>>in programs where there are memory reads/writes that stall a thread.  As you
>>work out those stalls, SMT pays off less.  My current numbers clearly show this
>>as opposed to the numbers I (and others) posted when I first got my SMT box...
>>
>>>
>>>A possibility is that over the last year Fritz's evaluation function has become
>>>so much slower that it most likely needs an eval hashtable, just like the one I
>>>have already used in DIEP for many years.
>>>
>>>My guess is that it simply uses more hashtables than Crafty. Crafty isn't
>>>probing in qsearch, for example; DIEP is. DIEP does a lot more work in qsearch
>>>than Crafty, so using a transposition table there makes a lot more sense.
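
As a rough illustration of what "probing in qsearch" means, here is a minimal
sketch; the entry layout and the names (Position, tt, evaluate) are made up for
the example and are not DIEP's or Crafty's actual code:

  /* qsearch with a transposition-table probe at the top; each probe is
     one essentially random read into a large table, which is exactly the
     cache/TLB traffic being discussed in this thread */

  typedef struct { unsigned long long key; /* + board state */ } Position;

  typedef struct {
      unsigned long long key;     /* full key, for verification      */
      short              score;
      unsigned char      flag;    /* EXACT, LOWER or UPPER bound     */
  } TTEntry;

  enum { EXACT, LOWER, UPPER };
  #define TT_SIZE (1u << 22)                /* entries, power of two */
  extern TTEntry tt[TT_SIZE];
  extern int evaluate(const Position *pos); /* the slow static eval  */

  int qsearch(Position *pos, int alpha, int beta)
  {
      TTEntry *e = &tt[pos->key & (TT_SIZE - 1)];
      if (e->key == pos->key) {
          if (e->flag == EXACT)                      return e->score;
          if (e->flag == LOWER && e->score >= beta)  return e->score;
          if (e->flag == UPPER && e->score <= alpha) return e->score;
      }

      int stand_pat = evaluate(pos);
      if (stand_pat >= beta) return stand_pat;
      if (stand_pat > alpha) alpha = stand_pat;

      /* ...generate and search captures (and checks) as usual... */
      return alpha;
  }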
>>
>>That is possible.  However, as I said, it is a trade-off.  I took hashing out of
>>q-search and it was perfectly break-even.  The tree grew a bit but the search got
>>proportionally faster.  No gain or loss.  Yet it results in lower memory
>>bandwidth, and with the P4's long cache line it is probably (at least for Crafty)
>>better than a break-even deal today.
>>
>>>
>>>All commercial programs that I know of (Junior's search is so different that I
>>>would bet it is not the case with Junior) do checks in qsearch.
>>
>>But he does not even do a hash probe on the last ply of the normal search...
>>
>>And it appears he has no q-search.
>>
>
>Why do you say this?
>
>I guess the only alternative to q-search is some sort of an SEE at depth == 0.
>Or is there some other possibility?
>
>>>
>>>So a possible alternative to an eval table would be hashing in qsearch.
>>>
>>>An eval table would be my guess though. A random lookup into a big hashtable on
>>>a 400MHz dual Xeon costs around 400 ns when it is not in the cache.
>>>
>>
>>
>>
>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
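
For what it's worth, here is a sketch of one way to get those big pages on a
current Linux kernel.  The post doesn't say how Crafty actually allocates its
hash; MAP_HUGETLB is just an illustration, and it requires huge pages to have
been reserved by the administrator:

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <sys/mman.h>

  /* Try to back the hash table with 2MB pages so a random probe does not
     also pay for a TLB miss; fall back to normal 4KB pages on failure.
     'bytes' should be a multiple of the huge-page size.                 */
  static void *alloc_hash(size_t bytes)
  {
      void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED)
          p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
  }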
>>
>
>Based on some reading that I did on the k8, it seemed that a memory lookup was
>around 150 cycles there. (And 250 cycles on k7.)
>
>Did I misunderstand? Or does this number change when you use multiple
>processors?
>
>If so, then hashing should be done differently on multiple processors than on
>single processors. For example, ETC would behave differently.
>
>Vas
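
For readers who don't know the term: ETC (enhanced transposition cutoff) probes
the hash table for every successor of a node before searching any of them,
hoping a stored bound already refutes the node.  Each probe is another
essentially random memory access, which is why the latency numbers above matter
so much for it.  A rough sketch, with the helper names (make_move, unmake_move,
tt_probe) and the entry layout assumed rather than taken from any particular
program:

  typedef struct { unsigned long long key; /* + board state */ } Position;
  typedef int Move;
  typedef struct { short score, depth; unsigned char flag; } TTEntry;
  enum { EXACT, LOWER, UPPER };

  extern void     make_move(Position *pos, Move m);
  extern void     unmake_move(Position *pos, Move m);
  extern TTEntry *tt_probe(unsigned long long key);   /* NULL on a miss */

  /* Return 1 if some already-stored child result proves score >= beta. */
  int etc_cutoff(Position *pos, const Move *moves, int nmoves,
                 int depth, int beta)
  {
      for (int i = 0; i < nmoves; i++) {
          make_move(pos, moves[i]);
          TTEntry *e = tt_probe(pos->key);      /* one extra random read */
          unmake_move(pos, moves[i]);

          /* Child scores are from the opponent's point of view: an EXACT
             or UPPER bound s on the child means this move is worth at
             least -s to us.                                             */
          if (e && e->depth >= depth - 1 &&
              (e->flag == EXACT || e->flag == UPPER) && -e->score >= beta)
              return 1;                   /* fail high without searching */
      }
      return 0;
  }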

Memory latency is something like 100-150 ns on an Athlon XP.  The problem is when
you miss in the TLB: then the page-table walk costs you two more memory accesses
just to find the address of the data you wanted in the first place.
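
To put rough numbers on that (the table size here is just an example): a 256MB
hash table on 4KB pages spans 65,536 pages, while the TLB holds at most a few
hundred translations, so an essentially random probe misses the TLB almost every
time and pays for the page-table walk on top of the ordinary cache miss.  On 2MB
pages the same table is only 128 pages and the walk mostly disappears, which is
presumably where Bob's 150 ns figure comes from.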

Opteron latency varies depending on whether the memory is "close" (local to that
processor's node) or not.

anthony

>>
>>
>>>Even at 3GHz that is just 1200 cycles.
>>>
>>>Assuming 1 million NPS at 3GHz, one node costs 3000 cycles on average.
>>>
>>>The vast majority of nodes do not get evaluated at all, of course.
>>>
>>>That suggests Fritz's eval nowadays costs a multiple of that per evaluation.
>>>
>>>When the eval is stored not in the transposition table but only in a special
>>>eval table, that will give a >= 50% hit rate in the eval table (more likely 60%).
>>>
>>>So it makes sense to use an eval table for Fritz.
>>>
>>>Something Crafty doesn't need, as its eval is smaller than tiny.
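
A minimal sketch of the kind of eval table being described -- always-replace,
storing only the static eval -- with the names and layout made up for the
example rather than taken from DIEP:

  typedef struct { unsigned long long key; /* + board state */ } Position;
  extern int evaluate(const Position *pos);     /* the expensive eval */

  typedef struct { unsigned long long key; int score; } EvalEntry;

  #define EVAL_SIZE (1u << 20)                  /* entries            */
  static EvalEntry eval_table[EVAL_SIZE];

  /* Probe first; with a 50-60% hit rate, half or more of the slow
     evaluations are replaced by a single memory lookup.              */
  int cached_evaluate(const Position *pos)
  {
      EvalEntry *e = &eval_table[pos->key & (EVAL_SIZE - 1)];
      if (e->key == pos->key)
          return e->score;
      e->key   = pos->key;
      e->score = evaluate(pos);
      return e->score;
  }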
>>
>>small != bad, however.


