Computer Chess Club Archives


Subject: Re: Some new hyper-threading info.

Author: Vincent Diepeveen

Date: 07:13:16 04/16/04

On April 16, 2004 at 09:19:27, Anthony Cozzie wrote:

>On April 16, 2004 at 05:47:42, Vasik Rajlich wrote:
>
>>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>>
>>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>>
>>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>>
>>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>>
>>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>>
>>>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>>
>>>>>>>Point 1.  HT now speeds Crafty up between 5 and 10% max.  A year ago this was
>>>>>>>30%.  What did I learn?  Nothing new.  Memory waits benefit HT.  Eugene and I
>>>>>>>worked on removing several shared memory interactions which led to better cache
>>>>>>>utilization, fewer cache invalidations (which are very slow) and improved performance a good
>>>>>>>bit.  But at the same time, now HT doesn't have the excessive memory waits it
>>>>>>>had before and so the speedup is not as good.
>>>>>>>
>>>>>>>Point 2.  HT now actually slows things down: I lose roughly 30% per CPU due to
>>>>>>>SMP overhead, and HT now only gives 5-10% back.  This is a net loss.  I am now
>>>>>>>running my dual with HT disabled...
>>>>>>>
>>>>>>>More as I get more data...  Here are two data points, however:
>>>>>>>
>>>>>>>pos1.  cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>>>       cpus=4          NPS = 2.08M  time=28.76
>>>>>>>
>>>>>>>pos2.  cpus=2          NPS = 1.87M  time=58.48
>>>>>>>       cpus=4          NPS = 2.01M  time=66.00
>>>>>>>
>>>>>>>In the first position HT helps NPS almost not at all and costs 10 seconds in
>>>>>>>search overhead.  Ugly.  Position 2 gives about 5% more NPS, but again the SMP
>>>>>>>overhead washes that out and there is a net loss.  I should run the speedup
>>>>>>>tests several times; the NPS numbers don't change much, but the speedup could.
>>>>>>>Still, this offers enough..
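
A quick sanity check using only the two data points above: what matters is the
ratio of time-to-solution, not the raw NPS gain.  A minimal sketch using only
the numbers posted above (not part of Crafty):

    /* Net effect of HT from the numbers posted above:
       raw speed (NPS) gained vs. extra wall-clock time spent. */
    #include <stdio.h>

    int main(void) {
        /* { nps_2cpu, time_2cpu, nps_4cpu, time_4cpu } */
        double pos[2][4] = {
            { 2.07, 18.13, 2.08, 28.76 },   /* pos1 */
            { 1.87, 58.48, 2.01, 66.00 }    /* pos2 */
        };
        for (int i = 0; i < 2; i++) {
            double nps_gain  = pos[i][2] / pos[i][0] - 1.0;  /* NPS gained with HT on */
            double time_loss = pos[i][3] / pos[i][1] - 1.0;  /* extra time with HT on */
            printf("pos%d: NPS %+.1f%%, time-to-solution %+.1f%%\n",
                   i + 1, 100.0 * nps_gain, 100.0 * time_loss);
        }
        return 0;
    }

With these figures it prints roughly +0.5% NPS / +58.6% time for pos1 and
+7.5% NPS / +12.9% time for pos2, i.e. the net loss being described.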
>>>>>>
>>>>>>
>>>>>>On a German board someone posted figures for the Fritzmark of Fritz 8.  Fritz
>>>>>>still gains 25% from HT (in this specific position):
>>>>>>
>>>>>>cpus=2    NPS = 2.35
>>>>>>cpus=4    NPS = 2.95
>>>>>>
>>>>>>Unfortunately I have no information about search time.
>>>>>>
>>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>>
>>>>>>regards Joachim
>>>>>
>>>>>
>>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>>
>>>>Not at all.
>>>>
>>>>Fritz is currently hand-optimized P4 assembly. I expect him to be working hard
>>>>on a hand-optimized Opteron assembly version of Fritz now (he has probably
>>>>already been working on it for a year).
>>>
>>>Sorry, but you should stick to topics you know something about.  SMT works best
>>>in programs where there are memory reads/writes that stall a thread.  As you
>>>work out those stalls, SMT pays off less.  My current numbers clearly show
>>>this as opposed to the numbers I (and others) posted when I first got my SMT
>>>box...
>>>
>>>>
>>>>A possibility could be that last year's Fritz evaluation function has become so
>>>>much slower than it was that it most likely needs an eval hashtable, just like
>>>>the one I have been using in DIEP for many years already.
>>>>
>>>>My guess is that it just uses more hashtables than Crafty. Crafty isn't probing
>>>>in qsearch, for example. DIEP is. DIEP does a lot more work in qsearch than
>>>>Crafty, so using a transposition table there makes a lot more sense.
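
The trade-off under discussion (probing the transposition table inside the
quiescence search or not) looks roughly like the sketch below.  This is a
generic outline only; Position, Move, probe_hash(), store_hash(), evaluate(),
generate_captures(), make_move() and unmake_move() are hypothetical
placeholders, not Crafty's or DIEP's actual code.

    /* Generic quiescence search with an optional hash probe. */
    #define PROBE_HASH_IN_QSEARCH 1          /* the trade-off being debated */

    int qsearch(Position *pos, int alpha, int beta) {
        int orig_alpha = alpha;
    #if PROBE_HASH_IN_QSEARCH
        int tt_score;                        /* one extra memory access per node... */
        if (probe_hash(pos->zobrist_key, 0, alpha, beta, &tt_score))
            return tt_score;                 /* ...but a hit skips everything below */
    #endif
        int stand_pat = evaluate(pos);       /* static eval as a lower bound */
        if (stand_pat >= beta)
            return beta;
        if (stand_pat > alpha)
            alpha = stand_pat;

        Move moves[256];
        int n = generate_captures(pos, moves);   /* DIEP also searches checks here */
        for (int i = 0; i < n; i++) {
            make_move(pos, moves[i]);
            int score = -qsearch(pos, -beta, -alpha);
            unmake_move(pos, moves[i]);
            if (score >= beta) {
    #if PROBE_HASH_IN_QSEARCH
                store_hash(pos->zobrist_key, 0, beta, BOUND_LOWER);
    #endif
                return beta;
            }
            if (score > alpha)
                alpha = score;
        }
    #if PROBE_HASH_IN_QSEARCH
        store_hash(pos->zobrist_key, 0, alpha,
                   alpha > orig_alpha ? BOUND_EXACT : BOUND_UPPER);
    #endif
        return alpha;
    }

Whether the extra probe pays off is exactly the bandwidth argument that
follows: the tree shrinks a bit, but every qsearch node now touches a large
random-access table.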
>>>
>>>That is possible.  However, as I said, it is a trade-off.  I took hash out of
>>>q-search and it was perfectly break-even.  The tree grew a bit, but the search got
>>>proportionally faster.  No gain or loss.  Yet it results in lower bandwidth and
>>>with the PIV long cache line, it is probably (at least for Crafty) better than a
>>>break-even deal today.
>>>
>>>>
>>>>All commercial programs that I know of (Junior's search is so different that I
>>>>would bet it is not the case with Junior) do checks in qsearch.
>>>
>>>But he does not even do a hash probe in the last ply of normal search..
>>>
>>>And it appears he has no q-search.
>>>
>>
>>Why do you say this?
>>
>>I guess the only alternative to q-search is some sort of an SEE at depth == 0.
>>Or is there some other possibility?
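
"Some sort of an SEE" usually means the standard swap-list static exchange
evaluation.  A rough sketch of that idea, ignoring x-ray attackers; Board,
NO_PIECE, piece_value(), least_valuable_attacker() and remove_attacker() are
hypothetical placeholders, not any particular engine's code.

    /* SEE of the capture sequence on one square, initiated by side_to_move
       with its least valuable attacker (swap-list formulation).          */
    int see(Board *b, int target_sq, int side_to_move, int captured_piece) {
        int gain[32], d = 0;
        gain[0] = piece_value(captured_piece);
        int side = side_to_move;
        int attacker = least_valuable_attacker(b, target_sq, side);
        while (attacker != NO_PIECE && d < 31) {
            d++;
            /* speculative balance if the other side recaptures next */
            gain[d] = piece_value(attacker) - gain[d - 1];
            remove_attacker(b, target_sq, attacker, side);
            side = !side;
            attacker = least_valuable_attacker(b, target_sq, side);
        }
        while (--d > 0) {                    /* either side may stop the exchange */
            int best = gain[d] > -gain[d - 1] ? gain[d] : -gain[d - 1];
            gain[d - 1] = -best;
        }
        return gain[0];                      /* from side_to_move's point of view */
    }

At depth == 0 an engine could then return the static eval adjusted by the best
positive SEE capture instead of searching captures, which seems to be the
alternative being asked about here.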
>>
>>>>
>>>>So a possible alternative to an eval table would be hashing in qsearch.
>>>>
>>>>An eval table would be my guess, though. Doing a random lookup into a big
>>>>hashtable on a 400MHz dual Xeon costs around 400ns when it is not in the cache.
>>>>
>>>
>>>
>>>
>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
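
On Linux, the "big memory pages" idea looks roughly like the sketch below.
MAP_HUGETLB is a Linux-specific flag from kernels newer than those in this
thread; at the time the same effect needed hugetlbfs or other OS-specific
large-page APIs.  With 2 MB pages a multi-hundred-megabyte hash table fits in
a handful of TLB entries instead of thousands.

    /* Sketch: back the hash table with huge pages so random probes rarely
       miss the TLB; fall back to normal 4 KB pages if none are available.
       `bytes` should be a multiple of the huge page size.               */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static void *alloc_hash(size_t bytes) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {               /* no huge pages reserved? */
            fprintf(stderr, "huge pages unavailable, using normal pages\n");
            p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        }
        return p == MAP_FAILED ? NULL : p;
    }

    /* usage: void *tt = alloc_hash(256UL << 20);   256 MB = 128 x 2 MB pages */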
>>>
>>
>>Based on some reading that I did on the K8, it seemed that a memory lookup was
>>around 150 cycles there. (And 250 cycles on the K7.)
>>
>>Did I misunderstand? Or does this number change when you use multiple
>>processors?
>>
>>If so, then hashing should be done differently on multiple processors than on
>>single processors. For example, ETC would behave differently.
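
For reference, ETC (enhanced transposition cutoff) probes the hash table for
every child of a node before actually searching any of them, hoping one probe
already refutes the node.  Each probe is one more, usually uncached, memory
access, which is why the latency numbers in this thread decide whether it
pays.  A rough sketch; probe_hash(), make_move(), unmake_move() and the bound
constants are hypothetical placeholders, not any particular engine's API.

    /* Enhanced Transposition Cutoff: if any child position already has a
       stored score that refutes this node, cut off without searching.   */
    int etc_cutoff(Position *pos, Move *moves, int n, int depth, int beta) {
        for (int i = 0; i < n; i++) {
            int tt_score, tt_bound;
            make_move(pos, moves[i]);
            int hit = probe_hash(pos->zobrist_key, depth - 1, &tt_score, &tt_bound);
            unmake_move(pos, moves[i]);
            /* the child's score is from the opponent's point of view: an
               upper bound there becomes a lower bound for us when negated */
            if (hit && (tt_bound == BOUND_UPPER || tt_bound == BOUND_EXACT)
                    && -tt_score >= beta)
                return 1;                    /* n probes bought a free cutoff */
        }
        return 0;                            /* no luck: search normally      */
    }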
>>
>>Vas
>
>Memory latency is something like 100-150 ns on Athlon XP.  The problem is when
>you miss in the TLB: then you have to do two more memory lookups to find the
>stuff you wanted in the first place.

So total memory latency to fetch hashtable entries is about 400 ns on dual K7 MP
mainboards with 133MHz memory.

The same goes for dual P4s.

Single-CPU P4s are around 280 ns; the Opteron I haven't measured accurately yet,
as I only have an A64 here at the office.
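
Latency figures like these can be reproduced with a simple pointer-chasing
loop: walk a randomly permuted chain through a buffer far larger than the
caches, so every step is a dependent, uncached load, then divide the elapsed
time by the number of steps.  A rough, single-threaded sketch of that kind of
measurement:

    /* Pointer-chase latency test: every load depends on the previous one
       and misses cache (and usually the TLB), which approximates a random
       hashtable probe.                                                   */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ENTRIES (32u * 1024 * 1024)      /* 256 MB of 8-byte slots */
    #define STEPS   (16u * 1024 * 1024)

    static unsigned long long rng = 88172645463325252ULL;
    static size_t xrand(void) {              /* small xorshift PRNG */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return (size_t)rng;
    }

    int main(void) {
        size_t *chain = malloc((size_t)ENTRIES * sizeof *chain);
        if (!chain) return 1;
        for (size_t i = 0; i < ENTRIES; i++) chain[i] = i;
        for (size_t i = ENTRIES - 1; i > 0; i--) {   /* Sattolo: one big cycle */
            size_t j = xrand() % i;
            size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t i = 0; i < STEPS; i++) p = chain[p];   /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per random access (p=%zu)\n", ns / STEPS, p);
        free(chain);
        return 0;
    }

Whether the result lands near 150 ns or near 400 ns is largely the page-size
and TLB question discussed above.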

>Opteron latency varies depending on whether it is "close" memory or not.
>anthony
>
>>>
>>>
>>>>Even at 3GHz that is just 1200 cycles.
>>>>
>>>>One node costs on average, assuming 1 million nps at 3GHz, 3000 cycles.
>>>>
>>>>The vast majority of nodes do not get evaluated at all, of course.
>>>>
>>>>That shows that Fritz's eval nowadays needs a multiple of that per evaluation.
>>>>
>>>>When not storing the eval in the transposition table but only in a separate
>>>>eval table, that will give a >= 50% hit rate in the eval table (more likely 60%).
>>>>
>>>>So it makes sense to use an eval table for Fritz.
>>>>
>>>>Something Crafty doesn't need, as its eval is smaller than tiny.
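
The "eval table" described here is essentially a small cache keyed on the
position's hash, probed before running the full evaluation.  A minimal sketch
of the idea; full_evaluate() and the 64-bit key are hypothetical placeholders,
not Fritz's or DIEP's code.

    /* Direct-mapped evaluation cache: zobrist key -> static eval,
       probed before the expensive full evaluation.              */
    #include <stdint.h>

    #define EVAL_CACHE_SIZE (1u << 20)       /* ~1M entries, power of two */

    typedef struct {
        uint64_t key;                        /* zobrist key of the position */
        int      score;                      /* cached static evaluation    */
    } EvalEntry;

    static EvalEntry eval_cache[EVAL_CACHE_SIZE];

    /* the slow, real eval (placeholder; a real one takes the position, not a key) */
    int full_evaluate(uint64_t key);

    int cached_evaluate(uint64_t key) {
        EvalEntry *e = &eval_cache[key & (EVAL_CACHE_SIZE - 1)];
        if (e->key == key)                   /* hit: the >=50-60% case above */
            return e->score;
        int score = full_evaluate(key);      /* miss: pay the full eval once */
        e->key = key;                        /* always-replace scheme        */
        e->score = score;
        return score;
    }

Whether this pays off depends on the latency numbers in this thread: every
probe costs one (often uncached) memory access, so the full eval has to be a
good deal more expensive than that probe before a 50-60% hit rate is a clear
win.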
>>>
>>>small != bad, however.


