Computer Chess Club Archives



Subject: Re: Some new hyper-threading info.

Author: Anthony Cozzie

Date: 09:00:19 04/16/04



On April 16, 2004 at 10:05:04, Vincent Diepeveen wrote:

>On April 15, 2004 at 13:10:26, Robert Hyatt wrote:
>
>>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>>
>>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>>
>>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>>
>>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>>
>>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>>
>>>>>>Point 1.  HT now speeds Crafty up between 5 and 10% max.  A year ago this was
>>>>>>30%.  What did I learn?  Nothing new.  Memory waits benefit HT.  Eugene and I
>>>>>>worked on removing several shared-memory interactions, which led to better cache
>>>>>>utilization, fewer cache invalidations (very slow) and improved performance a
>>>>>>good bit.  But at the same time, HT no longer has the excessive memory waits it
>>>>>>had before, and so the speedup is not as good.
>>>>>>
>>>>>>Point 2.  HT now actually slows things down due to SMP overhead.  I.e., I lose
>>>>>>roughly 30% per CPU due to SMP overhead, and HT now only gives 5-10% back.  This
>>>>>>is a net loss.  I am now running my dual with HT disabled...
>>>>>>
>>>>>>More as I get more data...  Here are two data points, however:
>>>>>>
>>>>>>pos1.  cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>>       cpus=4          NPS = 2.08M  time=28.76
>>>>>>
>>>>>>pos2.  cpus=2          NPS = 1.87M  time=58.48
>>>>>>       cpus=4          NPS = 2.01M  time=66.00
>>>>>>
>>>>>>First position: HT helps almost none in NPS, and costs 10 seconds in search
>>>>>>overhead.  Ugly.  Position 2 gives about 5% more NPS, but again the SMP overhead
>>>>>>washes that out and there is a net loss.  I should run the speedup tests several
>>>>>>times; the NPS numbers don't change much, but the speedup could change.  Still,
>>>>>>this offers enough...
>>>>>
>>>>>
>>>>>On a German board someone posted figures for the Fritzmark of Fritz 8.  Fritz
>>>>>still gains 25% from HT (in this specific position):
>>>>>
>>>>>cpus=2    NPS = 2.35
>>>>>cpus=4    NPS = 2.95
>>>>>
>>>>>Unfortunately I have no information about search time.
>>>>>
>>>>>Does that mean Fritz 8 is poorly optimized?
>>>>>
>>>>>regards Joachim
>>>>
>>>>
>>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>>
>>>Not at all.
>>>
>>>Fritz is currently hand-optimized P4 assembly.  I expect him to be working hard
>>>on a hand-optimized Opteron assembly version of Fritz now (probably already a
>>>year at it by now).
>>
>>Sorry, but you should stick to topics you know something about.  SMT works best
>
>I guess this is your way of saying: "sorry, I did not consider that it was a
>more efficient program than Crafty, and that the better SMT was caused by more
>hash lookups than I had taken into account could be profitable".
>
>>in programs where there are memory reads/writes that stall a thread.  As you
>>work out those stalls, SMT pays off less.  My current numbers clearly show
>>this, as opposed to the numbers I (and others) posted when I first got my SMT
>>box...
>
>You do 1 lookup to RAM.  He's doing perhaps 3 lookups.
>
>You should do your math better before commenting on Fritz being inefficiently
>programmed.
>
>It was a grammar-school conclusion you drew.
>
>>>
>>>A possibility could be that Fritz's evaluation function has become so much
>>>slower over the last years that it most likely needs an eval hashtable, just
>>>like I have used in DIEP for many years already.
>>>
>>>My guess is that it just uses more hashtables than Crafty.  Crafty isn't probing
>>>in qsearch, for example.  DIEP is.  DIEP is doing a lot more stuff in qsearch
>>>than Crafty, so using a transposition table there makes a lot more sense.
>>
>>That is possible.  However, as I said, it is a trade-off.  I took hash out of
>>q-search and it was perfectly break-even.  The tree grew a bit, but the search got
>
>Now start doing checks in qsearch and prune... ...note that here you write the
>opposite of the first sentence in your posting.
>
>>proportionally faster.  No gain or loss.  Yet it results in lower bandwidth and
>>with the PIV long cache line, it is probably (at least for Crafty) better than a
>>break-even deal today.
>
>DIEP effectively loses 20% speed to hashtable lookups in qsearch, but the net
>result is that it is about 20% faster than not doing them.  So the savings must
>be around 40%, roughly.
>
>>>
>>>All commercial programs that I know of (Junior's search is so different that I
>>>would bet it is not the case with Junior) are doing checks in qsearch.
>>
>>But he does not even hash-probe in the last ply of normal search...
>>And it appears he has no q-search.
>
>Junior is a very different case indeed.  My guess would be that they do without
>qsearch, using some dumb static exchange function instead.
>
>They never commented here upon it.
>
>I consider their search outdated.
>
>>>
>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>
>>>Evaltable would be my guess, though.  Doing a random lookup into a big hashtable
>>>on a 400 MHz dual Xeon costs around 400 ns when it is not in the cache.
>>>
>>
>>
>>
>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>
>We all know that when things are in L2 cache it's faster.  However, when using
>hashtables you are by definition busy with TLB thrashing.
>
>Use big hashtables, start doing lookups into them, and it costs on average
>400 ns on a dual K7 and your dual Xeon.  Saying that under conditions A and B,
>which hardly ever happen, it is 150 ns makes no sense.  When it's in L2 cache
>on an Opteron it just costs 13 cycles.  Same type of comparison.
>
>The 400 ns is just for 8 bytes.  It's slower when you fetch more bytes...

Bob is talking about Linux's hugetlbfs.  x86 contains a spec for 4MB pages
(usually used by the OS).  With the larger pages you can cover a lot more memory
in the TLB and therefore miss a lot less.  Unfortunately, the Opteron appears to
have only 8 TLB entries available for the huge pages, but I think the P4 can
make any entry a 4MB (as opposed to the usual 4KB) page.

anthony

>>
>>>That's just 1200 cycles, even at 3 GHz.
>>>
>>>One node costs on average, assuming 1 million NPS at 3 GHz: 3000 cycles.
>>>
>>>The vast majority of nodes do not get evaluated at all, of course.
>>>
>>>That shows that Fritz's eval nowadays needs a multiple of that for evaluation.
>>>
>>>When not storing eval in the transposition table but only in a special eval
>>>table, that will give a >= 50% lookup rate at the eval table (more likely 60%).
>>>
>>>So it makes sense to use an eval table for Fritz.
>>>
>>>Something Crafty doesn't need, as its eval is smaller than tiny.
>>
>>small != bad, however.
>
>And that from an American mouth :)



Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.