Computer Chess Club Archives


Subject: Re: Some new hyper-threading info.

Author: Vincent Diepeveen

Date: 07:05:04 04/16/04


On April 15, 2004 at 13:10:26, Robert Hyatt wrote:

>On April 15, 2004 at 12:45:23, Vincent Diepeveen wrote:
>
>>On April 15, 2004 at 09:01:44, Robert Hyatt wrote:
>>
>>>On April 15, 2004 at 06:05:15, Joachim Rang wrote:
>>>
>>>>On April 14, 2004 at 22:49:39, Robert Hyatt wrote:
>>>>
>>>>>I just finished some HT on / HT off tests to see how things have changed in
>>>>>Crafty since some of the recent NUMA-related memory changes that were made.
>>>>>
>>>>>Point 1.  HT now speeds Crafty up between 5 and 10% max.  A year ago this was
>>>>>30%.  What did I learn?  Nothing new.  Memory waits benefit HT.  Eugene and I
>>>>>worked on removing several shared memory interactions which led to better cache
>>>>>utilization, fewer cache invalidates (very slow) and improved performance a good
>>>>>bit.  But at the same time, now HT doesn't have the excessive memory waits it
>>>>>had before and so the speedup is not as good.
>>>>>
>>>>>Point 2.  HT now actually slows things down due to SMP overhead.  IE I lose 30%
>>>>>per CPU, roughly, due to SMP overhead.  HT now only gives 5-10% back.  This is a
>>>>>net loss.  I am now running my dual with HT disabled...
>>>>>
>>>>>More as I get more data...  Here are two data points, however:
>>>>>
>>>>>pos1.  cpus=2 (no HT)  NPS = 2.07M  time=18.13
>>>>>       cpus=4          NPS = 2.08M  time=28.76
>>>>>
>>>>>pos2.  cpus=2          NPS = 1.87M  time=58.48
>>>>>       cpus=4          NPS = 2.01M  time=66.00
>>>>>
>>>>>First pos HT helps almost none in NPS, costs 10 seconds in search overhead.
>>>>>Ugly.  Position 2 gives about 5% more nps, but again the SMP overhead washes
>>>>>that out and there is a net loss.  I should run the speedup tests several times,
>>>>>but the NPS numbers don't change much, and the speedup could change.  But this
>>>>>offers enough..
>>>>
>>>>
>>>>On a German board someone posted figures for the Fritzmark of Fritz 8. Fritz
>>>>still gains 25% from HT (in this specific position):
>>>>
>>>>cpus=2    NPS = 2.35
>>>>cpus=4    NPS = 2.95
>>>>
>>>>Unfortunately I have no information about search time.
>>>>
>>>>Does that mean Fritz 8 is poorly optimized?
>>>>
>>>>regards Joachim
>>>
>>>
>>>It means it has some cache issues that can be fixed to speed it up further, yes.
>>
>>Not at all.
>>
>>Fritz is currently hand-optimized P4 assembly. I expect him to be working hard on a
>>hand-optimized Opteron assembly version of Fritz now (he has probably already been
>>working on it for a year by now).
>
>Sorry, but you should stick to topics you know something about.  SMT works best

I guess this is your way of saying: "sorry, I did not consider that it is a more
efficient program than Crafty, and that the better SMT gain was caused by it doing
more hash lookups than I had realized could be profitable".

>in programs where there are memory reads/writes that stall a thread.  As you
>work out those stalls, SMT pays off with less of a gain.  My current numbers clearly show
>this as opposed to the numbers I (and others) posted when I first got my SMT
>box...

You do 1 lookup to RAM. He's doing perhaps 3 lookups.

You should do your math better before commenting on Fritz being inefficiently
programmed.

It was a grammar school conclusion you drew.

>>
>>A possibility could be that over the last years the Fritz evaluation function has
>>become so much slower than it was that it most likely needs an eval hashtable,
>>just like the one I have been using in DIEP for many years already.
>>
>>My guess is that it just uses more hashtables than Crafty. Crafty isn't probing
>>in qsearch, for example. DIEP is. DIEP is doing a lot more stuff in qsearch
>>than Crafty, so using a transposition table there makes a lot more sense.
>
>That is possible.  However, as I said, it is a trade-off.  I took hash out of
>q-search and it was perfectly break-even.  Tree grew a bit but the search got

Now start doing checks in qsearch and prune... Note that what you write here is the
opposite of the first sentence in your posting.

>proportionally faster.  No gain or loss.  Yet it results in lower bandwidth and
>with the PIV long cache line, it is probably (at least for Crafty) better than a
>break-even deal today.

Diep effectively loses 20% speed to hashtable lookups in qsearch, but the net
result is that it is about 20% faster than not doing them. So the node savings must
be roughly 40%.
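
As a back-of-the-envelope check, here is a minimal sketch of that arithmetic in C
(the two 20% figures are simply the ones above, read as an NPS factor and a
time-to-depth factor; nothing else is measured):

#include <stdio.h>

int main(void) {
    /* Figures quoted above: probing in qsearch costs ~20% raw speed,
       but the program ends up ~20% faster to the same depth.          */
    double nps_factor  = 0.80;        /* 20% NPS lost to the probes          */
    double time_factor = 1.0 / 1.20;  /* "20% faster" read as a 1.2x speedup */

    /* nodes = nps * time, so the node-count ratio is the product. */
    double node_factor = nps_factor * time_factor;

    printf("tree searched is %.0f%% of the unhashed tree\n", 100.0 * node_factor);
    printf("i.e. roughly %.0f%% of the nodes are saved\n",
           100.0 * (1.0 - node_factor));
    return 0;
}

This prints about 67% and 33%; reading "20% faster" as "20% less time" instead gives
0.8 * 0.8 = 0.64, i.e. roughly 36% saved. Either way it is in the ballpark of the
figure above.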

>>
>>All commercial programs that I know of (Junior's search is so different that I
>>would bet it is not the case with Junior) are doing checks in qsearch.
>
>But he does not even hash probe in the last ply of normal search..
>And it appears he has no q-search.

Junior is a very different case indeed. My guess would be that they do without a
qsearch and instead use some dumb static exchange function.

They have never commented upon it here.

I consider their search outdated.

>>
>>So a possible alternative to evaltable would be hashing in qsearch.
>>
>>Evaltable would be my guess though. Doing a random lookup to a big hashtable on a
>>400MHz dual Xeon costs, when it is not in the cache, around 400 ns.
>>
>
>
>
>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.

We all know that when things are in the L2 cache it's faster. However, when using
hashtables you are by definition busy with TLB thrashing.

Use big hashtables, start doing lookups into them, and it costs on average 400 ns
on a dual K7 and on your dual Xeon. Saying that under conditions A and B, which
hardly ever happen, it is 150 ns makes no sense. When it's in the L2 cache on an
Opteron it costs just 13 cycles; same type of comparison.

The 400 ns is just for 8 bytes. It's slower when you fetch more bytes...
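
That order of magnitude is easy to measure for yourself. Below is a minimal sketch
of a random-access latency test in C (not anyone's actual benchmark; the table size,
iteration count and timing method are arbitrary illustration choices): chase a
random permutation through a table far larger than the L2 cache and the TLB reach,
and divide the elapsed time by the number of dependent 8-byte loads.

/* Random-access latency sketch: each load depends on the previous one,
 * so hardware prefetching cannot hide the miss.                        */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define ENTRIES  (32u * 1024u * 1024u)   /* 32M entries * 8 bytes = 256 MB */
#define STEPS    (16u * 1024u * 1024u)   /* dependent loads to time        */

int main(void) {
    uint64_t *table = malloc((size_t)ENTRIES * sizeof *table);
    if (!table) return 1;

    /* Build one big random cycle (Sattolo's algorithm). */
    for (uint64_t i = 0; i < ENTRIES; i++) table[i] = i;
    srand(12345);
    for (uint64_t i = ENTRIES - 1; i > 0; i--) {
        uint64_t j = (uint64_t)rand() % i;                 /* j in [0, i) */
        uint64_t tmp = table[i]; table[i] = table[j]; table[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t pos = 0;
    for (uint64_t s = 0; s < STEPS; s++) pos = table[pos];  /* pointer chase */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* pos is printed so the compiler cannot optimize the chase away. */
    printf("avg %.1f ns per 8-byte lookup (pos=%llu)\n",
           ns / STEPS, (unsigned long long)pos);
    free(table);
    return 0;
}

How far above the raw DRAM latency the result lands depends mostly on page size and
TLB behaviour, which is exactly what the 150 ns versus 400 ns disagreement above is
about.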

>
>>Even at 3GHz that's just 1200 cycles.
>>
>>Assuming 1 mln nps at 3GHz, 1 node costs on average 3000 cycles.
>>
>>The vast majority of nodes do not get evaluated at all, of course.
>>
>>That shows that Fritz's eval nowadays needs a multiple of that per evaluation.
>>
>>When not storing the eval in the transposition table but only in a special eval
>>table, that will give a >= 50% lookup rate at the eval table (more likely 60%).
>>
>>So it makes sense to use an eval table for Fritz.
>>
>>Something Crafty doesn't need, as its eval is smaller than tiny.
>
>small != bad, however.

And that from an American mouth :)
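
For completeness, here is a minimal sketch of the eval-table idea discussed above.
This is not DIEP's or Fritz's code; the table size, the stand-in Evaluate() and the
made-up position keys are all illustration-only assumptions. The point is simply
that one cheap probe keyed on the position can skip the expensive evaluation on a
hit, which is why a slow eval profits so much from it.

#include <stdio.h>
#include <stdint.h>

#define EVAL_CACHE_BITS  20                      /* 1M entries, 16 MB       */
#define EVAL_CACHE_SIZE  (1u << EVAL_CACHE_BITS)

typedef struct {
    uint64_t key;     /* Zobrist-style key of the position */
    int32_t  score;   /* cached static evaluation          */
} EvalEntry;

static EvalEntry eval_cache[EVAL_CACHE_SIZE];    /* always-replace scheme   */
static long full_evals;                          /* how often we really evaluated */

/* Stand-in for an engine's expensive evaluation function. */
static int Evaluate(uint64_t key) {
    full_evals++;
    return (int)(key % 200) - 100;               /* fake score in centipawns */
}

static int CachedEvaluate(uint64_t key) {
    EvalEntry *e = &eval_cache[key & (EVAL_CACHE_SIZE - 1)];
    if (e->key == key)
        return e->score;                         /* hit: skip the full eval  */
    int score = Evaluate(key);                   /* miss: compute and store  */
    e->key = key;
    e->score = score;
    return score;
}

int main(void) {
    /* Revisit a couple of made-up position keys, the way a search does
       via transpositions: the second visit of each key is a cache hit.  */
    uint64_t keys[] = { 0x9d39247e33776d41ull, 0x2af7398005aaa5c7ull,
                        0x44db015024623547ull, 0x9d39247e33776d41ull,
                        0x2af7398005aaa5c7ull };
    for (int i = 0; i < 5; i++)
        printf("probe %d -> score %d\n", i, CachedEvaluate(keys[i]));
    printf("full evaluations run: %ld of 5 probes\n", full_evals);  /* 3 of 5 */
    return 0;
}

In a real engine the hit rate comes from transpositions and from the same position
being evaluated at different points of the search, and whether the probe pays off
is exactly the latency trade-off argued about above.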


