Computer Chess Club Archives

Subject: Re: OT -> if you had the choice...

Author: Robert Hyatt
Date: 06:38:34 11/09/03
On November 09, 2003 at 01:07:26, Eugene Nalimov wrote:

>On November 08, 2003 at 21:40:20, Robert Hyatt wrote:
>
>>On November 08, 2003 at 14:55:15, Robert Hyatt wrote:
>>
>>>On November 05, 2003 at 18:57:08, Eugene Nalimov wrote:
>>>
>>>>On November 05, 2003 at 18:17:03, Robert Hyatt wrote:
>>>>
>>>>>On November 05, 2003 at 16:41:51, Eugene Nalimov wrote:
>>>>>
>>>>>>On November 05, 2003 at 09:54:13, Robert Hyatt wrote:
>>>>>>
>>>>>>>On November 05, 2003 at 05:22:16, Ed Schröder wrote:
>>>>>>>
>>>>>>>>If you had the choice between:
>>>>>>>>
>>>>>>>>1) AMD Opteron 244, 1.8 GHz, S-940 Box
>>>>>>>>
>>>>>>>>and:
>>>>>>>>
>>>>>>>>2) AMD MP 2600+, 266 MHz
>>>>>>>>
>>>>>>>>then what would be the best choice regarding speed?
>>>>>>>>
>>>>>>>>I wonder...
>>>>>>>>
>>>>>>>>Ed
>>>>>>>
>>>>>>>For me, I'd take the Opteron.
>>>>>>>
>>>>>>>Crafty gets about 2M nps on a 1.8GHz Opteron...  single processor.
>>>>>>
>>>>>>Not exactly. Following are 2 log files from (a new version of) Crafty running
>>>>>>on a 1.8GHz quad Opteron system. Run times vary from run to run, but those are
>>>>>>typical ones:
>>>>>>
>>>>>>1 CPU:  1,762knps
>>>>>>4 CPUs: 6,856knps
>>>>>
>>>>>OK... I had done the calculation wrong.  I thought that 6.8M for 4 was
>>>>>basically 3.2X faster than 1, due to the NUMA scaling issues.  It looks
>>>>>from the above that it is now scaling almost 4:1 which is great.  :)
>>>>>
>>>>>Now if my dual xeon would just scale 2.0  :)
>>>>
>>>>What is the current number? I believe we improved it when you made some global
>>>>data per-thread, no?
>>>>
>>>>Thanks,
>>>>Eugene
>>>
>>>
>>>Looks better (I just tested).  Seems to be back to the magic
>>>1.9X (raw NPS is 1.9X faster with two processors than with
>>>one).
>>>
>>>Here's the raw data.
>>>
>>>one cpu:
>>>
>>>             time=1:25  cpu=99%  mat=0  n=85541805  fh=91%  nps=998k
>>>             time=55.41  cpu=99%  mat=0  n=62193826  fh=95%  nps=1122k
>>>             time=1:40  cpu=99%  mat=-1  n=89355667  fh=94%  nps=886k
>>>             time=1:18  cpu=99%  mat=0  n=82339318  fh=92%  nps=1050k
>>>
>>>two cpus (SMT off):
>>>             time=49.12  cpu=198%  mat=0  n=91626204  fh=91%  nps=1865k
>>>             time=27.55  cpu=198%  mat=0  n=58868942  fh=95%  nps=2136k
>>>             time=1:00  cpu=198%  mat=-1  n=101092946  fh=94%  nps=1669k
>>>             time=45.56  cpu=197%  mat=0  n=89351627  fh=92%  nps=1961k
>>>
>>>four cpus (SMT on):
>>>              time=50.32  cpu=392%  mat=0  n=105665041  fh=91%  nps=2099k
>>>              time=23.92  cpu=388%  mat=0  n=57409674  fh=95%  nps=2400k
>>>              time=57.60  cpu=392%  mat=-1  n=108568676  fh=93%  nps=1884k
>>>              time=40.88  cpu=396%  mat=0  n=91017384  fh=92%  nps=2226k
>>
>>
>>I didn't have time to analyze the data above, but I notice that since I have
>>been doing the NUMA-specific fixes, which also have to do with cache coherency
>>issues, my SMT performance is no longer what it was a while back.  I.e., from
>>the raw NPS numbers, it seems to be about 10% faster now with SMT on than off.
>>Probably explained by the less frequent cache line loading for a specific shared
>>variable that was causing problems earlier...  SMT on is still faster with a
>>parallel search, for me, but the difference is not as stark as it was 6 months
>>ago when this topic came up initially...
>
>I hope that your SMT nps didn't worsen, right? Just your non-SMT nps went up?

Correct.  The 4-processor test above was with 2 physical processors, 4 logical
processors.  The 4-processor test saw a roughly 10% raw NPS improvement over the
2-processor (SMT off) test.
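
For anyone who wants to check the ratios, here is a minimal Python sketch using only the nps figures from the log lines quoted above. The variable and function names are made up for illustration and have nothing to do with Crafty itself. On these four positions the 2-CPU runs come out at roughly 1.9X the 1-CPU runs, and SMT on adds about 12-13% over SMT off.

```python
# NPS figures (in thousands) copied from the log lines quoted above.
nps_1cpu = [998, 1122, 886, 1050]
nps_2cpu_smt_off = [1865, 2136, 1669, 1961]
nps_4cpu_smt_on = [2099, 2400, 1884, 2226]


def ratios(new, old):
    """Per-position ratio of new NPS to old NPS."""
    return [n / o for n, o in zip(new, old)]


two_vs_one = ratios(nps_2cpu_smt_off, nps_1cpu)
smt_gain = ratios(nps_4cpu_smt_on, nps_2cpu_smt_off)

for i, (s, g) in enumerate(zip(two_vs_one, smt_gain), start=1):
    print(f"position {i}: 2 CPUs vs 1 = {s:.2f}x, "
          f"SMT on vs off = {(g - 1) * 100:+.1f}%")
```

Raw NPS says nothing about search overhead, so these are throughput ratios only, not time-to-depth speedups.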

>If
>so, the explanation is simple -- you have fewer cache conflicts, so your thread
>is usually not blocked, so there are fewer "idle" resources to be utilized by
>another thread.
>
>The best SMT numbers I observed were achieved either on programs with a lot of
>unpredictable branches (e.g. a (de)compressor, where with a good algorithm the
>branches are unpredictable -- otherwise there would be some regularity that
>could be used to obtain a better compression ratio), or with server-like code
>with a *lot* of cache misses (and unpredictable branches as well).
>
>Thanks,
>Eugene

That was my thought on the reduced improvement as well.  I used to see a 20-30%
raw NPS increase with hyperthreading enabled.  Now it is down to about 10%, but
it is still worthwhile: on three of the four positions given above, the elapsed
time was lower with 4 (logical) processors than with 2.
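
The elapsed-time comparison can be checked the same way. This is a small Python sketch, assuming the `time=` field is either plain seconds or minutes:seconds (which matches the one-cpu line where time=1:25 goes with about 85M nodes at roughly 1M nps). The helper name `to_seconds` is hypothetical.

```python
def to_seconds(t):
    """Convert a time field such as "49.12" or "1:00" to seconds.

    Assumes a colon means minutes:seconds.
    """
    if ":" in t:
        minutes, seconds = t.split(":")
        return int(minutes) * 60 + float(seconds)
    return float(t)


# time= fields copied from the two-cpu (SMT off) and four-cpu (SMT on) runs.
two_cpu = ["49.12", "27.55", "1:00", "45.56"]
four_cpu = ["50.32", "23.92", "57.60", "40.88"]

faster = [to_seconds(b) < to_seconds(a) for a, b in zip(two_cpu, four_cpu)]
print(sum(faster), "of", len(faster), "positions finished sooner with SMT on")
```

Only the first position took longer with SMT on (50.32 vs 49.12 seconds).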
Last modified: Thu, 15 Apr 21 08:11:13 -0700