Author: Robert Hyatt
Date: 06:38:34 11/09/03
On November 09, 2003 at 01:07:26, Eugene Nalimov wrote:

>On November 08, 2003 at 21:40:20, Robert Hyatt wrote:
>
>>On November 08, 2003 at 14:55:15, Robert Hyatt wrote:
>>
>>>On November 05, 2003 at 18:57:08, Eugene Nalimov wrote:
>>>
>>>>On November 05, 2003 at 18:17:03, Robert Hyatt wrote:
>>>>
>>>>>On November 05, 2003 at 16:41:51, Eugene Nalimov wrote:
>>>>>
>>>>>>On November 05, 2003 at 09:54:13, Robert Hyatt wrote:
>>>>>>
>>>>>>>On November 05, 2003 at 05:22:16, Ed Schröder wrote:
>>>>>>>
>>>>>>>>If you have the choice between:
>>>>>>>>
>>>>>>>>1) AMD Opteron 244, 1.8 Ghz, S-940 Box
>>>>>>>>
>>>>>>>>and:
>>>>>>>>
>>>>>>>>2) AMD MP 2600+, 266Mhz
>>>>>>>>
>>>>>>>>then what would be the best choice regarding speed.
>>>>>>>>
>>>>>>>>I wonder...
>>>>>>>>
>>>>>>>>Ed
>>>>>>>
>>>>>>>for me, I'd take the opteron.
>>>>>>>
>>>>>>>Crafty gets about 2M nps on a 1.8ghz opteron... single processor.
>>>>>>
>>>>>>Not exactly. Following are 2 log files from (new version of) Crafty running on
>>>>>>1.8GHz quad Opteron system. Run times vary from run to run, but those are typical
>>>>>>ones
>>>>>>
>>>>>>1 CPU: 1,762knps
>>>>>>4 CPUs: 6,856knps
>>>>>
>>>>>OK... I had done the calculation wrong. I thought that 6.8M for 4 was
>>>>>basically 3.2X faster than 1, due to the NUMA scaling issues. It looks
>>>>>from the above that it is now scaling almost 4:1 which is great. :)
>>>>>
>>>>>Now if my dual xeon would just scale 2.0 :)
>>>>
>>>>What is current number? I believe we improved it when you made some global
>>>>per-thread one, no?
>>>>
>>>>Thanks,
>>>>Eugene
>>>
>>>
>>>Looks better (I just tested.) Seems to be back to the magic
>>>1.9X (raw NPS is 1.9X faster with two processors than with
>>>1).
>>>
>>>Here's the raw data.
>>>
>>>one cpu:
>>>
>>>    time=1:25   cpu=99%   mat=0   n=85541805   fh=91%  nps=998k
>>>    time=55.41  cpu=99%   mat=0   n=62193826   fh=95%  nps=1122k
>>>    time=1:40   cpu=99%   mat=-1  n=89355667   fh=94%  nps=886k
>>>    time=1:18   cpu=99%   mat=0   n=82339318   fh=92%  nps=1050k
>>>
>>>two cpus (SMT off):
>>>    time=49.12  cpu=198%  mat=0   n=91626204   fh=91%  nps=1865k
>>>    time=27.55  cpu=198%  mat=0   n=58868942   fh=95%  nps=2136k
>>>    time=1:00   cpu=198%  mat=-1  n=101092946  fh=94%  nps=1669k
>>>    time=45.56  cpu=197%  mat=0   n=89351627   fh=92%  nps=1961k
>>>
>>>four cpus (SMT on):
>>>    time=50.32  cpu=392%  mat=0   n=105665041  fh=91%  nps=2099k
>>>    time=23.92  cpu=388%  mat=0   n=57409674   fh=95%  nps=2400k
>>>    time=57.60  cpu=392%  mat=-1  n=108568676  fh=93%  nps=1884k
>>>    time=40.88  cpu=396%  mat=0   n=91017384   fh=92%  nps=2226k
>>
>>
>>I didn't have time to analyze the data above, but I notice that since I have
>>been doing the NUMA-specific fixes, which also have to do with cache coherency
>>issues, my SMT performance is no longer what it was a while back. IE from
>>the raw NPS numbers, it seems to be about 10% faster now with SMT on than off.
>>Probably explained by the less frequent cache line loading for a specific shared
>>variable that was causing problems earlier... SMT on is still faster with a
>>parallel search, for me, but the difference is not as stark as it was 6 months
>>ago when this topic came up initially...
>
>I hope that your SMT nps didn't worsen, right? Just your non-SMT nps went up?

Correct. The 4-processor test above was with 2 physical processors, 4 logical
processors. The 4-processor test saw a roughly 10% raw NPS improvement over
the 2-processor (SMT off) test.

>If so, explanation is simple -- you have less cache conflicts, so your thread
>is usually not blocked, so there is less "idle" resources to be utilized by
>another thread.
>
>The best SMT numbers I observed were achieved either on program with lot of
>unpredictable branches (e.g. (de)compressor, where with good algorithm branches
>are unpredictable -- otherwise there would be some regularity, that can be used
>to obtain better compression ratio), or with server-like code with *lot* of
>cache misses (and unpredictable branches as well).
>
>Thanks,
>Eugene

That was my thought for the reduced improvement. I used to see a 20-30% raw NPS
increase with hyperthreading enabled. Now it is down to about 10%, but it is
still worthwhile: note that on three of the four positions given above, the
elapsed time was lower with 4 logical processors than with 2.
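
For anyone who wants to check the ratios, here is a minimal sketch that recomputes them from the log lines quoted above. The times and nps figures are copied from those lines; the names (RUNS, seconds, nps_ratios, mean) are just made up for this illustration and have nothing to do with Crafty itself.

```python
# (elapsed time, nps in thousands) copied from the log lines quoted in the post.
RUNS = {
    "1 cpu": [("1:25", 998), ("55.41", 1122), ("1:40", 886), ("1:18", 1050)],
    "2 cpus, SMT off": [("49.12", 1865), ("27.55", 2136), ("1:00", 1669), ("45.56", 1961)],
    "4 logical cpus, SMT on": [("50.32", 2099), ("23.92", 2400), ("57.60", 1884), ("40.88", 2226)],
}


def seconds(text):
    """Convert a log time field ('m:ss' or 'ss.ss') to seconds."""
    if ":" in text:
        minutes, secs = text.split(":")
        return int(minutes) * 60 + float(secs)
    return float(text)


def nps_ratios(faster, slower):
    """Per-position raw NPS ratio between two configurations."""
    return [f[1] / s[1] for f, s in zip(RUNS[faster], RUNS[slower])]


def mean(values):
    return sum(values) / len(values)


two_vs_one = nps_ratios("2 cpus, SMT off", "1 cpu")
smt_vs_two = nps_ratios("4 logical cpus, SMT on", "2 cpus, SMT off")

# Count the positions where the SMT-on run finished in less elapsed time.
smt_faster = sum(
    seconds(smt[0]) < seconds(two[0])
    for smt, two in zip(RUNS["4 logical cpus, SMT on"], RUNS["2 cpus, SMT off"])
)

print("2 cpus vs 1 cpu, mean NPS ratio: %.2f" % mean(two_vs_one))
print("SMT on vs SMT off, mean NPS ratio: %.2f" % mean(smt_vs_two))
print("positions with lower elapsed time, SMT on: %d of 4" % smt_faster)
```

Keep in mind that raw NPS ratios are not the same thing as search speedup: the node counts differ from run to run, which is why the elapsed times are compared separately.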
Last modified: Thu, 15 Apr 21 08:11:13 -0700