Author: Robert Hyatt
Date: 19:58:51 05/23/03
On May 22, 2003 at 23:29:25, Aaron Gordon wrote:

>On May 22, 2003 at 22:24:29, Robert Hyatt wrote:
>
>>On May 22, 2003 at 13:43:55, Tom Kerrigan wrote:
>>
>>>On May 21, 2003 at 22:20:57, Robert Hyatt wrote:
>>>
>>>>On May 21, 2003 at 15:48:46, Tom Kerrigan wrote:
>>>>
>>>>>On May 21, 2003 at 13:46:26, Robert Hyatt wrote:
>>>>>
>>>>>>On May 20, 2003 at 13:52:01, Tom Kerrigan wrote:
>>>>>>
>>>>>>>On May 20, 2003 at 00:26:49, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>Actually it _does_ surprise me. The basic idea is that HT provides improved
>>>>>>>>resource utilization within the CPU. IE would you prefer to have a dual 600MHz
>>>>>>>>or a single 1000MHz machine? I'd generally prefer the dual 600, although for
>>>>>>>
>>>>>>>You're oversimplifying HT. When HT is running two threads, each thread only gets
>>>>>>>half of the core's resources. So instead of your 1GHz vs. dual 600MHz situation,
>>>>>>>what you have is more like a 1GHz Pentium 4 vs. a dual 1GHz Pentium. The dual
>>>>>>>will usually be faster, but in many cases it will be slower, sometimes by a wide
>>>>>>>margin.
>>>>>>
>>>>>>Not quite. Otherwise how do you explain my NPS _increase_ when using a second
>>>>>>thread on a single physical CPU?
>>>>>>
>>>>>>The issue is that now things can be overlapped and more of the CPU core
>>>>>>gets utilized for a greater percent of the total run-time...
>>>>>>
>>>>>>If it were just 50-50 then there would be _zero_ improvement for perfect
>>>>>>algorithms, and a negative improvement for any algorithm with any overhead
>>>>>>whatsoever...
>>>>>>
>>>>>>And the 50-50 doesn't even hold true for all cases, as my test results have
>>>>>>shown, even though I have yet to find any reason for what is going on...
>>>>>
>>>>>Think a little bit before posting, Bob. I said that the chip's execution
>>>>>resources were evenly split, I didn't say that the chip's performance is evenly
>>>>>split. That's just stupid. You have to figure in how those execution resources
>>>>>are utilized and understand that adding more of these resources gives you
>>>>>diminishing returns.
>>>>>
>>>>>-Tom
>>>>
>>>>
>>>>You should follow your own advice. If resources are split "50-50" then how
>>>>can _my_ program produce a 70-30 split on occasion?
>>>>
>>>>It simply is _not_ possible.
>>>>
>>>>There is more to this than a simple explanation offers...
>>>
>>>Now you're getting off onto another topic here.
>>>
>>
>>Read backward. _I_ did not "change the topic".
>>
>>I said that I don't see how it is possible for HT to slow a program down.
>>
>>You said "50-50" resource allocation might be an explanation.
>>
>>I said "that doesn't seem plausible because I have at least one example of
>>two compute-bound threads that don't show a 50-50 balance on SMT."
>>
>>If Eugene is right, and I don't know as he was not sure and I haven't read
>>anything similar to what he mentioned, that _could_ explain it (ie if some
>>resources are split 50-50 between the two logical processors even if one
>>could use more than the other due to the particular application being run.
>>However that seems like a _bad_ design decision if it is true...) However
>>there are probably other plausible explanations as well. What is the _real_
>>explanation? That will likely take some time to figure out.
>>
>>
>>>Originally you were saying that it's impossible for HT to slow a program down
>>>unless there was something wrong with the algorithm.
>>
>>And based on testing here, I pretty well stick with that. I won't say there
>>is _no_ program that will run slower, but I haven't found one myself. And
>>again, to be clear, we are talking about one program, one thread. Run on
>>a machine with SMT on and SMT off. I've run that test repeatedly and can't
>>find any penalty for one thread when turning SMT on. And I do mean _no
>>penalty_ on anything I have tried. Kernel builds. Compiles. Running
>>Crafty. Running various compute-bound applications like NAMD, a big Monte Carlo
>>simulation, etc...
>>
>>The idea really doesn't make sense, IMHO.
>>
>>
>>>
>>>Now you're back to complaining about your 70-30 split, which is only related to
>>>the original topic because they both involve ratios like "50-50" and "70-30."
>>
>>That 70-30 was used simply to suggest that 50-50 is _not_ a "golden rule" in
>>SMT resource allocation, apparently. Nothing more.
>>
>>
>>
>>
>>>
>>>-Tom
>
>
>Hyatt, grab distributed.net's RC5-72 client, it supports multiple CPUs and
>every dual system I've seen it run on gets an exact 100% increase in
>nodes/second. Now, it only spawns 1 thread per processor & isn't memory
>intensive whatsoever (that I've seen, only CPU clock speed affects results). A
>P4 with HT gets HALF the speed of a P4 w/o HT in some of the results I've seen,
>if you get the time try to verify that for me. I would have figured this would
>have been one of the programs HT would shine at. Complete surprise to me... If
>you could, grab the Linux RC5-72 client at:

What are they measuring? IE running two copies _should_ see each copy run
about 1/2 as fast with SMT on, since each copy is getting roughly 50% of
available CPU core resources when running the same instruction streams.

Or do you mean something else?

>
>ftp://ftp.distributed.net/pub/dcti/current-client/dnetc-linux-x86-elf.tar.gz
>
>For those of you interested in running it in Windows, here is the Windows bin:
>ftp://ftp.distributed.net/pub/dcti/current-client/dnetc-win32-x86.zip
>
>To run the benchmark all you need to do is type, ./dnetc -benchmark
>This only uses one processor, you can configure it to display nodes & keys/sec
>as the "live rate". This will use all of the processors (automatically), here
>are the config files to test RC5 and OGR.
>
>dnetc.ini for RC5-72 showing the 'live rate'
>[parameters]
>id=test@test.com
>
>[misc]
>project-priority=RC5-72,OGR=0
>
>[display]
>progress-indicator=rate
>
>
>dnetc.ini for OGR-25 showing the 'live rate'
>id=test@test.com
>
>[misc]
>project-priority=OGR,RC5-72=0
>
>[display]
>progress-indicator=rate
>
>
>From what I understand RC5/OGR uses mostly shifting, and from what I've seen the
>P4 is extremely slow at that and HT may further hinder shifting. Just a guess
>anyway. If you'd like some results to compare to, here are some of my Win2k
>(slightly slower than the Linux binary under Red Hat 9) results..
>
>
>[Apr 03 04:35:29 UTC] RC5-72: Benchmark for core #5 (SS 2-pipe)
> 0.00:00:17.79 [8,235,738 keys/sec]
>
>[Apr 03 04:36:20 UTC] OGR: Benchmark for core #0 (GARSP 5.13-A)
> 0.00:00:16.98 [19,330,517 nodes/sec]
>
>This is a single Athlon XP at 2507MHz..
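
To make the "What are they measuring?" question concrete, here is a minimal Python sketch of the arithmetic only. It is not taken from the dnetc client and measures nothing. It assumes two identical compute-bound copies share one SMT core and split its throughput evenly; the function name smt_rates, the smt_gain parameter, and the 1,000,000 keys/sec baseline are all made up for illustration.

```python
def smt_rates(single_thread_rate, smt_gain=0.0):
    """Return (per_copy_rate, combined_rate) for two identical copies
    sharing one SMT core.

    single_thread_rate: rate of one copy running alone on the core.
    smt_gain: hypothetical fractional throughput gain from overlapping
              the two threads (0.0 means SMT adds nothing).
    """
    # Assumption: the core's total throughput grows by smt_gain and is
    # then split evenly between the two copies.
    combined = single_thread_rate * (1.0 + smt_gain)
    per_copy = combined / 2.0
    return per_copy, combined


if __name__ == "__main__":
    base = 1_000_000  # made-up keys/sec for one copy running alone
    for gain in (0.0, 0.1, 0.3):
        per_copy, combined = smt_rates(base, gain)
        print(f"gain={gain:.1f}  per copy={per_copy:,.0f}  combined={combined:,.0f}")
```

With smt_gain = 0 each copy reports half the single-thread rate while the combined rate is unchanged, so a "half speed" figure only indicates a real slowdown if it is the combined rate rather than the per-copy rate.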
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.