Computer Chess Club Archives


Subject: Try RC5 w/ HT

Author: Aaron Gordon

Date: 20:29:25 05/22/03



On May 22, 2003 at 22:24:29, Robert Hyatt wrote:

>On May 22, 2003 at 13:43:55, Tom Kerrigan wrote:
>
>>On May 21, 2003 at 22:20:57, Robert Hyatt wrote:
>>
>>>On May 21, 2003 at 15:48:46, Tom Kerrigan wrote:
>>>
>>>>On May 21, 2003 at 13:46:26, Robert Hyatt wrote:
>>>>
>>>>>On May 20, 2003 at 13:52:01, Tom Kerrigan wrote:
>>>>>
>>>>>>On May 20, 2003 at 00:26:49, Robert Hyatt wrote:
>>>>>>
>>>>>>>Actually it _does_ surprise me.  The basic idea is that HT provides improved
>>>>>>>resource utilization within the CPU.  IE would you prefer to have a dual 600mhz
>>>>>>>or a single 1000mhz machine?  I'd generally prefer the dual 600, although for
>>>>>>
>>>>>>You're oversimplifying HT. When HT is running two threads, each thread only gets
>>>>>>half of the core's resources. So instead of your 1GHz vs. dual 600MHz situation,
>>>>>>what you have is more like a 1GHz Pentium 4 vs. a dual 1GHz Pentium. The dual
>>>>>>will usually be faster, but in many cases it will be slower, sometimes by a wide
>>>>>>margin.
>>>>>
>>>>>Not quite.  Otherwise how do you explain my NPS _increase_ when using a second
>>>>>thread on a single physical cpu?
>>>>>
>>>>>The issue is that now things can be overlapped and more of the CPU core
>>>>>gets utilized for a greater percent of the total run-time...
>>>>>
>>>>>If it were just 50-50 then there would be _zero_ improvement for perfect
>>>>>algorithms, and a negative improvement for any algorithm with any overhead
>>>>>whatsoever...
>>>>>
>>>>>And the 50-50 doesn't even hold true for all cases, as my test results have
>>>>>shown, even though I have yet to find any reason for what is going on...
>>>>
>>>>Think a little bit before posting, Bob. I said that the chip's execution
>>>>resources were evenly split, I didn't say that the chip's performance is evenly
>>>>split. That's just stupid. You have to figure in how those execution resources
>>>>are utilized and understand that adding more of these resources gives you
>>>>diminishing returns.
>>>>
>>>>-Tom
>>>
>>>
>>>You should follow your own advice.  If resources are split "50-50" then how
>>>can _my_ program produce a 70-30 split on occasion?
>>>
>>>It simply is _not_ possible.
>>>
>>>There is more to this than a simple explanation offers...
>>
>>Now you're getting off onto another topic here.
>>
>
>Read backward.  _I_ did not "change the topic".
>
>I said that I don't see how it is possible for HT to slow a program down.
>
>You said "50-50" resource allocation might be an explanation.
>
>I said "that doesn't seem plausible because I have at least one example of
>two compute-bound threads that don't show a 50-50 balance on SMT."
>
>If Eugene is right, and I don't know as he was not sure and I haven't read
>anything similar to what he mentioned, that _could_ explain it (ie if some
>resources are split 50-50 between the two logical processors even if one
>could use more than the other due to the particular application being run.
>However that seems like a _bad_ design decision if it is true...)  However
>there are probably other plausible explanations as well.  What is the _real_
>explanation?  That will likely take some time to figure out.
>
>
>>Originally you were saying that it's impossible for HT to slow a program down
>>unless there was something wrong with the algorithm.
>
>And based on testing here, I pretty well stick with that.  I won't say there
>is _no_ program that will run slower, but I haven't found one myself.  And
>again, to be clear, we are talking about one program, one thread.  Run on
>a machine with SMT on and SMT off.  I've run that test repeatedly and can't
>find any penalty for one thread when turning SMT on.  And I do mean _no
>penalty_ on anything I have tried.  Kernel builds.  Compiles.  Running
>Crafty.  Running various compute-bound applications like NAMD, a big monte-carlo
>simulation, etc...
>
>The idea really doesn't make sense, IMHO.
>
>
>>
>>Now you're back to complaining about your 70-30 split, which is only related to
>>the original topic because they both involve ratios like "50-50" and "70-30."
>
>That 70-30 was used simply to suggest that 50-50 is _not_ a "golden rule" in
>SMT resource allocation, apparently.  Nothing more.
>
>
>
>
>>
>>-Tom


Hyatt, grab distributed.net's RC5-72 client. It supports multiple CPUs, and on
every dual system I've seen it run on, it gets an exact 100% increase in
nodes/second. It only spawns one thread per processor and isn't memory
intensive whatsoever (from what I've seen, only CPU clock speed affects results). A
P4 with HT gets HALF the speed of a P4 without HT in some of the results I've seen;
if you get the time, try to verify that for me. I would have figured this would
be one of the programs HT would shine at. Complete surprise to me...  If
you could, grab the Linux RC5-72 client at:

ftp://ftp.distributed.net/pub/dcti/current-client/dnetc-linux-x86-elf.tar.gz

For those of you interested in running it on Windows, here is the Windows binary:
ftp://ftp.distributed.net/pub/dcti/current-client/dnetc-win32-x86.zip

To run the benchmark, all you need to do is type ./dnetc -benchmark
The benchmark only uses one processor. You can instead configure the client to
display nodes & keys/sec as the "live rate"; that mode uses all of the
processors (automatically). Here are the config files to test RC5 and OGR.

dnetc.ini for RC5-72 showing the 'live rate'
[parameters]
id=test@test.com

[misc]
project-priority=RC5-72,OGR=0

[display]
progress-indicator=rate


dnetc.ini for OGR-25 showing the 'live rate'
[parameters]
id=test@test.com

[misc]
project-priority=OGR,RC5-72=0

[display]
progress-indicator=rate
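
If you'd rather generate these files than edit them by hand, here is a minimal Python sketch. It only reproduces the three sections and keys shown in the RC5-72 file above; the function name make_dnetc_ini is made up for illustration, and I'm assuming the client doesn't care about anything beyond what appears in those files.

```python
import configparser
import io


def make_dnetc_ini(first_project: str, email: str = "test@test.com") -> str:
    """Return dnetc.ini text that runs `first_project` and shows the live rate.

    Hypothetical helper: it only mirrors the keys seen in the post.
    """
    other = "OGR" if first_project == "RC5-72" else "RC5-72"
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["parameters"] = {"id": email}
    # Preferred project first, the other project set to 0, as in the post.
    cfg["misc"] = {"project-priority": f"{first_project},{other}=0"}
    cfg["display"] = {"progress-indicator": "rate"}
    buf = io.StringIO()
    # Write "key=value" without spaces, matching the files in the post.
    cfg.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    print(make_dnetc_ini("RC5-72"))
    print(make_dnetc_ini("OGR"))
```

The only thing that differs between the two files is the project-priority line, so switching tests is a one-argument change.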


From what I understand, RC5/OGR uses mostly shifting, and from what I've seen the
P4 is extremely slow at that, and HT may further hinder shifting. Just a guess
anyway. If you'd like some results to compare to, here are some of my Win2k
results (slightly slower than the Linux binary under Red Hat 9):


[Apr 03 04:35:29 UTC] RC5-72: Benchmark for core #5 (SS 2-pipe)
                      0.00:00:17.79 [8,235,738 keys/sec]

[Apr 03 04:36:20 UTC] OGR: Benchmark for core #0 (GARSP 5.13-A)
                      0.00:00:16.98 [19,330,517 nodes/sec]

This is a single Athlon XP at 2507MHz.
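
For anyone comparing HT-on and HT-off runs, here is a rough Python sketch that pulls the rate out of benchmark lines in the format shown above and computes the ratio between two runs. The function names (parse_rate, ht_ratio, rate_per_mhz) are hypothetical, and the regex assumes the rate always appears as "[N keys/sec]" or "[N nodes/sec]" as in my output; other client versions may print it differently.

```python
import re
from typing import Tuple

# Matches the bracketed rate, e.g. "[8,235,738 keys/sec]".
# The "[Apr 03 ... UTC]" timestamp does not match, since it has no digits-then-unit form.
RATE_RE = re.compile(r"\[([\d,]+) (keys|nodes)/sec\]")


def parse_rate(line: str) -> Tuple[int, str]:
    """Return (rate, unit) from one benchmark output line."""
    m = RATE_RE.search(line)
    if m is None:
        raise ValueError("no rate found in line: %r" % line)
    return int(m.group(1).replace(",", "")), m.group(2)


def ht_ratio(rate_with_ht: float, rate_without_ht: float) -> float:
    """Rate with HT divided by rate without HT; 0.5 would mean 'half the speed'."""
    return rate_with_ht / rate_without_ht


def rate_per_mhz(rate: float, mhz: float) -> float:
    """Normalize by clock speed so different CPUs can be compared roughly."""
    return rate / mhz


if __name__ == "__main__":
    rc5, unit = parse_rate("0.00:00:17.79 [8,235,738 keys/sec]")
    print(rc5, unit, "per MHz:", round(rate_per_mhz(rc5, 2507), 1))
```

Feed ht_ratio the rates from your own HT-on and HT-off runs; the per-MHz figure is only a crude normalization and rests on my observation that clock speed is what drives the results.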




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.