Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Vincent Diepeveen

Date: 12:43:52 12/13/02


On December 13, 2002 at 14:33:47, Robert Hyatt wrote:

>On December 13, 2002 at 14:08:29, Vincent Diepeveen wrote:
>
>>hello,
>>
>>Here some testresults of DIEP thanks to Chad Cowan at an
>>asus motherboard with HT turned on (amazingly no longer
>>SMT called, i forgot which manufacturer calls it HT and
>>which one SMT. I guess it's Hyperthreading now for intel).
>
>It is _both_.  SMT and HT.  You can find either term listed on Intel's
>web site.
>
>
>>
>>HT turned on in all cases:
>>
>>bus 533Mhz memory 133Mhz (DDR SDRAM cas2)
>>single cpu P4 3.105Ghz (bus 135 Mhz by default, not 133) : 101394
>>single cpu P4 3.105Ghz now 2 processes DIEP              : 120095
>>
>>So speedup like 18% for HT. Not bad. Not good either, knowing diep
>>hardly locks.
>
>It isn't just a lock issue.  If both threads are banging on memory, it can't
>run much faster, as it still serializes the memory reads and they are slow.
>
>
>Hmm.. Aren't you the same person that was saying "hyper-threading doesn't
>work" and "hyper-threading only works on machines that won't be available
>for 1-2 years in the future"??  And that "Nalimov is running on a machine
>that nobody can buy"  and "the 2.8ghz xeon doesn't support hyper-threading"?
>and so forth???

Intel's marketing department has been busy with an SMT campaign for 2 years already.
Only now are there a few 2.8Ghz Xeons and 3Ghz P4s in the USA.

They say only the 3Ghz P4s and Xeon MPs have HT/SMT support, whereas
the old SMT worked badly for me and DIEP.

Also, an 18% speedup isn't much, knowing their cpu is already quite a bit slower
than the K7.
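
For reference, the 18% figure can be reproduced from the node counts quoted in this thread. This is only a rough sketch: it assumes the logged numbers are nodes per second, and it sets the HT gain next to the gain from a second physical K7 cpu (K7 figures quoted further down) for comparison. The variable names are made up for the example.

```python
# Node counts copied from the quoted test results (assumed to be nodes/second).
P4_ONE_PROCESS = 101394    # P4 3.105Ghz, HT on, 1 DIEP process
P4_TWO_PROCESSES = 120095  # same P4, 2 DIEP processes on 1 physical cpu
K7_SINGLE = 82499          # K7 1.6Ghz, single cpu
K7_DUAL = 154293           # dual K7 1.6Ghz


def speedup_percent(before, after):
    """Percentage gain in node count going from `before` to `after`."""
    return (after / before - 1.0) * 100.0


ht_gain = speedup_percent(P4_ONE_PROCESS, P4_TWO_PROCESSES)
dual_gain = speedup_percent(K7_SINGLE, K7_DUAL)

print(f"HT, second logical cpu : +{ht_gain:.1f}%")
print(f"K7, second physical cpu: +{dual_gain:.1f}%")
```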

Would WaitForSingleObject() in Windows speed things up more than hammering on
the same cache line?

It means I lose some 2ms per process, though, to wake it up when it's asleep.

Seemingly not a holy grail solution either.
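
The two waiting styles can be sketched side by side. This is an illustration only, not DIEP code: Python's threading.Event stands in for a Win32 event handle waited on with WaitForSingleObject(), a plain list element stands in for the shared flag that gets polled, and the 2ms wake-up cost mentioned above is not measured here.

```python
import threading
import time

work_ready = threading.Event()  # stand-in for a Win32 event handle
shared_flag = [False]           # stand-in for a flag polled in shared memory


def spin_wait(limit_s=2.0):
    """Busy-wait on the flag: no wake-up latency, but the waiter keeps the cpu busy."""
    polls = 0
    deadline = time.monotonic() + limit_s  # safety limit for the example
    while not shared_flag[0] and time.monotonic() < deadline:
        polls += 1
    return polls


def blocking_wait(timeout_s=1.0):
    """Sleep until signalled: the cpu is free, but the OS has to wake the waiter."""
    return work_ready.wait(timeout_s)


def producer(delay_s=0.05):
    """Hypothetical other search process handing over work after a short delay."""
    time.sleep(delay_s)
    shared_flag[0] = True
    work_ready.set()


worker = threading.Thread(target=producer)
worker.start()
polls = spin_wait()
signalled = blocking_wait()
worker.join()

print(f"spin wait polled the flag {polls} times before work arrived")
print(f"blocking wait was signalled: {signalled}")
```

With HT the trade-off matters more than on a real dual, since a spinning logical cpu competes with its sibling for the same execution resources.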

>
>
>>
>>However there is 1 problem i have with it when i compare that speed
>>of the same version with 2.4Ghz northwood.
>>
>>That 2.4Ghz is exactly the speed of a K7 at 1.6ghz
>>
>>Now the same K7 same version logs:
>>    single cpu : 82499
>>    dual       : 154293
>>
>>Note that the k7 has way way slower RAM and chipset. 133Mhz registered cas 2.5
>>i guess versus fast cas 2 (like 2T less for latency, so 10 versus 12T or
>>something) for the P4. The P4 was a single cpu.
>>
>>but here the math for those who still read here that's interesting to
>>hear.
>>
>>Single cpu speed difference is:
>>  P4 3.06Ghz is faster : 22.9%
>>
>>Based upon the speed where it is clocked at (3105Mhz)
>>we would expect a speedup of 3.105 / 2.4 = 29.4%
>>
>>So somehow we lose around 7% in the process.
>
>Memory is no faster.  So there is going to be a loss every time the cpu clock is
>ramped up a notch.  Always has been.  Always will be until DRAM disappears.

Yes, the i/o bottleneck will become more and more of a problem.
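
The size of that loss can be checked from the quoted figures. A small sketch, assuming (as the quoted post does) that the K7 1.6Ghz node count also stands for a 2.4Ghz Northwood and that node counts would otherwise scale linearly with clock:

```python
P4_CLOCK_GHZ = 3.105         # actual clock of the "3.06Ghz" P4 in the test
NORTHWOOD_CLOCK_GHZ = 2.4    # P4 Northwood said to equal a K7 1.6Ghz
P4_NODES = 101394            # P4 3.105Ghz, single process
NORTHWOOD_EQUIV_NODES = 82499  # K7 1.6Ghz single cpu, used as the 2.4Ghz baseline

clock_ratio = P4_CLOCK_GHZ / NORTHWOOD_CLOCK_GHZ
speed_ratio = P4_NODES / NORTHWOOD_EQUIV_NODES

expected_gain = (clock_ratio - 1.0) * 100.0   # gain if speed followed the clock
observed_gain = (speed_ratio - 1.0) * 100.0   # gain actually measured
shortfall_points = expected_gain - observed_gain
scaling_efficiency = speed_ratio / clock_ratio

print(f"expected from clock: +{expected_gain:.1f}%")
print(f"observed           : +{observed_gain:.1f}%")
print(f"shortfall          : {shortfall_points:.1f} percentage points")
print(f"scaling efficiency : {scaling_efficiency:.3f}")
```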

>>
>>Now it wins another 18% or so when it gets run with 2 processes.
>>If i compare that with a single cpu K7 to get the relative
>>speed of a P4 Ghz versus a K7 Ghz then we get next compare:
>>
>>1.6Ghz * (120k / 82k) = 2.33Ghz
>>
>>so a 2.33Ghz K7 should be equally fast to a P4 at such a speed.
>>Of course assuming linearly scaling.
>>
>>Now we calculate what 1Ghz K7 compares to in speed with P4: 1.33
>>
>>So DDR ram proves to be the big winner for the P4. SMT in itself
>>is just a trick that works for me because my parallellism is
>>pretty ok and most likely not for everyone.
>
>Works just fine for me too, as I have already reported and as has Eugene
>and others...

Initially I thought the P4 would suck forever, but I didn't realize the
major negative impact RDRAM had at the time.
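
The K7-equivalent clock estimate quoted above can be reproduced as follows. It is a sketch under the same assumption as the quoted post (linear scaling of node count with clock). The 1.4 ratio used for the dual Xeon is the assumed value from the quoted post, not a measurement.

```python
K7_CLOCK_GHZ = 1.6
K7_NODES = 82499        # K7 1.6Ghz, single cpu
P4_HT_NODES = 120095    # P4 3.105Ghz, 2 DIEP processes with HT
P4_CLOCK_GHZ = 3.105

# Clock a single K7 would need to match the P4 with HT, assuming linear scaling.
k7_equivalent_ghz = K7_CLOCK_GHZ * P4_HT_NODES / K7_NODES

# How many P4 Ghz it takes to match 1 K7 Ghz.
p4_ghz_per_k7_ghz = P4_CLOCK_GHZ / k7_equivalent_ghz

# Same idea for a Xeon 2.8Ghz, with the assumed (worse) ratio of 1.4
# for ECC registered memory.
ASSUMED_XEON_RATIO = 1.4
xeon_k7_equivalent_ghz = 2.8 / ASSUMED_XEON_RATIO

print(f"K7 equivalent of P4 3.105Ghz + HT: {k7_equivalent_ghz:.2f}Ghz")
print(f"P4 Ghz per K7 Ghz                : {p4_ghz_per_k7_ghz:.2f}")
print(f"K7 equivalent of Xeon 2.8Ghz     : {xeon_k7_equivalent_ghz:.1f}Ghz")
```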

>>
>>Now of course it's questionable whether that 18% speedup in nodes
>>a second also results in actual positive speedup in plydepth.
>>
>>For DIEP it is, but it's not so impressive at all.
>
>Nope.  but that is not a processor issue, that is a search issue.  The _cpu_ is
>faster with SMT on.  Just because a chess engine can't use that very well
>doesn't mean that other applications without the search overhead issue won't
>benefit, and in fact they do benefit pretty well...

DIEP is making perfect use of it. Of course it's evaluating most of the time,
and if that code is not in the trace cache, then the processor has
the problem that it can only serve 1 process at a time.

>The interesting thing I have noted is that the SMT benefit just about offsets my
>parallel search overhead for the typical case.   If I run a single thread on my
>2.8 xeon, I get a search time of X.  If I run four threads, to use both cpus
>with HT enabled, I get a search time of very close to X/2.  The 20-30% speedup
>by HT is just about what it takes to offset the

20-30% is quite an overestimation of the speedup from HT. You gotta improve
hashtable management a bit then, like 4 probes within the same cache line
instead of a 2 table approach.
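
A minimal sketch of the 4-probes-in-one-line idea, for illustration only. It is neither DIEP's nor Crafty's actual table. It assumes 4 entries of 16 bytes each would fill one 64-byte cache line in a C implementation, and it uses a replace-the-shallowest-entry rule picked just for this example. Python lists have no cache-line layout, so only the indexing scheme carries over.

```python
ENTRIES_PER_BUCKET = 4  # assumption: 4 x 16-byte entries = one 64-byte cache line


class BucketedTable:
    """Single hash table where all probes for a key stay inside one bucket."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        # Each slot is None or a (key, depth, score) tuple.
        self.slots = [None] * (num_buckets * ENTRIES_PER_BUCKET)

    def _bucket_start(self, key):
        return (key % self.num_buckets) * ENTRIES_PER_BUCKET

    def probe(self, key):
        """Check the 4 neighbouring slots; in C this touches one cache line."""
        start = self._bucket_start(key)
        for i in range(start, start + ENTRIES_PER_BUCKET):
            entry = self.slots[i]
            if entry is not None and entry[0] == key:
                return entry
        return None

    def store(self, key, depth, score):
        """Reuse an empty or same-key slot, else overwrite the shallowest entry."""
        start = self._bucket_start(key)
        victim = start
        for i in range(start, start + ENTRIES_PER_BUCKET):
            entry = self.slots[i]
            if entry is None or entry[0] == key:
                victim = i
                break
            if entry[1] < self.slots[victim][1]:
                victim = i
        self.slots[victim] = (key, depth, score)


table = BucketedTable(num_buckets=8)
for key, depth in [(3, 5), (11, 2), (19, 7), (27, 4)]:  # all land in bucket 3
    table.store(key, depth, score=0)
table.store(35, 9, score=0)  # bucket full: the depth-2 entry (key 11) is replaced

print(table.probe(19))  # still present
print(table.probe(11))  # replaced, so not found
```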

Crafty needs 1600 clocks a node on average (K7 timing, a bit more for P4
but same idea).

Each RAM reference (DDR RAM) costs you like 400 clocks.

If you use 1 table you lose 400 clocks and can do 4 probes in it.

If you use 2 tables as you do now you lose 800 clocks.
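
Put as arithmetic, using only the rough figures above. The sketch assumes every table probe misses the caches and that the 1600 clocks a node already includes the memory stalls. Both are simplifications made for this example.

```python
CLOCKS_PER_NODE = 1600     # Crafty average per node, K7 timing (figure from above)
CLOCKS_PER_RAM_REF = 400   # cost of one DDR RAM reference (figure from above)


def hash_stall(tables_probed):
    """Return (stall clocks, share of the node time) for a number of table lookups."""
    stall = tables_probed * CLOCKS_PER_RAM_REF
    return stall, stall / CLOCKS_PER_NODE


one_table = hash_stall(1)   # 4 probes inside one cache line: 1 RAM reference
two_tables = hash_stall(2)  # 2 separate tables: 2 RAM references

print(f"1 table : {one_table[0]} clocks, {one_table[1]:.0%} of a node")
print(f"2 tables: {two_tables[0]} clocks, {two_tables[1]:.0%} of a node")
```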

It is obvious that if the 2 extra threads in Crafty can save a few
references to the DDR RAM and get the data from the L2 cache instead, that's
a pretty important basis for the nodes a second win in Crafty when using
mt=4 on a dual Xeon.

>extra search overhead caused by the extra processor.  Which means that for the
>time being, it is possible to search almost exactly twice as fast using two
>cpus, although this comparison is not exactly a correct way to compare things.

No, it's not possible to search 2 times faster. The problem is that the
L1 cache and the trace cache are too small and can't even feed 1
processor when decoding instructions, not to mention 2.

Apart from that there are other technical problems when both processors
want something.

But an 18% speedup from it is better than nothing.

Too bad that because of that 18% speedup the processor is 2 times more
expensive.

>>
>>Because a dual Xeon 2.8Ghz which i will assume also having a compare
>>of 1.4 then (assuming not cas2 ddr ram but of course ecc registered
>>which eats extra time)
>
>However the xeon has 2-way memory interleaving which runs the bandwidth way up
>compared to the desktop PIV system.

2-way memory interleaving for 4 processes doesn't kick butt.

The problem is you lose time to the ECC and registered features of the
memory you need for the dual. Of course that's the case for all duals.
Both K7 MP and Xeon suffer from that, regrettably.

A result is that single cpu tests can be carried out much faster in general.

>
>>
>>That means that the equivalent K7 will be a dual K7 2.0Ghz, thereby
>>still not taking into account 3 things
>>
>>  a) my diep version was msvc compiled with processorpack (sp4)
>>     so it was simply not optimized for K7 at all, but more for p4
>>     than it was optimized for K7. Not using MMX of course (would
>>     slow down on P4 and let the K7 look relatively better).
>>  b) speedup at 4 processors is a lot worse than at 2 processors
>>     so when i run diep with 4 processes at the dual Xeon 2.8
>>     the expectation is that the K7 dual 2.0 Ghz will outgun it
>>     by quite some margin.
>>  c) that dual k7 2.0Ghz is less than half the price of a dual P4 2.8Ghz
>>
>
>There are no dual PIV's at the moment.  Only dual xeons.  Xeons are _not_
>PIV's....  For several reasons that can be found on the Intel web site.  That's why

They also fit in different sockets, for example: 603 for the Xeon and 478 or
something for the P4.

>xeons are considered to be their "server class chips" while the PIV is their
>"desktop class chip".

The core is the same, however. So a 3.06Ghz Xeon, when it gets released, won't
be faster when put as a single cpu in a mainboard than a P4 3.06Ghz.

With some luck, by the time they release a 3.06Ghz Xeon they will have improved
the SMT a bit more.

Seems to me they have been working for years to slowly get that SMT/HT working better.

>
>
>>Best regards,
>>Vincent



Last modified: Thu, 15 Apr 21 08:11:13 -0700
