Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Eugene Nalimov

Date: 20:19:22 12/13/02


The P4 cache line size is 128 bytes for the L2 cache and 64 bytes for the L1 cache.

Thanks,
Eugene

On December 13, 2002 at 22:56:17, Robert Hyatt wrote:

>On December 13, 2002 at 16:41:07, Vincent Diepeveen wrote:
>
>>On December 13, 2002 at 16:03:47, Robert Hyatt wrote:
>>
>>Your math is wrong for many reasons.
>>
>>0) it isn't 32 bytes but 64 bytes that you get at once,
>>   guaranteed.
>
>Depends on the processor.  For PIII and earlier, it is _32_ bytes.  For
>the PIV it is 128 bytes.  I think AMD is 64...
>
>
>>1) you can guarantee that you need just one cache line
>
>Yes you can, by making sure you probe to a starting address that is
>divisible by the cache line size exactly.  Are you doing that?  Are
>you sure your table is initially aligned on a multiple of cache line
>size?  Didn't think so.  You can't control malloc() that well yet...
>And it isn't smart enough to know it should do that, particularly when
>the alignment is processor dependent.
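
As an illustration of the alignment point above, here is a minimal C sketch (not Crafty's or DIEP's actual code) of over-allocating with malloc() and rounding the table base up to a cache-line boundary; the 64-byte CACHE_LINE constant and the 16-byte HashEntry layout are assumptions, not taken from either engine:

    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_LINE 64          /* assumed line size; 32/64/128 depending on the CPU */

    typedef struct {               /* hypothetical 16-byte hash entry */
        uint64_t key;
        uint64_t data;
    } HashEntry;

    /* Allocate 'count' entries so the first entry starts on a cache-line
       boundary.  The raw malloc() pointer is returned via *raw for free(). */
    HashEntry *alloc_aligned_table(size_t count, void **raw) {
        *raw = malloc(count * sizeof(HashEntry) + CACHE_LINE - 1);
        if (*raw == NULL)
            return NULL;
        uintptr_t p = (uintptr_t)*raw;
        p = (p + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);   /* round up */
        return (HashEntry *)p;
    }
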
>
>
>
>
>>2) even if you need 2, then those aren't 400 clocks per
>>   cache line each; the first one is 400 and the second
>>   one is only a very small part of that (consider the
>>   bandwidth the memory delivers!)
>
>Try again.  You burst-load one cache line and that is all.  The first 8
>bytes come across after a long latency.  The rest of the line bursts in
>quickly.  For the next cache miss, back you go to the long latency.
>
>
>>3) you get more transpositions from 4 probes. It works better than 2,
>>   in fact a *lot* better.
>
>I don't think a "lot" better.  I ran this test.  The first Crafty versions
>used an N-probe approach, as that was what I did in Cray Blitz.  Multiple
>probes are somewhat better.  But not "a lot" better.  And the memory bandwidth
>hurts.  Probing consecutive addresses is bad from a theoretical hashing point
>of view as it leads to "chaining".  Probing non-consecutive addresses is bad
>from a cache pre-fetch point of view.
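
For readers following the argument, here is a minimal sketch of the two probing schemes under discussion, reusing the hypothetical HashEntry and aligned table from the sketch above (illustrative only; neither is the actual Crafty or DIEP code):

    #define BUCKET_SIZE 4   /* 4 x 16-byte entries = one assumed 64-byte cache line */

    /* Scheme A: probe a 4-entry bucket that starts on a cache-line boundary,
       so all four candidate entries arrive with a single memory access. */
    HashEntry *probe_bucket(HashEntry *table, size_t buckets, uint64_t key) {
        HashEntry *b = table + (key % buckets) * BUCKET_SIZE;
        for (int i = 0; i < BUCKET_SIZE; i++)
            if (b[i].key == key)
                return &b[i];
        return NULL;
    }

    /* Scheme B: one probe into each of two separate tables (for example a
       depth-preferred table plus an always-replace table), which normally
       costs two independent cache-line fills. */
    HashEntry *probe_two_tables(HashEntry *t1, size_t n1,
                                HashEntry *t2, size_t n2, uint64_t key) {
        HashEntry *e = &t1[key % n1];
        if (e->key == key)
            return e;
        e = &t2[key % n2];
        return (e->key == key) ? e : NULL;
    }
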
>
>>
>>4) compare with the sureness of 2 very slow cache lines which you
>>   get from separate parts of memory, which is a guaranteed 800 clocks,
>>   or possibly 50% of the total system time.
>
>
>Ditto for yours.  800 clocks too...
>
>
>>
>>Best regards,
>>Vincent
>>
>>>On December 13, 2002 at 15:43:52, Vincent Diepeveen wrote:
>>>
>>>>On December 13, 2002 at 14:33:47, Robert Hyatt wrote:
>>>>
>>>>>On December 13, 2002 at 14:08:29, Vincent Diepeveen wrote:
>>>>>
>>>>>>hello,
>>>>>>
>>>>>>Here are some test results of DIEP, thanks to Chad Cowan, on an
>>>>>>Asus motherboard with HT turned on (amazingly it is no longer
>>>>>>called SMT; I forgot which manufacturer calls it HT and
>>>>>>which one SMT. I guess it's Hyper-Threading now for Intel).
>>>>>
>>>>>It is _both_.  SMT and HT.  You can find either term listed on Intel's
>>>>>web site.
>>>>>
>>>>>
>>>>>>
>>>>>>HT turned on in all cases:
>>>>>>
>>>>>>bus 533MHz, memory 133MHz (DDR SDRAM CAS2)
>>>>>>single cpu P4 3.105GHz (bus 135MHz by default, not 133) : 101394
>>>>>>single cpu P4 3.105GHz, now 2 DIEP processes            : 120095
>>>>>>
>>>>>>So the speedup is about 18% for HT. Not bad. Not good either, knowing DIEP
>>>>>>hardly locks.
>>>>>
>>>>>It isn't just a lock issue.  If both threads are banging on memory, it can't
>>>>>run much faster, as it still serializes the memory reads and they are slow.
>>>>>
>>>>>
>>>>>Hmm.. Aren't you the same person that was saying "hyper-threading doesn't
>>>>>work" and "hyper-threading only works on machines that won't be available
>>>>>for 1-2 years in the future"??  And that "Nalimov is running on a machine
>>>>>that nobody can buy"  and "the 2.8ghz xeon doesn't support hyper-threading"?
>>>>>and so forth???
>>>>
>>>>Intel's marketing department has been busy with an SMT campaign for 2 years already.
>>>>Only now are there a few 2.8GHz Xeons and 3GHz P4s in the USA.
>>>>
>>>>They say only the 3GHz P4s and Xeon MPs have HT/SMT support, whereas
>>>>the old SMT worked badly for me and DIEP.
>>>
>>>What "old SMT"??  There isn't any.
>>>
>>>
>>>>
>>>>Also, an 18% speedup isn't much, knowing their CPU is already quite a bit slower
>>>>than the K7 is.
>>>>
>>>>Would WaitForSingleObject() in Windows speed things up more than hammering on
>>>>the same cache line?
>>>
>>>
>>>Read the intel website about spinlocks.  Spinlocks are better for short-duration
>>>waits.  O/S blocking calls are better for long waits.  This has always been the
>>>case, but since I don't have any "long wait times" I only use spins.
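
For context on the spin-versus-block trade-off, here is a generic test-and-test-and-set spinlock sketch (GCC-style builtins for x86; an assumption for illustration, not Crafty's actual lock code). A blocking call such as WaitForSingleObject() replaces the spinning with a scheduler wakeup, which is why it only pays off for long waits:

    /* Minimal test-and-test-and-set spinlock. */
    typedef volatile int spinlock_t;

    static void spin_lock(spinlock_t *lock) {
        for (;;) {
            if (__sync_lock_test_and_set(lock, 1) == 0)   /* atomic exchange */
                return;                                    /* acquired the lock */
            while (*lock)                                  /* spin on plain reads */
                __asm__ __volatile__("pause");             /* HT-friendly hint */
        }
    }

    static void spin_unlock(spinlock_t *lock) {
        __sync_lock_release(lock);                         /* store 0, release semantics */
    }
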
>>>
>>>
>>>
>>>
>>>>
>>>>It means I lose some 2ms per process, though, to wake it up when it's asleep.
>>>>
>>>>Not a holy-grail solution either, seemingly.
>>>
>>>You don't have to take that penalty.  Just spin "properly".
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>However, there is one problem I have with it when I compare that speed
>>>>>>of the same version with a 2.4GHz Northwood.
>>>>>>
>>>>>>That 2.4GHz P4 is exactly as fast as a K7 at 1.6GHz.
>>>>>>
>>>>>>Now the same K7 same version logs:
>>>>>>    single cpu : 82499
>>>>>>    dual       : 154293
>>>>>>
>>>>>>Note that the K7 has way, way slower RAM and chipset: 133MHz registered CAS 2.5,
>>>>>>I guess, versus fast CAS 2 (like 2T less latency, so 10 versus 12T or
>>>>>>something) for the P4. The P4 was a single cpu.
>>>>>>
>>>>>>But here is the math, for those who still read here, that is interesting
>>>>>>to hear.
>>>>>>
>>>>>>Single cpu speed difference is:
>>>>>>  P4 3.06Ghz is faster : 22.9%
>>>>>>
>>>>>>Based upon the clock speed it is running at (3105MHz),
>>>>>>we would expect a speedup of 3.105 / 2.4 = 1.294, i.e. 29.4%.
>>>>>>
>>>>>>So somehow we lose around 7% in the process.
>>>>>
>>>>>Memory is no faster.  So there is going to be a loss every time the cpu clock is
>>>>>ramped up a notch.  Always has been.  Always will be until DRAM disappears.
>>>>
>>>>Yes, the I/O bottleneck will become more and more of a problem.
>>>>
>>>>>>
>>>>>>Now it wins another 18% or so when it gets run with 2 processes.
>>>>>>If I compare that with a single cpu K7, to get the relative
>>>>>>speed of a P4 GHz versus a K7 GHz, we get the following comparison:
>>>>>>
>>>>>>1.6Ghz * (120k / 82k) = 2.33Ghz
>>>>>>
>>>>>>so a 2.33GHz K7 should be about as fast as a P4 at that speed,
>>>>>>of course assuming linear scaling.
>>>>>>
>>>>>>Now we calculate what a 1GHz K7 corresponds to in P4 speed: 3.105 / 2.33 = 1.33GHz.
>>>>>>
>>>>>>So DDR RAM proves to be the big winner for the P4. SMT in itself
>>>>>>is just a trick that works for me because my parallelism is
>>>>>>pretty ok, and most likely it won't work for everyone.
>>>>>
>>>>>Works just fine for me too, as I have already reported and as has Eugene
>>>>>and others...
>>>>
>>>>Initially I thought the P4 would suck forever, but I didn't realize the
>>>>major negative impact RDRAM had at the time.
>>>>
>>>>>>
>>>>>>Now of course it's questionable whether that 18% speedup in nodes
>>>>>>a second also results in an actual positive speedup in ply depth.
>>>>>>
>>>>>>For DIEP it does, but it's not so impressive at all.
>>>>>
>>>>>Nope.  but that is not a processor issue, that is a search issue.  The _cpu_ is
>>>>>faster with SMT on.  Just because a chess engine can't use that very well doesn't
>>>>>mean that other applications without the search overhead issue won't benefit, and
>>>>>in fact they do benefit pretty well...
>>>>
>>>>DIEP is making perfect use of it. Of course it's evaluating most of the time,
>>>>and if that code is not in the trace cache, then the processor has
>>>>the problem that it can only serve 1 process at a time.
>>>
>>>It only serves one at a time most of the time.  It is interlacing micro-ops, but
>>>most of the time one thread is blocked waiting on a memory read, which is the
>>>slowest thing on the processor.
>>>
>>>>
>>>>>The interesting thing I have noted is that the SMT benefit just about offsets my
>>>>>parallel search overhead for the typical case.   If I run a single thread on my
>>>>>2.8 xeon, I get a search time of X.  If I run four threads, to use both cpus with
>>>>>HT enabled, I get a search time of very close to X/2.  The 20-30% speedup by HT
>>>>>is just about what it takes to offset the
>>>>
>>>>20-30% is quite an overestimation of the speedup from HT. You've got to improve
>>>>your hashtable management a bit then, like 4 probes within the same cache line
>>>>instead of a 2-table approach.
>>>
>>>Your 3-4 probe idea is not so good for reasons I mentioned already.  You are
>>>going to _average_ two cache line reads anyway, because the first table entry
>>>is not guaranteed to be on the leading edge of the cache line.  On average, you
>>>are going to load stuff that is _before_ the entry you want as well as after.
>>>And on average, with 16 byte entries, I would expect to see (with 32 byte cache
>>>lines) that one probe takes one cache miss, the second will take another cache
>>>miss 1/2 of the time.  Four probes guarantees that you are going to get at least
>>>two cache misses most of the time, which is _exactly_ what I get with two
>>>separate tables.  No loss or gain...
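
A small helper makes the expectation argument above concrete (illustrative arithmetic only, using the 16-byte entries and 32-byte lines from the example): two consecutive entries fit in one line half the time and span two lines the other half, while four consecutive entries (64 bytes) always span at least two 32-byte lines, and three if the start is not line-aligned:

    /* Number of cache lines touched by n consecutive 16-byte entries
       starting at byte offset off (off is a multiple of 16). */
    static int lines_touched(unsigned off, int n, unsigned line) {
        unsigned first = off / line;
        unsigned last  = (off + 16u * (unsigned)n - 1) / line;
        return (int)(last - first + 1);
    }

    /* With 32-byte lines:
         lines_touched( 0, 2, 32) == 1     lines_touched(16, 2, 32) == 2
         lines_touched( 0, 4, 32) == 2     lines_touched(16, 4, 32) == 3  */
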
>>>
>>>
>>>
>>>
>>>>
>>>>Crafty needs 1600 clocks a node on average (K7 timing, a bit more for the P4,
>>>>but same idea).
>>>>
>>>>For each RAM reference (DDR RAM) you lose like 400 clocks.
>>>>
>>>>If you use 1 table you lose 400 clocks and can do 4 probes in it.
>>>
>>>No you can't for the reason I gave above.  Unless you force your hash probe
>>>address to be a multiple of the cache line size, so that you always get the
>>>first entry you want in the beginning of the cache line.
>>>
>>>
>>>>
>>>>If you use 2 tables, as you do now, you use 800 clocks.
>>>>
>>>>It is trivial that if the 2 extra threads in Crafty can save a few
>>>>references to the DDR RAM and get the data from the L2 cache instead, that's
>>>>a pretty important basis for a nodes-per-second win in Crafty when using
>>>>mt=4 on a dual Xeon.
>>>
>>>
>>>I can turn off the second probe and my nps doesn't increase much.  I'll post the
>>>data later tonight.  That simply means that the second probe is not a problem.
>>>And the way the pipeline works, both loads are sent off back-to-back so that
>>>there is not a clean 400 clock wait for the first and another 400 clock wait for
>>>the second.
>>>
>>>
>>>>
>>>>>extra search overhead caused by the extra processor.  Which means that for the
>>>>>time being, it is possible to search almost exactly twice as fast using two cpus,
>>>>>although this comparison is not exactly a correct way to compare things.
>>>>
>>>>No, it's not possible to search 2 times faster. The problem is that the
>>>>L1 cache and the trace cache are too small and it can't even feed 1
>>>>processor when decoding instructions, not to mention 2.
>>>
>>>It is _definitely_ possible.  I'll post the data as I have already done this
>>>once...  The drawback is that while it is searching 2x faster, it is using 4
>>>threads, so that 2x is not exactly a fair comparison.
>>>
>>>
>>>>
>>>>Apart from that, there are other technical problems when both processors
>>>>want something.
>>>
>>>So?  This is common in operating system process scheduling also...
>>>
>>>
>>>>
>>>>But an 18% speedup from it is better than nothing.
>>>>
>>>>Too bad that because of that 18% speedup the processor is 2 times more
>>>>expensive.
>>>
>>>
>>>The 2.8 xeons are going for around $450.00...
>>>
>>>
>>>>
>>>
>>>
>>>>>>
>>>>>>Because a dual Xeon 2.8GHz, which I will assume also has a comparison
>>>>>>factor of 1.4 (assuming not CAS2 DDR RAM but of course ECC registered
>>>>>>memory, which eats extra time)...
>>>>>
>>>>>However the xeon has 2-way memory interleaving which runs the bandwidth way up
>>>>>compared to the desktop PIV system.
>>>>
>>>>2-way memory interleaving for 4 processes doesn't kick butt.
>>>
>>>It does better than no interleaving, by a factor of 2.0...
>>>
>>>
>>>
>>>
>>>>
>>>>The problem is you lose time to the ECC and registered features of the
>>>>memory you need for the dual. Of course that's the case for all duals;
>>>>both the K7 MP and the Xeon regrettably suffer from that.
>>>
>>>That is not true.  The duals do _not_ have to have ECC ram.  And it doesn't
>>>appear to be any slower than non-ECC ram although I will be able to test that
>>>before long as we have some non-ECC machines coming in.
>>>
>>>
>>>
>>>>
>>>>A result is that single cpu tests can be carried out much faster in general.
>>>>
>>>>>
>>>>>>
>>>>>>That means that the equivalent K7 will be a dual K7 2.0Ghz, thereby
>>>>>>still not taking into account 3 things
>>>>>>
>>>>>>  a) my DIEP version was compiled with MSVC with the processor pack (SP4),
>>>>>>     so it was simply not optimized for the K7 at all, but rather more for
>>>>>>     the P4. Not using MMX of course (that would slow down the P4 and make
>>>>>>     the K7 look relatively better).
>>>>>>  b) the speedup at 4 processors is a lot worse than at 2 processors,
>>>>>>     so when I run DIEP with 4 processes on the dual Xeon 2.8
>>>>>>     the expectation is that the dual K7 2.0GHz will outgun it
>>>>>>     by quite some margin.
>>>>>>  c) that dual k7 2.0Ghz is less than half the price of a dual P4 2.8Ghz
>>>>>>
>>>>>
>>>>>There are no dual PIV's at the moment.  Only dual xeons.  Xeons are _not_
>>>>>PIV's....  For several reasons that can be found on the Intel web site.
>>>>>That's why
>>>>
>>>>They also fit in different sockets, for example Socket 603 for the Xeon and
>>>>478 or something for the P4.
>>>
>>>xeon has 603 and 604 pin sockets to separate them.
>>>
>>>>
>>>>>xeons are considered to be their "server class chips" while the PIV is their
>>>>>"desktop class chip".
>>>>
>>>>The core is the same, however. So a 3.06GHz Xeon, when it gets released and
>>>>put as a single cpu in a mainboard, won't be any faster than a P4 3.06GHz.
>>>
>>>Probably will as the xeons use a different chipset which can support
>>>interleaving where the desktop chipsets don't.
>>>
>>>>
>>>>With some luck, by the time they release a 3.06GHz Xeon they will have improved
>>>>the SMT another bit.
>>>>
>>>>It seems to me they have been working for years to slowly get that SMT/HT
>>>>working better.
>>>
>>>Not "for years".  It was announced as a coming thing a couple of years ago and
>>>several vendors have been discussing the idea.  And they are going to increase
>>>the ratio of physical to logical cpus before long also...
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>Best regards,
>>>>>>Vincent


