Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Eugene Nalimov

Date: 22:31:13 12/13/02



On December 14, 2002 at 00:56:19, Robert Hyatt wrote:

>On December 13, 2002 at 23:27:02, Robert Hyatt wrote:
>
>>On December 13, 2002 at 23:19:22, Eugene Nalimov wrote:
>>
>>>P4 cache line size is 128 for L2 cache, 64 for L1 cache.
>>
>>Right.  But won't any cache miss that has to go to memory read 128 bytes?
>>If the data isn't in L1 and isn't in L2, the fill loads a full L2 line,
>>128 bytes.  If it is not in L1 but is in L2, then no memory read is needed
>>and the memory latency issue discussed here doesn't apply?
>>
>
>
>
>I went back to study the PIV a bit more and it appears my initial thought
>was correct.  Everything that exists in L1 cache is guaranteed to also exist
>in L2.  L2 pulls data in from memory, and it is the only cache that talks
>directly to memory.  The 20+K L1 micro-op (trace) cache and the 8K L1 data
>cache talk only to L2.  (Intel says 12K micro-ops, which seems to translate
>into roughly 21K bytes of micro-ops, just to make this comparable with the
>older PIII's 16K I and D caches (L1).  The PIII has more D-cache, but less
>I-cache.)
>
>
>The only issue is that a couple of the Linux guys once quoted the Intel PIV
>specs as saying 64 bytes/line for L2 as well as L1, which contradicts
>something I had read (I think) on the Intel web site.  Perhaps this was
>tuned for RDRAM but not done on DDR RAM.  I will hedge on this until I find
>a clear and precise answer, which I have not been able to do tonight, so far.

"Intel Pentium 4 and Intel Xeon Processor Optimization".

ftp://download.intel.com/design/Pentium4/manuals/24896607.pdf

Table 1-1 on page 1-19.

Thanks,
Eugene

>
>>
>>
>>
>>>
>>>Thanks,
>>>Eugene
>>>
>>>On December 13, 2002 at 22:56:17, Robert Hyatt wrote:
>>>
>>>>On December 13, 2002 at 16:41:07, Vincent Diepeveen wrote:
>>>>
>>>>>On December 13, 2002 at 16:03:47, Robert Hyatt wrote:
>>>>>
>>>>>your math is wrong for many reasons.
>>>>>
>>>>>0) it isn't 32 bytes but 64 bytes that you get at once,
>>>>>   guaranteed.
>>>>
>>>>Depends on the processor.  For PIII and earlier, it is _32_ bytes.  For
>>>>the PIV it is 128 bytes.  I think AMD is 64...
>>>>
>>>>
>>>>>1) you can guarantee that you need just one cache line
>>>>
>>>>Yes you can, by making sure you probe to a starting address that is
>>>>divisible by the cache line size exactly.  Are you doing that?  Are
>>>>you sure your table is initially aligned on a multiple of cache line
>>>>size?  Didn't think so.  You can't control malloc() that well yet...
>>>>And it isn't smart enough to know it should do that, particularly when
>>>>the alignment is processor dependent.
>>>>
>>>>
>>>>
>>>>
>>>>>2) even if you need 2, then those aren't 400 clocks each
>>>>>   cache line but the first one is 400 and the second
>>>>>   one is only a very small part of that (consider the
>>>>>   bandwidth the memory delivers!)
>>>>
>>>>Try again.  You burst load one cache line and that is all.  The first 8
>>>>bytes come across after a long latency.  The rest of the line bursts in
>>>>quickly.  For the next cache miss, back you go to the long latency.
>>>>
>>>>
>>>>>3) you get more transpositions from 4 probes. it works better than 2,
>>>>>   and a *lot* better.
>>>>
>>>>I don't think a "lot" better.  I ran this test.  The first Crafty versions
>>>>used an N-probe approach, as that was what I did in Cray Blitz.  Multiple
>>>>probes are somewhat better.  But not "a lot" better.  And the memory
>>>>bandwidth hurts.  Probing consecutive addresses is bad from a theoretical
>>>>hashing point of view, as it leads to "chaining".  Probing non-consecutive
>>>>addresses is bad from a cache pre-fetch point of view.
>>>>
>>>>>
>>>>>4) compare that with the certainty of 2 very slow cache line fetches,
>>>>>   which you get from separate parts of memory: a guaranteed 800 clocks,
>>>>>   or possibly 50% of the total system time.
>>>>
>>>>
>>>>Ditto for yours.  800 clocks too...
>>>>
>>>>
>>>>>
>>>>>Best regards,
>>>>>Vincent
>>>>>
>>>>>>On December 13, 2002 at 15:43:52, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On December 13, 2002 at 14:33:47, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On December 13, 2002 at 14:08:29, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>hello,
>>>>>>>>>
>>>>>>>>>Here are some test results of DIEP, thanks to Chad Cowan, on an
>>>>>>>>>Asus motherboard with HT turned on (amazingly it's no longer
>>>>>>>>>called SMT; i forgot which manufacturer calls it HT and which
>>>>>>>>>one SMT. I guess it's Hyperthreading now for Intel).
>>>>>>>>
>>>>>>>>It is _both_.  SMT and HT.  You can find either term listed on Intel's
>>>>>>>>web site.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>HT turned on in all cases:
>>>>>>>>>
>>>>>>>>>bus 533 MHz, memory 133 MHz (DDR SDRAM CAS2)
>>>>>>>>>single cpu P4 3.105 GHz (bus 135 MHz by default, not 133) : 101394
>>>>>>>>>single cpu P4 3.105 GHz, now 2 processes DIEP             : 120095
>>>>>>>>>
>>>>>>>>>So the speedup is about 18% for HT. Not bad. Not good either,
>>>>>>>>>knowing DIEP hardly locks.
>>>>>>>>
>>>>>>>>It isn't just a lock issue.  If both threads are banging on memory, it can't
>>>>>>>>run much faster, as it still serializes the memory reads and they are slow.
>>>>>>>>
>>>>>>>>
>>>>>>>>Hmm.. Aren't you the same person that was saying "hyper-threading doesn't
>>>>>>>>work" and "hyper-threading only works on machines that won't be available
>>>>>>>>for 1-2 years in the future"??  And that "Nalimov is running on a machine
>>>>>>>>that nobody can buy"  and "the 2.8ghz xeon doesn't support hyper-threading"?
>>>>>>>>and so forth???
>>>>>>>
>>>>>>>Intel's marketing department has already been busy with an SMT
>>>>>>>campaign for 2 years. Only now in the USA are there a few 2.8 GHz
>>>>>>>Xeons and 3 GHz P4s.
>>>>>>>
>>>>>>>they say only the 3 GHz P4s and Xeon MPs have HT/SMT support, whereas
>>>>>>>the old SMT worked badly for me and DIEP.
>>>>>>
>>>>>>What "old SMT"??  There isn't any.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>also an 18% speedup ain't much, knowing their cpu is already quite a
>>>>>>>bit slower than the K7 is.
>>>>>>>
>>>>>>>Would WaitForSingleObject() in Windows speed things up more than
>>>>>>>hammering on the same cache line?
>>>>>>
>>>>>>
>>>>>>Read the Intel website about spinlocks.  Spinlocks are better for
>>>>>>short-duration waits.  O/S blocking calls are better for long waits.
>>>>>>This has always been the case, but since I don't have any "long wait
>>>>>>times" I only use spins.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>It means I lose some 2 ms per process, though, to wake it up when
>>>>>>>it's asleep.
>>>>>>>
>>>>>>>Not a holy grail solution either, seemingly.
>>>>>>
>>>>>>You don't have to take that penalty.  Just spin "properly".
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>However, there is 1 problem i have with it when i compare the speed
>>>>>>>>>of the same version on a 2.4 GHz Northwood.
>>>>>>>>>
>>>>>>>>>That 2.4 GHz P4 is exactly the speed of a K7 at 1.6 GHz.
>>>>>>>>>
>>>>>>>>>Now the same K7 same version logs:
>>>>>>>>>    single cpu : 82499
>>>>>>>>>    dual       : 154293
>>>>>>>>>
>>>>>>>>>Note that the K7 has way, way slower RAM and chipset: 133 MHz
>>>>>>>>>registered CAS 2.5, i guess, versus fast CAS 2 (like 2T less
>>>>>>>>>latency, so 10 versus 12T or something) for the P4. The P4 was a
>>>>>>>>>single cpu.
>>>>>>>>>
>>>>>>>>>but here is the math that's interesting for those who still read
>>>>>>>>>here.
>>>>>>>>>
>>>>>>>>>Single cpu speed difference is:
>>>>>>>>>  P4 3.06 GHz is faster by: 22.9%
>>>>>>>>>
>>>>>>>>>Based upon the actual clock speed (3105 MHz) we would expect a
>>>>>>>>>speedup of 3.105 / 2.4 = 29.4%
>>>>>>>>>
>>>>>>>>>So somehow we lose around 7% in the process.
>>>>>>>>
>>>>>>>>Memory is no faster.  So there is going to be a loss every time the cpu clock is
>>>>>>>>ramped up a notch.  Always has been.  Always will be until DRAM disappears.
>>>>>>>
>>>>>>>Yes i/o bottleneck will be more and more getting a problem.
>>>>>>>
>>>>>>>>>
>>>>>>>>>Now it wins another 18% or so when it gets run with 2 processes.
>>>>>>>>>If i compare that with a single cpu K7 to get the relative
>>>>>>>>>speed of a P4 Ghz versus a K7 Ghz then we get next compare:
>>>>>>>>>
>>>>>>>>>1.6Ghz * (120k / 82k) = 2.33Ghz
>>>>>>>>>
>>>>>>>>>so a 2.33 GHz K7 should be equally fast as a P4 at such a speed,
>>>>>>>>>of course assuming linear scaling.
>>>>>>>>>
>>>>>>>>>Now we calculate what a 1 GHz K7 compares to in P4 speed: 1.33
>>>>>>>>>
>>>>>>>>>So DDR RAM proves to be the big winner for the P4. SMT in itself
>>>>>>>>>is just a trick that works for me because my parallelism is
>>>>>>>>>pretty ok, and most likely not for everyone.
>>>>>>>>
>>>>>>>>Works just fine for me too, as I have already reported and as has Eugene
>>>>>>>>and others...
>>>>>>>
>>>>>>>initially I thought the P4 would suck forever, but I didn't realize
>>>>>>>the major negative impact RDRAM had at the time.
>>>>>>>
>>>>>>>>>
>>>>>>>>>Now of course it's questionable whether that 18% speedup in nodes
>>>>>>>>>a second also results in actual positive speedup in plydepth.
>>>>>>>>>
>>>>>>>>>For DIEP it is, but it's not so impressive at all.
>>>>>>>>
>>>>>>>>Nope.  But that is not a processor issue, that is a search issue.  The
>>>>>>>>_cpu_ is faster with SMT on.  Just because a chess engine can't use
>>>>>>>>that very well doesn't mean that other applications without the search
>>>>>>>>overhead issue won't benefit, and in fact they do benefit pretty
>>>>>>>>well...
>>>>>>>
>>>>>>>DIEP is making perfect use of it. Of course it's evaluating most of
>>>>>>>the time, and if that code is not in the trace cache, then the
>>>>>>>processor has the problem that it can only serve 1 process at a time.
>>>>>>
>>>>>>It only serves one at a time most of the time.  It is interlacing
>>>>>>micro-ops, but most of the time one thread is blocked waiting on a
>>>>>>memory read, which is the slowest thing on the processor.
>>>>>>
>>>>>>>
>>>>>>>>The interesting thing I have noted is that the SMT benefit just about
>>>>>>>>offsets my parallel search overhead for the typical case.  If I run a
>>>>>>>>single thread on my 2.8 xeon, I get a search time of X.  If I run four
>>>>>>>>threads, to use both cpus with HT enabled, I get a search time of very
>>>>>>>>close to X/2.  The 20-30% speedup by HT is just about what it takes to
>>>>>>>>offset the
>>>>>>>
>>>>>>>20-30% is quite an overestimation of the speedup by HT. You gotta
>>>>>>>improve hashtable management a bit then, like 4 probes within the same
>>>>>>>cache line instead of a 2-table approach.
>>>>>>
>>>>>>Your 3-4 probe idea is not so good, for reasons I mentioned already.
>>>>>>You are going to _average_ two cache line reads anyway, because the
>>>>>>first table entry is not guaranteed to be on the leading edge of the
>>>>>>cache line.  On average, you are going to load stuff that is _before_
>>>>>>the entry you want as well as after.  And on average, with 16 byte
>>>>>>entries, I would expect to see (with 32 byte cache lines) that one
>>>>>>probe takes one cache miss, and the second will take another cache miss
>>>>>>1/2 of the time.  Four probes guarantees that you are going to get at
>>>>>>least two cache misses most of the time, which is _exactly_ what I get
>>>>>>with two separate tables.  No loss or gain...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>crafty needs 1600 clocks a node on average (K7 timing, a bit more for
>>>>>>>the P4, but same idea).
>>>>>>>
>>>>>>>each RAM reference (DDR RAM) costs you like 400 clocks.
>>>>>>>
>>>>>>>if you use 1 table you lose 400 clocks and can do 4 probes in it.
>>>>>>
>>>>>>No you can't, for the reason I gave above.  Unless you force your hash
>>>>>>probe address to be a multiple of the cache line size, so that you
>>>>>>always get the first entry you want at the beginning of the cache line.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>if you use 2 tables as you do now you use 800 clocks.
>>>>>>>
>>>>>>>it is trivial that if the 2 extra threads in crafty can save a few
>>>>>>>references to the DDR RAM and get the data from the L2 cache instead,
>>>>>>>that's a pretty important basis for a nodes-per-second win in crafty
>>>>>>>when using mt=4 on a dual Xeon.
>>>>>>
>>>>>>
>>>>>>I can turn off the second probe and my nps doesn't increase much.  I'll
>>>>>>post the data later tonight.  That simply means that the second probe
>>>>>>is not a problem.  And the way the pipeline works, both loads are sent
>>>>>>off back-to-back, so there is not a clean 400 clock wait for the first
>>>>>>and another 400 clock wait for the second.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>extra search overhead caused by the extra processor.  Which means
>>>>>>>>that for the time being, it is possible to search almost exactly
>>>>>>>>twice as fast using two cpus, although this comparison is not exactly
>>>>>>>>a correct way to compare things.
>>>>>>>
>>>>>>>no, it's not possible to search 2 times faster. the problem is that
>>>>>>>the L1 cache and the trace cache are too small, and it can't even feed
>>>>>>>1 processor when decoding instructions, not to mention 2.
>>>>>>
>>>>>>It is _definitely_ possible.  I'll post the data, as I have already
>>>>>>done this once...  The drawback is that while it is searching 2x
>>>>>>faster, it is using 4 threads, so that 2x is not exactly a fair
>>>>>>comparison.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>apart from that, there are other technical problems when both
>>>>>>>processors want something.
>>>>>>
>>>>>>So?  This is common in operating system process scheduling also...
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>but 18% speedup from it is better than nothing.
>>>>>>>
>>>>>>>too bad that because of that 18% speedup the processor is 2 times more
>>>>>>>expensive.
>>>>>>
>>>>>>
>>>>>>The 2.8 xeons are going for around $450.00...
>>>>>>
>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>>Because a dual Xeon 2.8 GHz, which i will assume also has a compare
>>>>>>>>>ratio of 1.4 then (assuming not CAS2 DDR RAM but of course ECC
>>>>>>>>>registered, which eats extra time)
>>>>>>>>
>>>>>>>>However the xeon has 2-way memory interleaving which runs the bandwidth way up
>>>>>>>>compared to the desktop PIV system.
>>>>>>>
>>>>>>>2 way memory interleaving for 4 processes doesn't kick butt.
>>>>>>
>>>>>>It does better than no interleaving, by a factor of 2.0...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>the problem is you lose time to the ECC and registered features of the
>>>>>>>memory you need for the dual. of course that's the case for all duals;
>>>>>>>both the K7 MP and Xeon suffer from that, regrettably.
>>>>>>
>>>>>>That is not true.  The duals do _not_ have to have ECC ram.  And it
>>>>>>doesn't appear to be any slower than non-ECC ram, although I will be
>>>>>>able to test that before long, as we have some non-ECC machines coming
>>>>>>in.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>A result is that single cpu tests can be carried out much faster in general.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>That means that the equivalent K7 will be a dual K7 2.0Ghz, thereby
>>>>>>>>>still not taking into account 3 things
>>>>>>>>>
>>>>>>>>>  a) my diep version was msvc compiled with the processor pack (sp4)
>>>>>>>>>     so it was simply not optimized for the K7 at all, but more for
>>>>>>>>>     the P4 than it was optimized for the K7. Not using MMX of
>>>>>>>>>     course (that would slow down the P4 and make the K7 look
>>>>>>>>>     relatively better).
>>>>>>>>>  b) the speedup at 4 processors is a lot worse than at 2 processors,
>>>>>>>>>     so when i run diep with 4 processes on the dual Xeon 2.8
>>>>>>>>>     the expectation is that the dual K7 2.0 GHz will outgun it
>>>>>>>>>     by quite some margin.
>>>>>>>>>  c) that dual K7 2.0 GHz is less than half the price of a dual P4
>>>>>>>>>     2.8 GHz
>>>>>>>>>
>>>>>>>>
>>>>>>>>There are no dual PIV's at the moment.  Only dual xeons.  Xeons are
>>>>>>>>_not_ PIV's....  For several reasons that can be found on the Intel
>>>>>>>>web site.  That's why
>>>>>>>
>>>>>>>they also fit in different sockets, for example: 603 for the Xeon and
>>>>>>>478 or something for the P4.
>>>>>>
>>>>>>xeon has 603 and 604 pin sockets to separate them.
>>>>>>
>>>>>>>
>>>>>>>>xeons are considered to be their "server class chips" while the PIV
>>>>>>>>is their "desktop class chip".
>>>>>>>
>>>>>>>the core is the same, however. So a 3.06 GHz Xeon, when it gets
>>>>>>>released and put single cpu in a mainboard, won't be faster than a
>>>>>>>P4 3.06 GHz.
>>>>>>
>>>>>>Probably it will, as the Xeons use a different chipset which can
>>>>>>support interleaving where the desktop chipsets don't.
>>>>>>
>>>>>>>
>>>>>>>With some luck, by the time they release a 3.06 GHz Xeon they will
>>>>>>>have improved the SMT another bit.
>>>>>>>
>>>>>>>Seems to me they have been working for years to slowly get that
>>>>>>>SMT/HT working better.
>>>>>>
>>>>>>Not "for years".  It was announced as a coming thing a couple of years
>>>>>>ago, and several vendors have been discussing the idea.  And they are
>>>>>>going to increase the ratio of physical to logical cpus before long
>>>>>>also...
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>Best regards,
>>>>>>>>>Vincent




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.