Computer Chess Club Archives



Subject: Re: DIEP NUMA SMP at P4 3.06Ghz with Hyperthreading

Author: Robert Hyatt

Date: 10:49:18 12/14/02



On December 14, 2002 at 01:31:13, Eugene Nalimov wrote:

>On December 14, 2002 at 00:56:19, Robert Hyatt wrote:
>
>>On December 13, 2002 at 23:27:02, Robert Hyatt wrote:
>>
>>>On December 13, 2002 at 23:19:22, Eugene Nalimov wrote:
>>>
>>>>P4 cache line size is 128 for L2 cache, 64 for L1 cache.
>>>
>>>Right.  But isn't any cache line miss that requires memory going to have
>>>to read 128 bytes?  If it isn't in L1, and it isn't in L2, it is going to
>>>fill the L2 line, 128 bytes.  If it is not in L1 but is in L2, then the
>>>memory read isn't needed and the latency issue for memory (as discussed
>>>here) doesn't apply?
>>>
>>
>>
>>
>>I went back to study the PIV a bit more and it appears my initial thought
>>was correct.  Everything that exists in L1 cache is guaranteed to also exist
>>in L2.  And L2 sucks in data from memory and it is the only cache that talks
>>directly to memory.  The 20+K L1 micro-op cache and the 8K L1 data cache
>>talk only to L2.  (Intel says 12K micro-ops, which seems to translate into
>>roughly 21K bytes of micro-ops, just to make this comparable with the older
>>PIII with 16K I and D cache (L1).)  The PIII has more D-cache, but less I-cache.
>>
>>
>>The only issue is that a couple of the linux guys once quoted the Intel PIV
>>specs as saying 64 bytes/line for L2 as well as L1, which contradicts
>>something I had read (I think) on the Intel web site.  Perhaps this was tuned
>>to RDRAM but not done on DDR RAM.  I will hedge on this until I find a clear
>>and precise answer, which I have not been able to do tonight, so far.
>
>"Intel Pentium 4 and Intel Xeon Processor Optimization".
>
>ftp://download.intel.com/design/Pentium4/manuals/24896607.pdf
>
>Table 1-1 on page 1-19.
>
>Thanks,
>Eugene

Thanks.

That confirms that 128 bytes get sucked out of memory on a cache line fill.

Not sure why the linux guys were confused, unless that "sector" stuff confused
them.  (Apparently it can write either half of the line back to memory if the
data is "dirty", but it doesn't have to write both halves back unless both are
dirty, which is cute.)


>
>>
>>>
>>>
>>>
>>>>
>>>>Thanks,
>>>>Eugene
>>>>
>>>>On December 13, 2002 at 22:56:17, Robert Hyatt wrote:
>>>>
>>>>>On December 13, 2002 at 16:41:07, Vincent Diepeveen wrote:
>>>>>
>>>>>>On December 13, 2002 at 16:03:47, Robert Hyatt wrote:
>>>>>>
>>>>>>your math is wrong for many reasons.
>>>>>>
>>>>>>0) it isn't 32 bytes but 64 bytes that you get at once
>>>>>>   guaranteed.
>>>>>
>>>>>Depends on the processor.  For PIII and earlier, it is _32_ bytes.  For
>>>>>the PIV it is 128 bytes.  I think AMD is 64...
>>>>>
>>>>>
>>>>>>1) you can guarantee that you need just one cache line
>>>>>
>>>>>Yes you can, by making sure you probe to a starting address that is
>>>>>divisible by the cache line size exactly.  Are you doing that?  Are
>>>>>you sure your table is initially aligned on a multiple of cache line
>>>>>size?  Didn't think so.  You can't control malloc() that well yet...
>>>>>And it isn't smart enough to know it should do that, particularly when
>>>>>the alignment is processor dependent.
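The alignment problem described here — malloc() giving no cache-line guarantee — can be worked around by over-allocating and rounding the pointer up by hand.  A minimal sketch in C; the `aligned_table` helper and the 64-byte line size are assumptions for illustration, not code from any engine:

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed L1 line size; the P4 L2 uses 128 */

/* Over-allocate by one line, keep the raw pointer for free(), and
   round the usable pointer up to the next line boundary. */
void *aligned_table(size_t bytes, void **raw)
{
    *raw = malloc(bytes + CACHE_LINE - 1);
    if (*raw == NULL)
        return NULL;
    uintptr_t p = (uintptr_t)*raw;
    p = (p + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return (void *)p;
}
```

With the table so aligned, an entry group that fits in one line is guaranteed to start at a line boundary, which is the precondition for the single-cache-line probing argued about below.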
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>2) even if you need 2, then those aren't 400 clocks each
>>>>>>   cache line but the first one is 400 and the second
>>>>>>   one is only a very small part of that (consider the
>>>>>>   bandwidth the memory delivers!)
>>>>>
>>>>>Try again.  You burst load one cache line and that is all.  The first 8
>>>>>bytes comes across after a long latency.  The rest of the line bursts in
>>>>>quickly.  For the next cache miss, back you go to the long latency.
>>>>>
>>>>>
>>>>>>3) you get more transpositions from 4 probes. it works better than 2
>>>>>>   and a *lot* better.
>>>>>
>>>>>I don't think a "lot" better.  I ran this test.  The first Crafty versions
>>>>>used a N probe approach as that was what I did in Cray Blitz.  Multiple
>>>>>probes are somewhat better.  But not "a lot" better.  And the memory bandwidth
>>>>>hurts.  Probing consecutive addresses is bad from a theoretical hashing point
>>>>>of view as it leads to "chaining".  Probing non-consecutive addresses is bad
>>>>>from a cache pre-fetch point of view.
>>>>>
>>>>>>
>>>>>>3) compare with the sureness of 2 very slow cache lines which you
>>>>>>   get from separated parts in memory which is a guaranteed 800 clocks
>>>>>>   or 50% of the total system time possibly.
>>>>>
>>>>>
>>>>>Ditto for yours.  800 clocks too...
>>>>>
>>>>>
>>>>>>
>>>>>>Best regards,
>>>>>>Vincent
>>>>>>
>>>>>>>On December 13, 2002 at 15:43:52, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>On December 13, 2002 at 14:33:47, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On December 13, 2002 at 14:08:29, Vincent Diepeveen wrote:
>>>>>>>>>
>>>>>>>>>>hello,
>>>>>>>>>>
>>>>>>>>>>Here are some test results of DIEP, thanks to Chad Cowan, on an Asus
>>>>>>>>>>motherboard with HT turned on (amazingly it's no longer called SMT;
>>>>>>>>>>i forgot which manufacturer calls it HT and which one SMT. I guess
>>>>>>>>>>it's Hyperthreading now for intel).
>>>>>>>>>
>>>>>>>>>It is _both_.  SMT and HT.  You can find either term listed on Intel's
>>>>>>>>>web site.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>HT turned on in all cases:
>>>>>>>>>>
>>>>>>>>>>bus 533Mhz memory 133Mhz (DDR SDRAM cas2)
>>>>>>>>>>single cpu P4 3.105Ghz (bus 135 Mhz by default, not 133) : 101394
>>>>>>>>>>single cpu P4 3.105Ghz now 2 processes DIEP              : 120095
>>>>>>>>>>
>>>>>>>>>>So the speedup is about 18% for HT. Not bad. Not good either, knowing
>>>>>>>>>>diep hardly locks.
>>>>>>>>>
>>>>>>>>>It isn't just a lock issue.  If both threads are banging on memory, it can't
>>>>>>>>>run much faster, as it still serializes the memory reads and they are slow.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Hmm.. Aren't you the same person that was saying "hyper-threading doesn't
>>>>>>>>>work" and "hyper-threading only works on machines that won't be available
>>>>>>>>>for 1-2 years in the future"??  And that "Nalimov is running on a machine
>>>>>>>>>that nobody can buy"  and "the 2.8ghz xeon doesn't support hyper-threading"?
>>>>>>>>>and so forth???
>>>>>>>>
>>>>>>>>intel marketing department is already 2 years busy with a SMT campaign.
>>>>>>>>Only now in the USA there are a few 2.8Ghz Xeons and 3Ghz P4s.
>>>>>>>>
>>>>>>>>they say only the 3Ghz P4s and xeon mp's have HT/smt support whereas
>>>>>>>>the old SMT worked badly for me and diep.
>>>>>>>
>>>>>>>What "old SMT"??  There isn't any.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>Also, an 18% speedup isn't much, knowing their cpu is already quite a
>>>>>>>>bit slower than the k7 is.
>>>>>>>>
>>>>>>>>Would WaitForSingleObject() in windoze speed things up more than
>>>>>>>>hammering on the same cache line?
>>>>>>>
>>>>>>>
>>>>>>>Read the intel website about spinlocks.  Spinlocks are better for
>>>>>>>short-duration waits.  O/S blocking calls are better for long waits.
>>>>>>>This has always been the case, but since I don't have any "long wait
>>>>>>>times" I only use spins.
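The short-wait/long-wait distinction can be made concrete with a minimal test-and-set spinlock.  This sketch uses C11 atomics rather than whatever Crafty actually does, so take the names (`Spinlock`, `spin_lock`, `spin_unlock`) as illustrative only:

```c
#include <stdatomic.h>

/* Minimal test-and-set spinlock.  Spinning burns cycles, which is fine
   for waits of a few hundred clocks; for waits measured in milliseconds
   an O/S blocking call (condition variable, event) is the better tool. */
typedef struct { atomic_flag held; } Spinlock;

void spin_lock(Spinlock *s)
{
    while (atomic_flag_test_and_set_explicit(&s->held, memory_order_acquire))
        ;  /* busy-wait until the previous holder releases */
}

void spin_unlock(Spinlock *s)
{
    atomic_flag_clear_explicit(&s->held, memory_order_release);
}
```

The spin costs nothing when the lock is almost always free, whereas a kernel wait/wake pair can cost on the order of a context switch each time — hence "spins for short waits, blocking calls for long ones."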
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>It means i lose some 2ms per process though to wake it up when it's asleep.
>>>>>>>>
>>>>>>>>Not a holy grail solution either seemingly.
>>>>>>>
>>>>>>>You don't have to take that penalty.  Just spin "properly".
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>However there is 1 problem i have with it when i compare that speed
>>>>>>>>>>of the same version with 2.4Ghz northwood.
>>>>>>>>>>
>>>>>>>>>>That 2.4Ghz is exactly the speed of a K7 at 1.6ghz
>>>>>>>>>>
>>>>>>>>>>Now the same K7 same version logs:
>>>>>>>>>>    single cpu : 82499
>>>>>>>>>>    dual       : 154293
>>>>>>>>>>
>>>>>>>>>>Note that the k7 has way way slower RAM and chipset. 133Mhz registered cas 2.5
>>>>>>>>>>i guess versus fast cas 2 (like 2T less for latency, so 10 versus 12T or
>>>>>>>>>>something) for the P4. The P4 was a single cpu.
>>>>>>>>>>
>>>>>>>>>>but here the math for those who still read here that's interesting to
>>>>>>>>>>hear.
>>>>>>>>>>
>>>>>>>>>>Single cpu speed difference is:
>>>>>>>>>>  P4 3.06Ghz is faster : 22.9%
>>>>>>>>>>
>>>>>>>>>>Based upon the speed where it is clocked at (3105Mhz)
>>>>>>>>>>we would expect a speedup of 3.105 / 2.4 = 29.4%
>>>>>>>>>>
>>>>>>>>>>So somehow we lose around 7% in the process.
>>>>>>>>>
>>>>>>>>>Memory is no faster.  So there is going to be a loss every time the cpu clock is
>>>>>>>>>ramped up a notch.  Always has been.  Always will be until DRAM disappears.
>>>>>>>>
>>>>>>>>Yes, the i/o bottleneck will become more and more of a problem.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Now it wins another 18% or so when it gets run with 2 processes.
>>>>>>>>>>If i compare that with a single cpu K7 to get the relative
>>>>>>>>>>speed of a P4 Ghz versus a K7 Ghz then we get next compare:
>>>>>>>>>>
>>>>>>>>>>1.6Ghz * (120k / 82k) = 2.33Ghz
>>>>>>>>>>
>>>>>>>>>>so a 2.33Ghz K7 should be equally fast to a P4 at such a speed.
>>>>>>>>>>Of course assuming linearly scaling.
>>>>>>>>>>
>>>>>>>>>>Now we calculate what 1Ghz K7 compares to in speed with P4: 1.33
>>>>>>>>>>
>>>>>>>>>>So DDR ram proves to be the big winner for the P4. SMT in itself
>>>>>>>>>>is just a trick that works for me because my parallelism is
>>>>>>>>>>pretty ok and most likely not for everyone.
>>>>>>>>>
>>>>>>>>>Works just fine for me too, as I have already reported and as has Eugene
>>>>>>>>>and others...
>>>>>>>>
>>>>>>>>initially i thought P4 would suck forever, but i didn't realize the
>>>>>>>>major negative impact RDRAM had at the time.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Now of course it's questionable whether that 18% speedup in nodes
>>>>>>>>>>a second also results in actual positive speedup in plydepth.
>>>>>>>>>>
>>>>>>>>>>For DIEP it is, but it's not so impressive at all.
>>>>>>>>>
>>>>>>>>>Nope.  But that is not a processor issue, that is a search issue.  The
>>>>>>>>>_cpu_ is faster with SMT on.  Just because a chess engine can't use
>>>>>>>>>that very well doesn't mean that other applications without the search
>>>>>>>>>overhead issue won't benefit, and in fact they do benefit pretty well...
>>>>>>>>
>>>>>>>>diep's making perfect use of it. of course it's evaluating most of the
>>>>>>>>time, and if that code is not in the trace cache, then the processor has
>>>>>>>>the problem that it can only serve 1 process at a time.
>>>>>>>
>>>>>>>It only serves one at a time most of the time.  It is interlacing
>>>>>>>micro-ops, but most of the time one thread is blocked waiting on a memory
>>>>>>>read, which is the slowest thing on the processor.
>>>>>>>
>>>>>>>>
>>>>>>>>>The interesting thing I have noted is that the SMT benefit just about
>>>>>>>>>offsets my parallel search overhead for the typical case.  If I run a
>>>>>>>>>single thread on my 2.8 xeon, I get a search time of X.  If I run four
>>>>>>>>>threads, to use both cpus with HT enabled, I get a search time of very
>>>>>>>>>close to X/2.  The 20-30% speedup by HT is just about what it takes to
>>>>>>>>>offset the
>>>>>>>>
>>>>>>>>20-30% is quite an overestimation of the speedup by HT. You gotta
>>>>>>>>improve hashtable management a bit then, like 4 probes within the same
>>>>>>>>cache line instead of a 2 table approach.
>>>>>>>
>>>>>>>Your 3-4 probe idea is not so good for reasons I mentioned already.  You
>>>>>>>are going to _average_ two cache line reads anyway, because the first
>>>>>>>table entry is not guaranteed to be on the leading edge of the cache
>>>>>>>line.  On average, you are going to load stuff that is _before_ the entry
>>>>>>>you want as well as after.  And on average, with 16 byte entries, I would
>>>>>>>expect to see (with 32 byte cache lines) that one probe takes one cache
>>>>>>>miss, and the second will take another cache miss 1/2 of the time.  Four
>>>>>>>probes guarantees that you are going to get at least two cache misses
>>>>>>>most of the time, which is _exactly_ what I get with two separate tables.
>>>>>>>No loss or gain...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>crafty needs 1600 clocks a node on average (k7 timing, bit more for p4
>>>>>>>>but same idea).
>>>>>>>>
>>>>>>>>each ram reference (ddr ram) you lose like 400 clocks.
>>>>>>>>
>>>>>>>>if you use 1 table you lose 400 clocks and can do 4 probes in it.
>>>>>>>
>>>>>>>No you can't, for the reason I gave above.  Unless you force your hash
>>>>>>>probe address to be a multiple of the cache line size, so that you always
>>>>>>>get the first entry you want at the beginning of the cache line.
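Forcing the probe to start on a line boundary amounts to masking the low bits of the hash index.  A sketch in C, where the 64-byte line, 16-byte entry, and the `line_start_index` name are assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES   64
#define ENTRY_BYTES  16
#define PER_LINE     (LINE_BYTES / ENTRY_BYTES)   /* 4 entries per line */

/* Map a hash to the first entry of its cache line.  Assuming the table
   itself starts on a line boundary, a group of PER_LINE probes starting
   here never crosses into a second line. */
size_t line_start_index(uint64_t hash, size_t nentries)
{
    size_t i = (size_t)(hash % nentries);
    return i & ~(size_t)(PER_LINE - 1);
}
```

Without the final mask, a group of four 16-byte probes starting at an arbitrary index straddles two lines roughly three times out of four, which is the "two cache misses on average" objection raised above.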
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>if you use 2 tables as you do now you use 800 clocks.
>>>>>>>>
>>>>>>>>it is trivial that if the 2 extra threads in crafty can save a few
>>>>>>>>references to the DDR ram and get the data from the L2 cache instead,
>>>>>>>>that's a pretty important basis for a nodes-per-second win in crafty
>>>>>>>>when using mt=4 at a dual xeon.
>>>>>>>
>>>>>>>
>>>>>>>I can turn off the second probe and my nps doesn't increase much.  I'll
>>>>>>>post the data later tonight.  That simply means that the second probe is
>>>>>>>not a problem.  And the way the pipeline works, both loads are sent off
>>>>>>>back-to-back so that there is not a clean 400 clock wait for the first
>>>>>>>and another 400 clock wait for the second.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>extra search overhead caused by the extra processor.  Which means that for the
>>>>>>>>>time being,
>>>>>>>>>it is possible to search almost exactly twice as fast using two cpus, although
>>>>>>>>>this comparison
>>>>>>>>>is not exactly a correct way to compare things.
>>>>>>>>
>>>>>>>>no it's not possible to search 2 times faster. the problem is that the
>>>>>>>>L1 cache and the trace cache are too small and it can't even feed 1
>>>>>>>>processor when decoding instructions, not to mention 2.
>>>>>>>
>>>>>>>It is _definitely_ possible.  I'll post the data as I have already done
>>>>>>>this once...  The drawback is that while it is searching 2x faster, it is
>>>>>>>using 4 threads, so that 2x is not exactly a fair comparison.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>apart from that there are other technical problems when both processors
>>>>>>>>want something.
>>>>>>>
>>>>>>>So?  this is common in operating system process scheduling also..
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>but 18% speedup from it is better than nothing.
>>>>>>>>
>>>>>>>>too bad that because of that 18% speedup the processor is 2 times more
>>>>>>>>expensive.
>>>>>>>
>>>>>>>
>>>>>>>The 2.8 xeons are going for around $450.00...
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Because a dual Xeon 2.8Ghz, which i will assume also has a comparison
>>>>>>>>>>factor of 1.4 then (assuming not cas2 ddr ram but of course ecc
>>>>>>>>>>registered, which eats extra time)
>>>>>>>>>
>>>>>>>>>However the xeon has 2-way memory interleaving which runs the bandwidth way up
>>>>>>>>>compared to the desktop PIV system.
>>>>>>>>
>>>>>>>>2 way memory interleaving for 4 processes doesn't kick butt.
>>>>>>>
>>>>>>>It does better than no interleaving, by a factor of 2.0...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>the problem is you lose time to the ECC and registered features of the
>>>>>>>>memory you need for the dual. of course that's the case for all duals.
>>>>>>>>both K7 MP and Xeon suffer from that regrettably.
>>>>>>>
>>>>>>>That is not true.  The duals do _not_ have to have ECC ram.  And it
>>>>>>>doesn't appear to be any slower than non-ECC ram, although I will be able
>>>>>>>to test that before long as we have some non-ECC machines coming in.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>A result is that single cpu tests can be carried out much faster in general.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>That means that the equivalent K7 will be a dual K7 2.0Ghz, thereby
>>>>>>>>>>still not taking into account 3 things
>>>>>>>>>>
>>>>>>>>>>  a) my diep version was msvc compiled with processorpack (sp4)
>>>>>>>>>>     so it was simply not optimized for K7 at all, but more for p4
>>>>>>>>>>     than it was optimized for K7. Not using MMX of course (would
>>>>>>>>>>     slow down on P4 and let the K7 look relatively better).
>>>>>>>>>>  b) speedup at 4 processors is a lot worse than at 2 processors
>>>>>>>>>>     so when i run diep with 4 processes at the dual Xeon 2.8
>>>>>>>>>>     the expectation is that the K7 dual 2.0 Ghz will outgun it
>>>>>>>>>>     by quite some margin.
>>>>>>>>>>  c) that dual k7 2.0Ghz is less than half the price of a dual P4 2.8Ghz
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>There are no dual PIV's at the moment.  Only dual xeons.  Xeons are
>>>>>>>>>_not_ PIV's....  For several reasons that can be found on the Intel
>>>>>>>>>web site.  That's why
>>>>>>>>
>>>>>>>>they also fit in different slots for example. like 603 for xeon and 478 or
>>>>>>>>something for the P4.
>>>>>>>
>>>>>>>xeon has 603 and 604 pin sockets to separate them.
>>>>>>>
>>>>>>>>
>>>>>>>>>xeons are considered to be their "server class chips" while the PIV is their
>>>>>>>>>"desktop
>>>>>>>>>class chip".
>>>>>>>>
>>>>>>>>the core is the same however. So a 3.06Ghz Xeon, when it gets released
>>>>>>>>and is put single cpu in a mainboard, won't be faster than a P4 3.06Ghz.
>>>>>>>
>>>>>>>Probably will, as the xeons use a different chipset which can support
>>>>>>>interleaving where the desktop chipsets don't.
>>>>>>>
>>>>>>>>
>>>>>>>>With some luck by the time they release a 3.06Ghz Xeon they have improved
>>>>>>>>the SMT another bit.
>>>>>>>>
>>>>>>>>Seems to me they've been working for years to get that SMT/HT working better.
>>>>>>>
>>>>>>>Not "for years".  It was announced as a coming thing a couple of years ago and
>>>>>>>several
>>>>>>>vendors have been discussing the idea.  And they are going to increase the ratio
>>>>>>>of physical
>>>>>>>to logical cpus before long also...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>Best regards,
>>>>>>>>>>Vincent




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.