Author: Eugene Nalimov
Date: 20:19:22 12/13/02
The P4 cache line size is 128 bytes for the L2 cache, 64 bytes for the L1 cache.

Thanks,
Eugene

On December 13, 2002 at 22:56:17, Robert Hyatt wrote:

>On December 13, 2002 at 16:41:07, Vincent Diepeveen wrote:
>
>>On December 13, 2002 at 16:03:47, Robert Hyatt wrote:
>>
>>Your math is wrong for many reasons.
>>
>>0) It isn't 32 bytes but 64 bytes that you get at once, guaranteed.
>
>Depends on the processor. For the PIII and earlier it is _32_ bytes. For
>the PIV it is 128 bytes. I think AMD is 64...
>
>>1) You can guarantee that you need just one cache line.
>
>Yes you can, by making sure you probe at a starting address that is
>exactly divisible by the cache line size. Are you doing that? Are you
>sure your table is initially aligned on a multiple of the cache line
>size? Didn't think so. You can't control malloc() that well yet...
>And it isn't smart enough to know it should do that, particularly when
>the alignment is processor dependent.
>
>>2) Even if you need 2 cache lines, those aren't 400 clocks each:
>>the first one is 400, and the second one is only a very small part
>>of that (consider the bandwidth the memory delivers!).
>
>Try again. You burst-load one cache line and that is all. The first 8
>bytes come across after a long latency. The rest of the line bursts in
>quickly. For the next cache miss, back you go to the long latency.
>
>>3) You get more transpositions from 4 probes. It works better than 2,
>>and a *lot* better.
>
>I don't think a "lot" better. I ran this test. The first Crafty versions
>used an N-probe approach, as that was what I did in Cray Blitz. Multiple
>probes are somewhat better, but not "a lot" better, and the memory
>bandwidth hurts. Probing consecutive addresses is bad from a theoretical
>hashing point of view, as it leads to "chaining". Probing non-consecutive
>addresses is bad from a cache-prefetch point of view.
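[Editor's note: Hyatt's alignment point can be sketched in C. This is an illustration, not code from Crafty or DIEP; the 64-byte line size, the 16-byte entry layout, and the `aligned_table` helper are all assumptions for the sketch. Since plain malloc() makes no cache-line promise, the usual trick is to over-allocate and round the pointer up:]

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed line size: 32 on the PIII, 64/128 on the P4 */

/* A hypothetical 16-byte hash entry: four of them fit in one 64-byte line. */
typedef struct {
    uint64_t key;
    uint64_t data;
} hash_entry;

/* malloc() gives no cache-line guarantee, so over-allocate by one line and
   round the pointer up to the next CACHE_LINE boundary; the caller keeps
   *raw so the block can still be passed to free(). */
static hash_entry *aligned_table(size_t n_entries, void **raw)
{
    *raw = malloc(n_entries * sizeof(hash_entry) + CACHE_LINE - 1);
    if (!*raw)
        return NULL;
    uintptr_t p = ((uintptr_t)*raw + CACHE_LINE - 1)
                  & ~(uintptr_t)(CACHE_LINE - 1);
    return (hash_entry *)p;
}
```

[With the table start forced onto a line boundary, a probe address that is a multiple of CACHE_LINE never straddles two lines, which is the precondition for the one-miss multi-probe argued about below.]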
>
>>4) Compare that with the certainty of 2 very slow cache lines, which you
>>get from separated parts in memory: a guaranteed 800 clocks, or possibly
>>50% of the total system time.
>
>Ditto for yours. 800 clocks too...
>
>>Best regards,
>>Vincent
>
>>>On December 13, 2002 at 15:43:52, Vincent Diepeveen wrote:
>>>
>>>>On December 13, 2002 at 14:33:47, Robert Hyatt wrote:
>>>>
>>>>>On December 13, 2002 at 14:08:29, Vincent Diepeveen wrote:
>>>>>
>>>>>>Hello,
>>>>>>
>>>>>>Here are some test results of DIEP, thanks to Chad Cowan, on an
>>>>>>Asus motherboard with HT turned on (amazingly it is no longer
>>>>>>called SMT; I forgot which manufacturer calls it HT and which one
>>>>>>SMT. I guess it's Hyper-Threading now for Intel).
>>>>>
>>>>>It is _both_. SMT and HT. You can find either term listed on Intel's
>>>>>web site.
>>>>>
>>>>>>HT turned on in all cases:
>>>>>>
>>>>>>bus 533MHz, memory 133MHz (DDR SDRAM, CAS 2)
>>>>>>single-cpu P4 3.105GHz (bus 135MHz by default, not 133): 101394
>>>>>>single-cpu P4 3.105GHz, now 2 DIEP processes: 120095
>>>>>>
>>>>>>So a speedup of about 18% for HT. Not bad. Not good either, knowing
>>>>>>DIEP hardly locks.
>>>>>
>>>>>It isn't just a lock issue. If both threads are banging on memory, it
>>>>>can't run much faster, as it still serializes the memory reads and they
>>>>>are slow.
>>>>>
>>>>>Hmm... Aren't you the same person that was saying "hyper-threading
>>>>>doesn't work" and "hyper-threading only works on machines that won't be
>>>>>available for 1-2 years in the future"? And that "Nalimov is running on
>>>>>a machine that nobody can buy" and "the 2.8GHz Xeon doesn't support
>>>>>hyper-threading"? And so forth?
>>>>
>>>>Intel's marketing department has already been busy with an SMT campaign
>>>>for 2 years. Only now in the USA are there a few 2.8GHz Xeons and 3GHz
>>>>P4s.
>>>>
>>>>They say only the 3GHz P4s and Xeon MPs have HT/SMT support, whereas
>>>>the old SMT worked badly for me and DIEP.
>>>
>>>What "old SMT"? There isn't any.
>>>
>>>>Also, 18% speedup isn't much, knowing their cpu is already quite a bit
>>>>slower than the K7 is.
>>>>
>>>>Would WaitForSingleObject() in Windows speed things up more than
>>>>hammering on the same cache line?
>>>
>>>Read the Intel website about spinlocks. Spinlocks are better for
>>>short-duration waits. O/S blocking calls are better for long waits. This
>>>has always been the case, but since I don't have any "long wait times" I
>>>only use spins.
>>>
>>>>That means I lose some 2ms per process, though, to wake it up when it's
>>>>asleep.
>>>>
>>>>Not a holy-grail solution either, seemingly.
>>>
>>>You don't have to take that penalty. Just spin "properly".
>>>
>>>>>>However, there is one problem I have with it when I compare the speed
>>>>>>of the same version on a 2.4GHz Northwood.
>>>>>>
>>>>>>That 2.4GHz is exactly the speed of a K7 at 1.6GHz.
>>>>>>
>>>>>>Now the same K7, same version, logs:
>>>>>>  single cpu : 82499
>>>>>>  dual       : 154293
>>>>>>
>>>>>>Note that the K7 has way, way slower RAM and chipset: 133MHz
>>>>>>registered CAS 2.5, I guess, versus fast CAS 2 (like 2T less latency,
>>>>>>so 10 versus 12T or something) for the P4. The P4 was a single cpu.
>>>>>>
>>>>>>But here is the math, for those who still read here, that's
>>>>>>interesting to hear.
>>>>>>
>>>>>>The single-cpu speed difference is:
>>>>>>  P4 3.06GHz is faster: 22.9%
>>>>>>
>>>>>>Based upon the speed it is actually clocked at (3105MHz),
>>>>>>we would expect a speedup of 3.105 / 2.4 = 29.4%.
>>>>>>
>>>>>>So somehow we lose around 7% in the process.
>>>>>
>>>>>Memory is no faster. So there is going to be a loss every time the cpu
>>>>>clock is ramped up a notch. Always has been. Always will be, until DRAM
>>>>>disappears.
>>>>
>>>>Yes, the I/O bottleneck will become more and more of a problem.
>>>>
>>>>>>Now it wins another 18% or so when it gets run with 2 processes.
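[Editor's note: Hyatt's spin-versus-block advice can be illustrated with a minimal test-and-set spinlock. This is a hedged sketch using C11 atomics, which postdate this thread and are not the actual Crafty lock; Intel's guidance additionally recommends a PAUSE instruction in the wait loop on HT processors.]

```c
#include <stdatomic.h>

/* Minimal test-and-set spinlock of the short-wait kind Hyatt describes. */
typedef atomic_flag spinlock;

static void spin_init(spinlock *l)
{
    atomic_flag_clear(l);  /* start unlocked */
}

static void spin_lock(spinlock *l)
{
    /* Busy-wait: far cheaper than an O/S blocking call (and its ~2ms
       wakeup, as Vincent measured) when the expected wait is only a few
       hundred clocks. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```

[The acquire/release ordering is what makes the protected data visible across threads; a long-duration wait would instead use an O/S primitive such as WaitForSingleObject(), as the thread discusses.]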
>>>>>>If I compare that with a single-cpu K7, to get the relative speed of
>>>>>>a P4 GHz versus a K7 GHz, then we get the next comparison:
>>>>>>
>>>>>>  1.6GHz * (120k / 82k) = 2.33GHz
>>>>>>
>>>>>>So a 2.33GHz K7 should be equally fast as a P4 at such a speed.
>>>>>>Of course, assuming linear scaling.
>>>>>>
>>>>>>Now we calculate what a 1GHz K7 compares to in P4 speed: 1.33.
>>>>>>
>>>>>>So DDR RAM proves to be the big winner for the P4. SMT in itself
>>>>>>is just a trick that works for me because my parallelism is
>>>>>>pretty OK, and most likely not for everyone.
>>>>>
>>>>>Works just fine for me too, as I have already reported, and as have
>>>>>Eugene and others...
>>>>
>>>>Initially I thought the P4 would suck forever, but I didn't realize the
>>>>major negative impact RDRAM had at the time.
>>>>
>>>>>>Now of course it's questionable whether that 18% speedup in nodes
>>>>>>per second also results in an actual positive speedup in ply depth.
>>>>>>
>>>>>>For DIEP it is, but it's not so impressive at all.
>>>>>
>>>>>Nope. But that is not a processor issue, that is a search issue. The
>>>>>_cpu_ is faster with SMT on. Just because a chess engine can't use that
>>>>>very well doesn't mean that other applications without the
>>>>>search-overhead issue won't benefit, and in fact they do benefit pretty
>>>>>well...
>>>>
>>>>DIEP makes perfect use of it. Of course it's evaluating most of the
>>>>time, and if that code is not in the trace cache, then the processor has
>>>>the problem that it can only serve 1 process at a time.
>>>
>>>It only serves one at a time most of the time. It is interlacing
>>>micro-ops, but most of the time one thread is blocked waiting on a memory
>>>read, which is the slowest thing on the processor.
>>>
>>>>>The interesting thing I have noted is that the SMT benefit just about
>>>>>offsets my parallel search overhead for the typical case. If I run a
>>>>>single thread on my 2.8 Xeon, I get a search time of X.
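[Editor's note: Vincent's GHz-equivalence arithmetic above can be checked mechanically from the node counts quoted earlier (82499 nodes/s for the 1.6GHz K7, 120095 for the 3.105GHz P4 with two processes). The helper names are the editor's own, not anything from DIEP.]

```c
/* K7 clock that would match the P4's node rate, assuming linear scaling:
   1.6GHz * (120095 / 82499) ~= 2.33GHz. */
static double k7_equivalent_ghz(void)
{
    return 1.6 * (120095.0 / 82499.0);
}

/* How many P4 GHz one K7 GHz is worth: 3.105 / 2.33 ~= 1.33,
   i.e. Vincent's "1GHz K7 compares to 1.33" figure. */
static double p4_per_k7_ghz(void)
{
    return 3.105 / k7_equivalent_ghz();
}
```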
>>>>>If I run four threads, to use both cpus with HT enabled, I get a
>>>>>search time of very close to X/2. The 20-30% speedup by HT is just
>>>>>about what it takes to offset the
>>>>
>>>>20-30% is quite an overestimation of the speedup by HT. You gotta
>>>>improve hashtable management a bit then, like 4 probes within the same
>>>>cache line instead of a 2-table approach.
>>>
>>>Your 3-4 probe idea is not so good, for reasons I mentioned already. You
>>>are going to _average_ two cache line reads anyway, because the first
>>>table entry is not guaranteed to be on the leading edge of the cache
>>>line. On average, you are going to load stuff that is _before_ the entry
>>>you want as well as after. And on average, with 16-byte entries, I would
>>>expect to see (with 32-byte cache lines) that one probe takes one cache
>>>miss, and the second will take another cache miss 1/2 of the time. Four
>>>probes guarantees that you are going to get at least two cache misses
>>>most of the time, which is _exactly_ what I get with two separate tables.
>>>No loss or gain...
>>>
>>>>Crafty needs 1600 clocks per node on average (K7 timing, a bit more for
>>>>P4, but same idea).
>>>>
>>>>On each RAM reference (DDR RAM) you lose like 400 clocks.
>>>>
>>>>If you use 1 table you lose 400 clocks and can do 4 probes in it.
>>>
>>>No you can't, for the reason I gave above. Unless you force your hash
>>>probe address to be a multiple of the cache line size, so that you always
>>>get the first entry you want at the beginning of the cache line.
>>>
>>>>If you use 2 tables, as you do now, you use 800 clocks.
>>>>
>>>>It is trivial that if the 2 extra threads in Crafty can save a few
>>>>references to the DDR RAM and get them from the L2 cache, that's a
>>>>pretty important basis for a nodes-per-second win in Crafty when using
>>>>mt=4 on a dual Xeon.
>>>
>>>I can turn off the second probe and my nps doesn't increase much.
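[Editor's note: the 4-probes-in-one-line scheme Vincent describes can be sketched as a bucketed probe. Assuming a line-aligned table of 16-byte entries and a 64-byte line, the hash index selects a 4-entry bucket, so all four probes share a single cache miss; entry layout and names are the editor's assumptions, not DIEP's actual code.]

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET 4  /* four 16-byte entries = one assumed 64-byte cache line */

/* Hypothetical 16-byte transposition-table entry. */
typedef struct {
    uint64_t key;
    uint64_t data;
} hash_entry;

/* The index selects a whole bucket, so - provided the table itself starts
   on a cache-line boundary - probing all four entries touches only the
   one line the first probe already fetched. */
static hash_entry *probe_bucket(hash_entry *table, size_t n_buckets,
                                uint64_t key)
{
    hash_entry *b = &table[(key % n_buckets) * BUCKET];
    for (int i = 0; i < BUCKET; i++)
        if (b[i].key == key)
            return &b[i];
    return NULL;
}
```

[This is exactly where Hyatt's alignment objection bites: without the aligned table start, a bucket can straddle two lines and the one-miss property is lost.]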
>>>I'll post the data later tonight. That simply means that the second
>>>probe is not a problem. And the way the pipeline works, both loads are
>>>sent off back-to-back, so there is not a clean 400-clock wait for the
>>>first and another 400-clock wait for the second.
>>>
>>>>>extra search overhead caused by the extra processor. Which means that,
>>>>>for the time being, it is possible to search almost exactly twice as
>>>>>fast using two cpus, although this comparison is not exactly a correct
>>>>>way to compare things.
>>>>
>>>>No, it's not possible to search 2 times faster. The problem is that the
>>>>L1 cache and the trace cache are too small, and it can't even feed 1
>>>>processor when decoding instructions, not to mention 2.
>>>
>>>It is _definitely_ possible. I'll post the data, as I have already done
>>>this once... The drawback is that while it is searching 2x faster, it is
>>>using 4 threads, so that 2x is not exactly a fair comparison.
>>>
>>>>Apart from that, there are other technical problems when both
>>>>processors want something.
>>>
>>>So? This is common in operating system process scheduling also...
>>>
>>>>But 18% speedup from it is better than nothing.
>>>>
>>>>Too bad that because of that 18% speedup the processor is 2 times more
>>>>expensive.
>>>
>>>The 2.8 Xeons are going for around $450.00...
>>>
>>>>>>Because a dual Xeon 2.8GHz, which I will assume also has a comparison
>>>>>>factor of 1.4 then (assuming not CAS 2 DDR RAM but of course ECC
>>>>>>registered, which eats extra time)
>>>>>
>>>>>However, the Xeon has 2-way memory interleaving, which runs the
>>>>>bandwidth way up compared to the desktop PIV system.
>>>>
>>>>2-way memory interleaving for 4 processes doesn't kick butt.
>>>
>>>It does better than no interleaving, by a factor of 2.0...
>>>
>>>>The problem is you lose time to the ECC and registered features of the
>>>>memory you need for the dual.
>>>>Of course that's the case for all duals. Both K7 MP and Xeon suffer
>>>>from that, regrettably.
>>>
>>>That is not true. The duals do _not_ have to have ECC RAM. And it
>>>doesn't appear to be any slower than non-ECC RAM, although I will be able
>>>to test that before long, as we have some non-ECC machines coming in.
>>>
>>>>A result is that single-cpu tests can in general be carried out much
>>>>faster.
>>>
>>>>>>That means that the equivalent K7 will be a dual K7 2.0GHz, thereby
>>>>>>still not taking into account 3 things:
>>>>>>
>>>>>> a) My DIEP version was MSVC-compiled with the processor pack (SP4),
>>>>>>    so it was simply not optimized for the K7 at all, but optimized
>>>>>>    more for the P4 than for the K7. Not using MMX, of course (that
>>>>>>    would slow down the P4 and make the K7 look relatively better).
>>>>>> b) Speedup at 4 processors is a lot worse than at 2 processors, so
>>>>>>    when I run DIEP with 4 processes on the dual Xeon 2.8, the
>>>>>>    expectation is that the dual K7 2.0GHz will outgun it by quite
>>>>>>    some margin.
>>>>>> c) That dual K7 2.0GHz is less than half the price of a dual P4
>>>>>>    2.8GHz.
>>>>>
>>>>>There are no dual PIVs at the moment. Only dual Xeons. Xeons are _not_
>>>>>PIVs... for several reasons that can be found on the Intel web site.
>>>>>That's why
>>>>
>>>>They also fit in different sockets, for example: 603 for the Xeon and
>>>>478 or something for the P4.
>>>
>>>Xeon has 603- and 604-pin sockets to separate them.
>>>
>>>>>Xeons are considered to be their "server class chips" while the PIV is
>>>>>their "desktop class chip".
>>>>
>>>>The core is the same, however. So a 3.06GHz Xeon, when it gets
>>>>released, won't be any faster than a P4 3.06GHz when put single-cpu in
>>>>a mainboard.
>>>
>>>It probably will, as the Xeons use a different chipset which can support
>>>interleaving, where the desktop chipsets don't.
>>>
>>>>With some luck, by the time they release a 3.06GHz Xeon they will have
>>>>improved the SMT another bit.
>>>>
>>>>Seems to me they have been working for years to slowly get that SMT/HT
>>>>working better.
>>>
>>>Not "for years". It was announced as a coming thing a couple of years
>>>ago, and several vendors have been discussing the idea. And they are
>>>going to increase the ratio of physical to logical cpus before long
>>>also...
>>>
>>>>>>Best regards,
>>>>>>Vincent
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.