Author: Eugene Nalimov
Date: 22:31:13 12/13/02
On December 14, 2002 at 00:56:19, Robert Hyatt wrote:

>On December 13, 2002 at 23:27:02, Robert Hyatt wrote:
>
>>On December 13, 2002 at 23:19:22, Eugene Nalimov wrote:
>>
>>>P4 cache line size is 128 for L2 cache, 64 for L1 cache.
>>
>>Right. But isn't any cache line miss that requires memory going to have
>>to read 128 bytes? If it isn't in L1, and it isn't in L2, it is going to
>>fill the L2 line, 128 bytes. If it is not in L1 but is in L2, then the
>>memory read isn't needed and the latency issue for memory (as discussed
>>here) doesn't apply??
>
>I went back to study the PIV a bit more and it appears my initial thought
>was correct. Everything that exists in L1 cache is guaranteed to also exist
>in L2. L2 sucks in data from memory, and it is the only cache that talks
>directly to memory. The 20+K L1 micro-op cache and the 8K L1 data cache
>talk only to L2. (Intel says 12K micro-ops, which seems to translate into
>roughly 21K bytes of micro-ops, just to make this compare with the older
>PIII and its 16K L1 I and D caches. The PIII has more Dcache, but less
>Icache.)
>
>The only issue is that a couple of the Linux guys once quoted the Intel PIV
>specs as saying 64 bytes/line for L2 as well as L1, which contradicts
>something I had read (I think) on the Intel web site. Perhaps this was tuned
>for RDRAM but not done on DDR RAM. I will hedge on this until I find a clear
>and precise answer, which I have not been able to do tonight, so far.

"Intel Pentium 4 and Intel Xeon Processor Optimization",
ftp://download.intel.com/design/Pentium4/manuals/24896607.pdf,
Table 1-1 on page 1-19.

Thanks,
Eugene

>>>
>>>Thanks,
>>>Eugene
>>>
>>>On December 13, 2002 at 22:56:17, Robert Hyatt wrote:
>>>
>>>>On December 13, 2002 at 16:41:07, Vincent Diepeveen wrote:
>>>>
>>>>>On December 13, 2002 at 16:03:47, Robert Hyatt wrote:
>>>>>
>>>>>your math is wrong for many reasons.
>>>>>
>>>>>0) it isn't 32 bytes but 64 bytes that you get at once,
>>>>>   guaranteed.
>>>>
>>>>Depends on the processor. For the PIII and earlier, it is _32_ bytes. For
>>>>the PIV it is 128 bytes. I think AMD is 64...
>>>>
>>>>>1) you can guarantee that you need just one cache line
>>>>
>>>>Yes you can, by making sure you probe at a starting address that is
>>>>exactly divisible by the cache line size. Are you doing that? Are
>>>>you sure your table is initially aligned on a multiple of the cache line
>>>>size? Didn't think so. You can't control malloc() that well yet...
>>>>And it isn't smart enough to know it should do that, particularly when
>>>>the alignment is processor dependent.
>>>>
>>>>>2) even if you need 2, then those aren't 400 clocks each
>>>>>   cache line, but the first one is 400 and the second
>>>>>   one is only a very small part of that (consider the
>>>>>   bandwidth the memory delivers!)
>>>>
>>>>Try again. You burst-load one cache line and that is all. The first 8
>>>>bytes come across after a long latency. The rest of the line bursts in
>>>>quickly. For the next cache miss, back you go to the long latency.
>>>>
>>>>>3) you get more transpositions from 4 probes. it works better than 2,
>>>>>   and a *lot* better.
>>>>
>>>>I don't think a "lot" better. I ran this test. The first Crafty versions
>>>>used an N-probe approach, as that was what I did in Cray Blitz. Multiple
>>>>probes are somewhat better. But not "a lot" better. And the memory
>>>>bandwidth hurts. Probing consecutive addresses is bad from a theoretical
>>>>hashing point of view as it leads to "chaining". Probing non-consecutive
>>>>addresses is bad from a cache pre-fetch point of view.
>>>>
>>>>>3) compare with the sureness of 2 very slow cache lines which you
>>>>>   get from separated parts in memory, which is a guaranteed 800 clocks,
>>>>>   or possibly 50% of the total system time.
>>>>
>>>>Ditto for yours. 800 clocks too...
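[Editor's note: Hyatt's alignment point above can be sketched in C. This is a minimal illustration with hypothetical names, not code from either engine; a 64-byte line size is assumed, where real code would detect it per processor. To guarantee that a run of probes costs a single cache line fill, the table base must be line-aligned and the probe index masked down to a bucket boundary.]

```c
#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 64              /* assumed cache-line size in bytes */

typedef struct {                  /* 16-byte entry; 4 fit in one 64-byte line */
    uint64_t key;
    uint64_t data;
} Entry;

/* Allocate the table so its base address is a multiple of LINE_SIZE,
 * something plain malloc() does not promise. */
static Entry *alloc_table(size_t n_entries)
{
    void *p = NULL;
    if (posix_memalign(&p, LINE_SIZE, n_entries * sizeof(Entry)) != 0)
        return NULL;
    return (Entry *)p;
}

/* Mask the probe index down to the start of its 4-entry bucket, so all
 * four probes fall inside one cache line (n_entries must be a multiple
 * of the entries-per-line count). */
static size_t bucket_start(uint64_t hash, size_t n_entries)
{
    size_t per_line = LINE_SIZE / sizeof(Entry);      /* = 4 here */
    return (size_t)(hash % n_entries) & ~(per_line - 1);
}
```

With both conditions in place, a 4-probe bucket costs exactly one line fill; with neither, Hyatt's objection applies and the bucket can straddle two lines.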
>>>>
>>>>>
>>>>>Best regards,
>>>>>Vincent
>>>>>
>>>>>>On December 13, 2002 at 15:43:52, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On December 13, 2002 at 14:33:47, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On December 13, 2002 at 14:08:29, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>hello,
>>>>>>>>>
>>>>>>>>>Here are some test results of DIEP, thanks to Chad Cowan, on an
>>>>>>>>>Asus motherboard with HT turned on (amazingly it is no longer
>>>>>>>>>called SMT; I forgot which manufacturer calls it HT and
>>>>>>>>>which one SMT. I guess it's Hyperthreading now for Intel).
>>>>>>>>
>>>>>>>>It is _both_. SMT and HT. You can find either term listed on Intel's
>>>>>>>>web site.
>>>>>>>>
>>>>>>>>>HT turned on in all cases:
>>>>>>>>>
>>>>>>>>>bus 533MHz, memory 133MHz (DDR SDRAM, CAS 2)
>>>>>>>>>single cpu P4 3.105GHz (bus 135MHz by default, not 133): 101394
>>>>>>>>>single cpu P4 3.105GHz, now 2 DIEP processes: 120095
>>>>>>>>>
>>>>>>>>>So a speedup of about 18% for HT. Not bad. Not good either, knowing
>>>>>>>>>DIEP hardly locks.
>>>>>>>>
>>>>>>>>It isn't just a lock issue. If both threads are banging on memory,
>>>>>>>>it can't run much faster, as it still serializes the memory reads
>>>>>>>>and they are slow.
>>>>>>>>
>>>>>>>>Hmm.. Aren't you the same person that was saying "hyper-threading
>>>>>>>>doesn't work" and "hyper-threading only works on machines that won't
>>>>>>>>be available for 1-2 years in the future"?? And that "Nalimov is
>>>>>>>>running on a machine that nobody can buy" and "the 2.8GHz Xeon
>>>>>>>>doesn't support hyper-threading"? And so forth???
>>>>>>>
>>>>>>>Intel's marketing department has been busy with an SMT campaign for
>>>>>>>2 years already. Only now are there a few 2.8GHz Xeons and 3GHz P4s
>>>>>>>in the USA.
>>>>>>>
>>>>>>>They say only the 3GHz P4s and Xeon MPs have HT/SMT support, whereas
>>>>>>>the old SMT worked badly for me and DIEP.
>>>>>>
>>>>>>What "old SMT"?? There isn't any.
>>>>>>
>>>>>>>
>>>>>>>Also, an 18% speedup isn't much, knowing their cpu is already quite
>>>>>>>a bit slower than the K7 is.
>>>>>>>
>>>>>>>Would WaitForSingleObject() in windoze speed things up more than
>>>>>>>hammering on the same cache line?
>>>>>>
>>>>>>Read the Intel website about spinlocks. Spinlocks are better for
>>>>>>short-duration waits. O/S blocking calls are better for long waits.
>>>>>>This has always been the case, but since I don't have any "long wait
>>>>>>times" I only use spins.
>>>>>>
>>>>>>>
>>>>>>>It means I lose some 2ms per process, though, to wake it up when
>>>>>>>it's asleep.
>>>>>>>
>>>>>>>Not a holy grail solution either, seemingly.
>>>>>>
>>>>>>You don't have to take that penalty. Just spin "properly".
>>>>>>
>>>>>>>>>
>>>>>>>>>However there is one problem I have with it when I compare that
>>>>>>>>>speed with the same version on a 2.4GHz Northwood.
>>>>>>>>>
>>>>>>>>>That 2.4GHz is exactly the speed of a K7 at 1.6GHz.
>>>>>>>>>
>>>>>>>>>Now the same K7, same version, logs:
>>>>>>>>>  single cpu : 82499
>>>>>>>>>  dual       : 154293
>>>>>>>>>
>>>>>>>>>Note that the K7 has way, way slower RAM and chipset: 133MHz
>>>>>>>>>registered CAS 2.5, I guess, versus fast CAS 2 (like 2T less for
>>>>>>>>>latency, so 10 versus 12T or something) for the P4. The P4 was a
>>>>>>>>>single cpu.
>>>>>>>>>
>>>>>>>>>But here is the math, for those who still read here, that's
>>>>>>>>>interesting to hear.
>>>>>>>>>
>>>>>>>>>Single cpu speed difference:
>>>>>>>>>  P4 3.06GHz is faster: 22.9%
>>>>>>>>>
>>>>>>>>>Based upon the speed it is clocked at (3105MHz),
>>>>>>>>>we would expect a speedup of 3.105 / 2.4 = 29.4%.
>>>>>>>>>
>>>>>>>>>So somehow we lose around 7% in the process.
>>>>>>>>
>>>>>>>>Memory is no faster. So there is going to be a loss every time the
>>>>>>>>cpu clock is ramped up a notch. Always has been. Always will be,
>>>>>>>>until DRAM disappears.
>>>>>>>
>>>>>>>Yes, the I/O bottleneck will be more and more of a problem.
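[Editor's note: Hyatt's spinlock advice quoted above (spin for short waits, fall back to the O/S for long ones) can be sketched with C11 atomics. This is a hedged illustration, not Crafty's actual code; the SPIN_LIMIT threshold and all names are assumptions, and sched_yield() stands in for whatever blocking call the platform offers.]

```c
#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 10000   /* arbitrary cutoff between "short" and "long" waits */

typedef atomic_flag spinlock_t;

/* Spin briefly in user space; if the lock stays held, yield to the O/S
 * scheduler instead of burning cycles (and an HT sibling's resources). */
static void lock(spinlock_t *l)
{
    int spins = 0;
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
        if (++spins > SPIN_LIMIT) {
            sched_yield();     /* long wait: let the O/S run someone else */
            spins = 0;
        }
    }
}

static void unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
```

This keeps the fast path (uncontended lock) to a single atomic operation, while avoiding the roughly 2ms wakeup cost Vincent mentions for a blocking call except when the wait is genuinely long.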
>>>>>>>
>>>>>>>>>
>>>>>>>>>Now it wins another 18% or so when it gets run with 2 processes.
>>>>>>>>>If I compare that with a single cpu K7, to get the relative
>>>>>>>>>speed of a P4 GHz versus a K7 GHz, we get the next comparison:
>>>>>>>>>
>>>>>>>>>  1.6GHz * (120k / 82k) = 2.33GHz
>>>>>>>>>
>>>>>>>>>So a 2.33GHz K7 should be equally fast to a P4 at such a speed.
>>>>>>>>>Of course assuming linear scaling.
>>>>>>>>>
>>>>>>>>>Now we calculate what a 1GHz K7 compares to in speed with a P4:
>>>>>>>>>1.33.
>>>>>>>>>
>>>>>>>>>So DDR RAM proves to be the big winner for the P4. SMT in itself
>>>>>>>>>is just a trick that works for me because my parallelism is
>>>>>>>>>pretty ok, and most likely not for everyone.
>>>>>>>>
>>>>>>>>Works just fine for me too, as I have already reported, and as have
>>>>>>>>Eugene and others...
>>>>>>>
>>>>>>>Initially I thought the P4 would suck forever, but I didn't realize
>>>>>>>the major negative impact RDRAM had at the time.
>>>>>>>
>>>>>>>>>
>>>>>>>>>Now of course it's questionable whether that 18% speedup in nodes
>>>>>>>>>a second also results in actual positive speedup in ply depth.
>>>>>>>>>
>>>>>>>>>For DIEP it does, but it's not so impressive at all.
>>>>>>>>
>>>>>>>>Nope. But that is not a processor issue, that is a search issue.
>>>>>>>>The _cpu_ is faster with SMT on. Just because a chess engine can't
>>>>>>>>use that very well doesn't mean that other applications without the
>>>>>>>>search overhead issue won't benefit, and in fact they do benefit
>>>>>>>>pretty well...
>>>>>>>
>>>>>>>DIEP is making perfect use of it. Of course it's evaluating most of
>>>>>>>the time, and if that code is not in the trace cache, then the
>>>>>>>processor has the problem that it can only serve 1 process at a time.
>>>>>>
>>>>>>It only serves one at a time most of the time. It is interlacing
>>>>>>micro-ops, but most of the time one thread is blocked waiting on a
>>>>>>memory read, which is the slowest thing on the processor.
>>>>>>
>>>>>>>
>>>>>>>>The interesting thing I have noted is that the SMT benefit just
>>>>>>>>about offsets my parallel search overhead for the typical case. If
>>>>>>>>I run a single thread on my 2.8 Xeon, I get a search time of X. If
>>>>>>>>I run four threads, to use both cpus with HT enabled, I get a
>>>>>>>>search time of very close to X/2. The 20-30% speedup by HT is just
>>>>>>>>about what it takes to offset the
>>>>>>>
>>>>>>>20-30% is quite an overestimation of the speedup by HT. You gotta
>>>>>>>improve your hashtable management a bit then, like 4 probes within
>>>>>>>the same cache line instead of a 2-table approach.
>>>>>>
>>>>>>Your 3-4 probe idea is not so good, for reasons I mentioned already.
>>>>>>You are going to _average_ two cache line reads anyway, because the
>>>>>>first table entry is not guaranteed to be on the leading edge of the
>>>>>>cache line. On average, you are going to load stuff that is _before_
>>>>>>the entry you want as well as after. And on average, with 16-byte
>>>>>>entries, I would expect to see (with 32-byte cache lines) that one
>>>>>>probe takes one cache miss, and the second will take another cache
>>>>>>miss 1/2 of the time. Four probes guarantees that you are going to
>>>>>>get at least two cache misses most of the time, which is _exactly_
>>>>>>what I get with two separate tables. No loss or gain...
>>>>>>
>>>>>>>
>>>>>>>Crafty needs 1600 clocks a node on average (K7 timing, a bit more
>>>>>>>for P4, but same idea).
>>>>>>>
>>>>>>>Each RAM reference (DDR RAM) costs you like 400 clocks.
>>>>>>>
>>>>>>>If you use 1 table you lose 400 clocks and can do 4 probes in it.
>>>>>>
>>>>>>No you can't, for the reason I gave above. Unless you force your hash
>>>>>>probe address to be a multiple of the cache line size, so that you
>>>>>>always get the first entry you want at the beginning of the cache
>>>>>>line.
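[Editor's note: Hyatt's expected-miss arithmetic above can be checked with a tiny helper. This is a sketch for illustration, not code from either engine: given 16-byte entries, 32-byte lines (the PIII-era size he uses), and a table that is not line-aligned, count how many distinct cache lines a run of consecutive probes touches.]

```c
#include <stddef.h>

/* lines_touched: number of distinct cache lines covered by `nbytes`
 * consecutive bytes starting at byte offset `start`, with `line`-byte
 * cache lines. */
static size_t lines_touched(size_t start, size_t nbytes, size_t line)
{
    size_t first = start / line;            /* line holding the first byte */
    size_t last  = (start + nbytes - 1) / line;  /* line holding the last byte */
    return last - first + 1;
}
```

With 16-byte entries an unaligned table starts a bucket at offset 0 or 16 within a 32-byte line with equal probability: two probes (32 bytes) touch 1 line when aligned and 2 when not, matching "another cache miss 1/2 of the time", while four probes (64 bytes) touch 2 or 3 lines, matching "at least two cache misses most of the time".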
>>>>>>
>>>>>>>
>>>>>>>If you use 2 tables, as you do now, you use 800 clocks.
>>>>>>>
>>>>>>>It is trivial that if the 2 extra threads in Crafty can save a few
>>>>>>>references to the DDR RAM and get the data from the L2 cache, that's
>>>>>>>a pretty important basis for a nodes-per-second win in Crafty when
>>>>>>>using mt=4 on a dual Xeon.
>>>>>>
>>>>>>I can turn off the second probe and my nps doesn't increase much.
>>>>>>I'll post the data later tonight. That simply means that the second
>>>>>>probe is not a problem. And the way the pipeline works, both loads
>>>>>>are sent off back-to-back, so that there is not a clean 400-clock
>>>>>>wait for the first and another 400-clock wait for the second.
>>>>>>
>>>>>>>>extra search overhead caused by the extra processor. Which means
>>>>>>>>that for the time being, it is possible to search almost exactly
>>>>>>>>twice as fast using two cpus, although this comparison is not
>>>>>>>>exactly a correct way to compare things.
>>>>>>>
>>>>>>>No, it's not possible to search 2 times faster. The problem is that
>>>>>>>the L1 cache and the trace cache are too small, and it can't even
>>>>>>>feed 1 processor when decoding instructions, not to mention 2.
>>>>>>
>>>>>>It is _definitely_ possible. I'll post the data, as I have already
>>>>>>done this once... The drawback is that while it is searching 2x
>>>>>>faster, it is using 4 threads, so that 2x is not exactly a fair
>>>>>>comparison.
>>>>>>
>>>>>>>
>>>>>>>Apart from that, there are other technical problems when both
>>>>>>>processors want something.
>>>>>>
>>>>>>So? This is common in operating system process scheduling also..
>>>>>>
>>>>>>>
>>>>>>>But 18% speedup from it is better than nothing.
>>>>>>>
>>>>>>>Too bad that because of that 18% speedup the processor is 2 times
>>>>>>>more expensive.
>>>>>>
>>>>>>The 2.8GHz Xeons are going for around $450.00...
>>>>>>
>>>>>>>>>
>>>>>>>>>Because a dual Xeon 2.8GHz, which I will assume also has a compare
>>>>>>>>>of 1.4 then (assuming not CAS 2 DDR RAM but of course ECC
>>>>>>>>>registered, which eats extra time)
>>>>>>>>
>>>>>>>>However the Xeon has 2-way memory interleaving, which runs the
>>>>>>>>bandwidth way up compared to the desktop PIV system.
>>>>>>>
>>>>>>>2-way memory interleaving for 4 processes doesn't kick butt.
>>>>>>
>>>>>>It does better than no interleaving, by a factor of 2.0...
>>>>>>
>>>>>>>
>>>>>>>The problem is you lose time to the ECC and registered features of
>>>>>>>the memory you need for the dual. Of course that's the case for all
>>>>>>>duals. Both the K7 MP and the Xeon suffer from that, regrettably.
>>>>>>
>>>>>>That is not true. The duals do _not_ have to have ECC RAM. And it
>>>>>>doesn't appear to be any slower than non-ECC RAM, although I will be
>>>>>>able to test that before long, as we have some non-ECC machines
>>>>>>coming in.
>>>>>>
>>>>>>>
>>>>>>>A result is that single cpu tests can be carried out much faster in
>>>>>>>general.
>>>>>>>
>>>>>>>>>
>>>>>>>>>That means that the equivalent K7 will be a dual K7 2.0GHz,
>>>>>>>>>thereby still not taking into account 3 things:
>>>>>>>>>
>>>>>>>>> a) my DIEP version was MSVC compiled with the processor pack
>>>>>>>>>    (SP4), so it was simply not optimized for the K7 at all, but
>>>>>>>>>    more for the P4 than for the K7. Not using MMX of course
>>>>>>>>>    (that would slow down the P4 and let the K7 look relatively
>>>>>>>>>    better).
>>>>>>>>> b) speedup at 4 processors is a lot worse than at 2 processors,
>>>>>>>>>    so when I run DIEP with 4 processes on the dual Xeon 2.8,
>>>>>>>>>    the expectation is that the dual K7 2.0GHz will outgun it
>>>>>>>>>    by quite some margin.
>>>>>>>>> c) that dual K7 2.0GHz is less than half the price of a dual P4
>>>>>>>>>    2.8GHz
>>>>>>>>
>>>>>>>>There are no dual PIVs at the moment. Only dual Xeons.
>>>>>>>>Xeons are _not_ PIVs... for several reasons that can be found on
>>>>>>>>the Intel web site. That's why
>>>>>>>
>>>>>>>They also fit in different slots, for example: 603 for the Xeon and
>>>>>>>478 or something for the P4.
>>>>>>
>>>>>>The Xeon has 603- and 604-pin sockets to separate them.
>>>>>>
>>>>>>>
>>>>>>>>Xeons are considered to be their "server class chips" while the
>>>>>>>>PIV is their "desktop class chip".
>>>>>>>
>>>>>>>The core is the same, however. So a 3.06GHz Xeon, when it gets
>>>>>>>released and is put single cpu in a mainboard, won't be faster than
>>>>>>>a P4 3.06GHz.
>>>>>>
>>>>>>It probably will, as the Xeons use a different chipset which can
>>>>>>support interleaving where the desktop chipsets don't.
>>>>>>
>>>>>>>
>>>>>>>With some luck, by the time they release a 3.06GHz Xeon they will
>>>>>>>have improved the SMT another bit.
>>>>>>>
>>>>>>>Seems to me they have been working for years to get that SMT/HT
>>>>>>>slowly working better.
>>>>>>
>>>>>>Not "for years". It was announced as a coming thing a couple of
>>>>>>years ago, and several vendors have been discussing the idea. And
>>>>>>they are going to increase the ratio of physical to logical cpus
>>>>>>before long also...
>>>>>>
>>>>>>>>
>>>>>>>>>Best regards,
>>>>>>>>>Vincent
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.