Author: enrico carrisco
Date: 17:36:22 11/10/05
On November 10, 2005 at 00:28:24, Robert Hyatt wrote:

>On November 09, 2005 at 21:14:46, Vincent Diepeveen wrote:
>
>>On November 09, 2005 at 20:09:36, gerold daniels wrote:
>>
>>>On November 09, 2005 at 13:48:13, Robert Hyatt wrote:
>>>
>>>>On November 08, 2005 at 22:25:15, Vincent Diepeveen wrote:
>>>>
>>>>>On November 08, 2005 at 16:23:22, Gerd Isenberg wrote:
>>>>>
>>>>>>On November 08, 2005 at 15:26:45, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On November 08, 2005 at 13:17:42, Yar wrote:
>>>>>>>
>>>>>>>>Hello,
>>>>>>>>
>>>>>>>>Here is a review (14 pages in total) of Intel's upcoming Xeon 5000 (Dempsey).
>>>>>>>>Sorry, it's only in German. It seems it's faster than the Opteron 280.
>>>>>>>>http://www.tecchannel.de/server/hardware/432957/
>>>>>>>>
>>>>>>>>With best regards,
>>>>>>>>
>>>>>>>>Yar
>>>>>>>
>>>>>>>It should be a fast cpu that Dempsey. However that Xeon will be there January
>>>>>>>2007 or so and it will have a price of i guess around 5000 euro a cpu in the
>>>>>>>quad version, if you can get it for that, as you'll have to buy probably
>>>>>>>1000 at a time to get them for around 4500 dollar a piece.
>>>>>>>
>>>>>>>So effectively a quad xeon dual core will be January 2007 around $40k.
>>>>>>>
>>>>>>>By that time of course a quad opteron quad core is nearly 2 times faster
>>>>>>>and exactly 2 times cheaper.
>>>>>>>
>>>>>>>Please note that it's not sure whether the IPC from the intel pentium-m at such
>>>>>>>high clockspeeds and dual core will be better than from AMD. I'm counting on it
>>>>>>>that it will be a lot slower, because in order to clock pentium-m higher, intel
>>>>>>>will need to make the pipeline longer and will probably move from a 2 cycle L1
>>>>>>>to a 3 cycle L1. In which case the processor is similar to the opteron from
>>>>>>>chessprogramming viewpoint.
>>>>>>>
>>>>>>>Of course the Xeons have bigger L2 or even L3 caches on chip than AMD.
>>>>>>>That's nice for certain applications that are in benchmarks, but in real life
>>>>>>>it's not a huge advantage.
>>>>>>>
>>>>>>>A few MB's is plenty for computerchess at the moment.
>>>>>>>
>>>>>>>On the other hand, could you tell me whether this Xeon has an on die memory
>>>>>>>controller or doesn't it have one?
>>>>>>>
>>>>>>>Because *that* matters a lot. Hashtables is a matter of TLB thrashing memory
>>>>>>>latencies to a big hashtable. With 64 bits cpu's and the clock that keeps
>>>>>>>ticking, the RAM sizes will increase too, meaning that the latencies you lose to
>>>>>>>TLB thrashing (transposition table, eval table, not so much pawntable as that'll
>>>>>>>be in L2 cache for majority of accesses) are significant.
>>>>>>>
>>>>>>>If intel plans to do that via some sort of chipset off chip, then that is a huge
>>>>>>>drawback of this Xeon cpu for databases and chess. At database benchmarks, using
>>>>>>>some small database they can get away with a big L2/L3 then, but in real life
>>>>>>>there is no escape there. It's just dead slow.
>>>>>>>
>>>>>>>So i do look forward to pentium-m, but the price at which intel usually sells
>>>>>>>good cpu's doesn't mean that we will see more quads online.
>>>>>>>
>>>>>>>Vincent
>>>>>>
>>>>>>
>>>>>>Yes, memory latency seems worse.
>>>>>>
>>>>>>OTOH intel has more than two times better bandwidth using 128-bit SSE2/3
>>>>>>load/store instructions, which is of course not so important for computer chess.
>>>>>>
>>>>>>Cache/Memory: 128-bit transfer
>>>>>>Bandwidth in MByte/s
>>>>>>
>>>>>>          Dempsey   Paxville   Opteron 280
>>>>>>L1          47340      41444         18360
>>>>>>L2          24928      22105          9448
>>>>>>Memory       3606       4127          3316
>>>>>
>>>>>That's of course just paper.
>>>>>
>>>>>First of all at a quad machine, 8 cores at intel must share 3GB memory
>>>>>bandwidth, which is *theoretic* bandwidth.
>>>>>
>>>>>This where 8 cores at quad opteron have 4 memory controllers. So that's a factor
>>>>>4 advantage to opteron there in memory bandwidth.
>>>>>
>>>>>I didn't read bandwidth specs from L1&L2 cache of the intel chips.
>>>>>
>>>>>May i remind you that they had similar big heaven predictions for the P4 in the
>>>>>past. It would have a 2 cycle L1 cache bla bla.
>>>>>
>>>>>Prescott actually has a 4 cycle L1 cache.
>>>>>
>>>>>P4 would execute 4 instructions a cycle, because of having 2 double-clocked
>>>>>integer units.
>>>>>
>>>>>Its practical limitations actually limit it to 3 instructions a cycle, and
>>>>>nearly no one can get that, thanks to other limitations.
>>>>>
>>>>>We should all use CMOV constructs says intel, to avoid branch mispredictions.
>>>>>
>>>>>Actually their own compiler doesn't generate them when using P4 switches,
>>>>>because at prescott a CMOV is at least 7 cycles penalty, versus 2 for AMD.
>>>>>
>>>>>So you can quote anything on paper here. The reality can be expressed in money
>>>>>very easily.
>>>>>
>>>>>That's that those Xeon chips can never compete in terms of price against quad
>>>>>core opterons, which will be on the market long before the DUAL core Xeon is
>>>>>there.
>>>>>
>>>>>How can 4 cores of AMD ever be slower than 2 from intel?
>>>>>
>>>>>If you plan to stream for example SSE2 to processors executing all kind of code,
>>>>>then obviously 4 cores of AMD always will win from 2 cores of intel.
>>>>>
>>>>>Especially if the AMD ones can run already for months when the intels still are
>>>>>in the factory on a paper sheet.
>>>>>
>>>>>Please realize in terms of bandwidth for gflop calculations that memory is the
>>>>>bottleneck. If 4 cpu's (8 cores) from intel can get at most 3 gigabyte a second,
>>>>>then obviously AMD will always win when they can stream 12 gigabyte a second to
>>>>>it.
>>>>>
>>>>>When on paper intel can receive 3.6GB and AMD on paper can receive 3.3GB a
>>>>>second, that's not really relevant.
>>>>>
>>>>>It's 1 memory controller for Intel, versus 4 for AMD.
>>>>
>>>>
>>>>That is if you have four processors.
>>>>But the dual cores are sharing one
>>>>controller, and the dual cores most definitely compete for that one
>>>>hypertransport interface also since the multiple cache controllers and processor
>>>>cores place a high demand on a single path (per chip, not per core) to memory.
>>>>And then there is the issue of NUMA memory, which also reduces that high
>>>>theoretical AMD bandwidth significantly...
>>>>
>>>>The dual-core chips are _not_ as good as two single-core chips, based on lots of
>>>>benchmarking. They are very good, don't get me wrong, but the shared
>>>>hypertransport means 2x the traffic thru one external interface, which can and
>>>>does produce a bottleneck..
>>>>
>>>
>>>Thanks for clearing this up Robert.
>>
>>It's dead wrong of course.
>
>It isn't "dead wrong".
>
>AMD is aware of "issues" with dual cores with dual caches, with one HT
>interface. The quad-cores will have a larger issue.

Explain the "issues" -- how does this affect computer chess, whatsoever? Crafty
(and any other program) would be lucky to use even 1/3 of the HT pipe.

>
>
>>
>>hypertransport isn't the bottleneck at all as that delivers hands down 14.4GB/s
>>a channel.
>
>There is more to this than "bandwidth". There is "latency" and there are
>"conflicts"
>
>>
>>The latency from the dual cores is a lot worse however. If TLB thrashing latency
>>is a problem for your software, then dual cores will not give a precise 2 fold
>>speedup but more like 1.90-1.99 or so.
>>
>>In the scaling graph from diep, based upon 200+ positions and a statistical
>>significance (95%) the scaling of diep at a quad opteron dual core was 7.475
>
>That was almost exactly what I got for Crafty for WCCC-level settings. I
>actually produced 8.0X (or very close to it) NPS numbers, but those settings
>internally were less efficient from the parallel search overhead and I didn't
>play using them.
>
>>
>>That gives a loss a cpu of less than 7%.
>>
>>However when simply running quad there is already a loss.
>
>I got almost perfect scaling on 4-way and 8-way single core systems. I had some
>interesting issues to work around for dual core boxes..

Again, what "issues"? I agree that sharing the HT could become an issue with
"certain applications" -- but not computer chess. The only other application I
could think of is some of the intense gaming software but, to my knowledge, none
of those are even SMP capable yet.

-elc.

>
>
>
>
>>
>>I argue that the loss of a quad dual core is less or equal to a plain 8 fold
>>single core machine.
>
>That I know is not true as I have run on both. The 8 single cores is measurably
>better, but again, Crafty was my only benchmark.
>
>
>
>
>>
>>Please keep in mind that a latency of 234 nanoseconds at a quad dual core (when
>>all 8 cores simultaneously ask 8 bytes from a random spot out of a 2GB buffer)
>>is still a lot better than at a 8 fold intel Xeon which has a latency of far
>>over 700 ns up to 900 ns for 2GB.
>
>
>I'm not going to debate intel again. AMD can be significantly worse than 234.
>Their page tables require 4 memory reads when a TLB miss occurs. Intel only
>needs 2. And if you use large pages, both trim one memory read off that which
>changes the numbers quite a bit. But on AMD, with TLB thrashing, you are going
>to do 5 reads every time to translate the virtual to a real address and then
>read the actual data. With big pages, that becomes 4. With Intel it becomes 2.
>Those are not that far apart overall.
>
>Problem with Intel will be the non-NUMA memory that requires interleaving, and I
>doubt they are going beyond 4-way interleaving, leaving 8 cpus to starve for
>data...
>
>Also when you start sharing data, those remote caches become yet another
>bottleneck, because they are all talking to each other very frequently to handle
>the "owner state" in MOESI. This can become another issue if great care is not
>taken in sharing data responsibly...
>
>
>
>>
>>Vincent
>>
>>>>
>>>>>
>>>>>Now for games that are multithreaded and SSE2 calculations like in all kind of
>>>>>graphics and such, that memory performance is a big performance hit.
>>>>>
>>>>>Additionally, bandwidth in L1 cache for chess will be dominated by the LATENCY
>>>>>that getting a single doubleword out of L1 eats and the number of reads you can
>>>>>do simultaneously there.
>>>>>
>>>>>I remember the optimistic specs from the past from intel. They were not true.
>>>>>
>>>>>What will be the Achilles' heel this time?
>>>>>
>>>>>If there isn't one, it's a killer cpu in that case for software that doesn't need
>>>>>RAM!
>>>>>
>>>>>If there is again an Achilles' heel, intel has a major problem then.
>>>>>
>>>>>But i do realize the price of those cpu's. Just look to the size of the L2
>>>>>cache!
>>>>>
>>>>>What was it 16MB or something?
>>>>>
>>>>>That's not gonna be CHEAP.
>>>>
>>>>Would not speculate there. As FAB sizes go down, transistor count goes up, with
>>>>no increase in cost at all. It used to be "how can we squeeze all this stuff
>>>>(L1/L2/floating point/multiple pipes/etc) into this small number of
>>>>transistors?" It is now more of "what on earth can we use all these transistors
>>>>for?" At 6 transistors per bit for SRAM, a megabyte requires 6M transistors,
>>>>which is chickenfeed...
>>>>
>>>>
>>>>
>>>>>
>>>>>So whatever its performance, it won't be able to compete against AMD in that
>>>>>sense.
>>>>>
>>>>>You wonder about SSE2 here. Well let me ask you, how many SSE2 execution units
>>>>>does it have?
>>>>>
>>>>>We know AMD has 2.
>>>>>P4 has 1.
>>>>>
>>>>>AMD completely outperforms P4 there.
>>>>>
>>>>>Why would this be different at a pentium-m on steroids, can you give some
>>>>>explanations?
>>>>>
>>>>>Let me give counter arguments.
>>>>>
>>>>>a) it will have an utmost TINY L1 cache
>>>>>b) it will SHARE the L2 cache, so it has a DEAD SLOW L2 cache in terms of
>>>>>latency.
>>>>
>>>>Shared L2 is not necessarily bad.
>>>>On AMD the MOESI traffic can get _very_ high
>>>>if the two processor cores are modifying data that is shared...
>>>>
>>>>
>>>>>c) intel has the habit to try to get away with a very cheap L1 cache too, and
>>>>>just make 1 port in it. AMD had already at K7 2 ports and so has K8.
>>>>>
>>>>>What will intel do this time to keep this cpu a cheap cpu to produce, meanwhile
>>>>>asking golden coins when you buy it?
>>>>>
>>>>>Who knows, perhaps intel has some good cpu now?
>>>>>
>>>>>Let's hope so.
>>>>>
>>>>>>Also general SSE-performance is much better for the future intels.
>>>>>>Hopefully some motivation for amd to work on 128-bit alus ;-)
>>>>>>Gerd
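
For anyone who wants to check the arithmetic being thrown around in this thread, here is a rough Python sketch. It only replays figures quoted above (7.475x speedup on 8 cores, 4 versus 2 page-table reads per TLB miss, 3 GB/s versus 12 GB/s aggregate memory bandwidth); it measures nothing, and the function names are made up for illustration.

```python
# Back-of-the-envelope checks for figures quoted in this thread.
# All function names are hypothetical; the inputs are the posters' numbers,
# not new measurements.


def parallel_loss(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup that is lost."""
    return 1.0 - speedup / cores


def reads_per_tlb_miss(walk_reads: int, large_pages: bool = False) -> int:
    """Memory reads for one access that misses the TLB.

    walk_reads is the page-table walk length claimed in the thread
    (4 for AMD, 2 for Intel). Large pages trim one read off the walk,
    and the final +1 is the read of the data itself.
    """
    walk = walk_reads - 1 if large_pages else walk_reads
    return walk + 1


def bandwidth_per_core(total_gb_s: float, cores: int) -> float:
    """Even split of aggregate memory bandwidth across all cores."""
    return total_gb_s / cores


if __name__ == "__main__":
    # Diep on a quad dual-core Opteron: 7.475x on 8 cores.
    print(f"scaling loss: {parallel_loss(7.475, 8):.1%}")
    # Reads per TLB-missing access, small pages then large pages.
    print("AMD reads:  ", reads_per_tlb_miss(4), reads_per_tlb_miss(4, True))
    print("Intel reads:", reads_per_tlb_miss(2), reads_per_tlb_miss(2, True))
    # One shared controller (3 GB/s) versus four controllers (12 GB/s), 8 cores.
    print("GB/s per core:", bandwidth_per_core(3, 8), "vs", bandwidth_per_core(12, 8))
```

On those inputs the scaling loss comes out a little under 7%, and the read counts match the 5, 4 and 2 quoted above. The even bandwidth split is a simplification: it ignores NUMA placement and HT contention, which is exactly what is being argued about.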
Last modified: Thu, 15 Apr 21 08:11:13 -0700