Author: Vincent Diepeveen
Date: 18:14:46 11/09/05
On November 09, 2005 at 20:09:36, gerold daniels wrote:

>On November 09, 2005 at 13:48:13, Robert Hyatt wrote:
>
>>On November 08, 2005 at 22:25:15, Vincent Diepeveen wrote:
>>
>>>On November 08, 2005 at 16:23:22, Gerd Isenberg wrote:
>>>
>>>>On November 08, 2005 at 15:26:45, Vincent Diepeveen wrote:
>>>>
>>>>>On November 08, 2005 at 13:17:42, Yar wrote:
>>>>>
>>>>>>Hello,
>>>>>>
>>>>>>Here is review (total 14 pages) of upcoming Intel's Xeon 5000 (Dempsey). Sorry
>>>>>>its only in german. It seems its faster then Opeteron 280.
>>>>>>http://www.tecchannel.de/server/hardware/432957/
>>>>>>
>>>>>>With best regards,
>>>>>>
>>>>>>Yar
>>>>>
>>>>>It should be a fast cpu that Dempsey. However that Xeon will be there januari
>>>>>2007 or so and it will have a price of i guess around 5000 euro a cpu in the
>>>>>quad version, if you can get it for that, as you'll have to buy probably
>>>>>1000 at a time to get them for around 4500 dollar a piece.
>>>>>
>>>>>So effectively a quad xeon dual core will be januari 2007 around $40k.
>>>>>
>>>>>By that time of course a quad opteron quad core is nearly 2 times faster
>>>>>and exactly 2 times cheaper.
>>>>>
>>>>>Please note that it's not sure whether the IPC from the intel pentium-m at such
>>>>>high clockspeeds and dual core will be better than from AMD. I'm counting at it
>>>>>that it will be a lot slower, because in order to clock pentium-m higher, intel
>>>>>will need to make the pipeline longer and will probably move from a 2 cycle L1
>>>>>to a 3 cycle L1. In which case the processor is similar to the opteron from
>>>>>chessprogramming viewpoint.
>>>>>
>>>>>Of course the Xeons have bigger L2 or even L3 caches on chip than AMD. That's
>>>>>nice for certain applications that are in benchmarks, but in reallife it's not a
>>>>>huge advantage.
>>>>>
>>>>>A few MB's is plenty for computerchess at the moment.
>>>>>
>>>>>On the other hand, could you tell me whether this Xeon has an on die memory
>>>>>controller or doesn't it have one?
>>>>>
>>>>>Because *that* matters a lot. Hashtables is a matter of TLB trashing memory
>>>>>latencies to a big hashtable. With 64 bits cpu's and the clock that keeps
>>>>>ticking, the RAM sizes will increase too, meaning that the latencies you lose to
>>>>>TLB trashing (transpositiontable , eval table, not so much pawntable as that'll
>>>>>be in L2 cache for majority of accesses) are significant.
>>>>>
>>>>>If intel plans to do that via some sort of chipset off chip, then that is a huge
>>>>>drawback of this Xeon cpu for databases and chess. At database benchmarks, using
>>>>>some small database they can get away with a big L2/L3 then, but in real life
>>>>>there is no escape there. It's just dead slow.
>>>>>
>>>>>So i do look forward to pentium-m, but the price at which intel usually sells
>>>>>good cpu's doesn't mean that we will see more quads online.
>>>>>
>>>>>Vincent
>>>>
>>>>
>>>>Yes, memory latency seems worse.
>>>>
>>>>OTOH intel has more than two times better bandwith using 128-bit SSE2/3
>>>>load/store instructions, which is of course not so important for cumputer chess.
>>>>
>>>>Cache/memory: 128-bit transfer
>>>>Bandwidth in MByte/s
>>>>
>>>>         Dempsey  Paxville  Opteron 280
>>>>L1         47340     41444        18360
>>>>L2         24928     22105         9448
>>>>Memory      3606      4127         3316
>>>
>>>That's of course just paper.
>>>
>>>First of all at a quad machine, 8 cores at intel must share 3GB memory
>>>bandwidth, which is *theoretic* bandwidth.
>>>
>>>This where 8 cores at quad opteron have 4 memory controllers. So that's a factor
>>>4 advantage to opteron there in memory bandwidth.
>>>
>>>I didn't read bandwidth specs from L1&L2 cache of the intel chips.
>>>
>>>May i remind you that they had similar big heaven predictions for the P4 in the
>>>past. It would have a 2 cycle L1 cache bla bla.
>>>
>>>Prescott actually has a 4 cycle L1 cache.
>>>
>>>P4 would execute 4 instructions a cycle, because of having 2 doubled clocked
>>>integer units.
>>>
>>>Its practical limitations actually limit it to 3 instructions a cycle, and
>>>nearly no one can get that, thanks to other limitations.
>>>
>>>We should all use CMOV constructs says intel, to avoid branch mispredictions.
>>>
>>>Actually their own compiler doesn't generate them when using P4 switches,
>>>because at prescott a CMOV is at least 7 cycles penalty, versus 2 for AMD.
>>>
>>>So you can quote anything on paper here. The reality can be expressed in money
>>>very easily.
>>>
>>>That's that those Xeon chips can never compete in terms of price against quad
>>>core opterons, which will be on the market long before the DUAL core Xeon is
>>>there.
>>>
>>>How can 4 cores of AMD ever be slower than 2 from intel.
>>>
>>>If you plan to stream for example SSE2 to processors executing all kind of code,
>>>then obviously 4 cores of AMD always will win from 2 cores of intel.
>>>
>>>Especially if the AMD ones can run already for months when the intels still are
>>>in the factory on a paper sheet.
>>>
>>>Please realize in terms of bandwidth for gflop calculations that memory is the
>>>bottleneck. If 4 cpu's (8 cores) from intel can get at most 3 gigabyte a second,
>>>then obviously AMD will always win when they can stream 12 gigabyte a second to
>>>it.
>>>
>>>When on paper intel can receive 3.6GB and AMD on paper can receive 3.3GB a
>>>second, that's not real relevant.
>>>
>>>It's 1 memory controller for Intel, versus 4 for AMD.
>>
>>
>>That is if you have four processors. But the dual cores are sharing one
>>controller, and the dual cores most definitely compete for that one
>>hypertransport interface also since the multiple cache controllers and processor
>>cores place a high demand on a single path (per chip, not per core) to memory.
>>And then there is the issue of NUMA memory, which also reduces that high
>>theoretical AMD bandwidth significantly...
>>
>>The dual-core chips are _not_ as good as two single-core chips, based on lots of
>>benchmarking.
>>They are very good, don't get me wrong, but the shared
>>hypertransport means 2x the traffic thru one external interface, which can and
>>does produce a bottleneck..
>>
>
>Thanks for clearing this up Robert.

It's dead wrong of course. HyperTransport isn't the bottleneck at all, as it delivers hands down 14.4GB/s a channel.

The latency of the dual cores is a lot worse, however. If TLB thrashing latency is a problem for your software, then dual cores will not give a precise 2 fold speedup but more like 1.90-1.99 or so.

In the scaling graph for Diep, based upon 200+ positions and a statistical significance of 95%, the scaling of Diep on a quad Opteron dual core was 7.475.

That gives a loss per cpu of less than 7%. However, when simply running a quad there is already a loss.

I argue that the loss of a quad dual core is less than or equal to that of a plain 8 fold single core machine.

Please keep in mind that a latency of 234 nanoseconds on a quad dual core (when all 8 cores simultaneously ask for 8 bytes from a random spot in a 2GB buffer) is still a lot better than on an 8 fold intel Xeon, which has a latency of far over 700 ns, up to 900 ns, for 2GB.

Vincent

>>
>>>
>>>Now for games that are multithreaded and SSE2 calculations like in all kind of
>>>graphics and such, that memory perfomance is a big performance hit.
>>>
>>>Additional, bandwidth in L1 cache for chess will be dominated by the LATENCY
>>>that getting a single doubleword out of L1 eats and the number of reads you can
>>>do simultaneously there.
>>>
>>>I remember the optimistic specs from the past from intel. They were not true.
>>>
>>>What will be the achillesheels this time?
>>>
>>>If there isn't, it's a killer cpu in that case for software that doesn't need
>>>RAM!
>>>
>>>If there is again achillesheels, intel has a major problem then.
>>>
>>>But i do realize the price of those cpu's. Just look to the size of the L2
>>>cache!
>>>
>>>What was it 16MB or something?
>>>
>>>That's not gonna be CHEAP.
>>
>>Would not speculate there. As FAB sizes go down, transistor count goes up, with
>>no increase in cost at all. It used to be "how can we squeeze all this stuff
>>(L1/L2/floating point/multiple pipes/etc) into this small number of
>>transistors?" It is now more of "what on earth can we use all these transistors
>>for. At 6 transistors per bit for SRAM, a megabyte requires 6M transistors,
>>which is chickenfeed...
>>
>>
>>
>>>
>>>So whatever its performance, it won't be able to compete against AMD in that
>>>sense.
>>>
>>>You wonder about SSE2 here. Well let me ask you, how many SSE2 execution units
>>>does it have?
>>>
>>>We know AMD has 2.
>>>P4 has 1.
>>>
>>>AMD completely outperforms P4 there.
>>>
>>>Why would this be different at a pentium-m at stereoids, can you give some
>>>explanations?
>>>
>>>Let me give counter arguments.
>>>
>>>a) it will have an utmost TINY L1 cache
>>>b) it will SHARE the L2 cache, so it has a DEAD SLOW L2 cache in terms of
>>>latency.
>>
>>Shared L2 is not necessarily bad. On AMD the MOESI traffic can get _very_ high
>>if the two processor cores are modifying data that is shared...
>>
>>
>>>c) intel has the habit to try to get away with a very cheap L1 cache too, and
>>>just make 1 port in it. AMD had aready at K7 2 ports and so has K8.
>>>
>>>What will intel do this time to keep this cpu a cheap cpu to produce, meanwhile
>>>asking golden coins when you buy it?
>>>
>>>Who knows, perhaps intel has some good cpu now?
>>>
>>>Let's hope so.
>>>
>>>>Also general SSE-performance is much better for the future intels.
>>>>Hopefully some motivation for amd to work on 128-bit alus ;-)
>>>>Gerd
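
For anyone who wants to check the scaling arithmetic in the reply (7.475 on 8 cores, and 1.90-1.99 for one dual core chip), here is a minimal sketch. It assumes "loss" simply means the shortfall against perfect linear speedup; the function names are made up for illustration and have nothing to do with Diep itself.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def efficiency_loss(speedup: float, cores: int) -> float:
    """Shortfall relative to perfect scaling, as a fraction of the ideal."""
    return 1.0 - parallel_efficiency(speedup, cores)


# Figure from the post: speedup of 7.475 on a quad dual-core Opteron (8 cores).
quad_dual_core_loss = efficiency_loss(7.475, 8)

# Range from the post for one dual-core chip versus two single cores.
dual_core_losses = [efficiency_loss(s, 2) for s in (1.90, 1.99)]

print(f"quad dual core: {quad_dual_core_loss:.1%} below linear")
print("dual core range: " + ", ".join(f"{x:.1%}" for x in dual_core_losses))
```

Under that reading, 7.475 out of 8 is a shortfall of about 6.6%, which agrees with the "less than 7%" figure, and the 1.90-1.99 range corresponds to a shortfall between 5% and 0.5% for a single dual core chip.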