Computer Chess Club Archives



Subject: Re: Is Xeon 5000 (Dempsey) with FB-DIMMs faster then Opteron 280 ?

Author: Vincent Diepeveen

Date: 18:14:46 11/09/05



On November 09, 2005 at 20:09:36, gerold daniels wrote:

>On November 09, 2005 at 13:48:13, Robert Hyatt wrote:
>
>>On November 08, 2005 at 22:25:15, Vincent Diepeveen wrote:
>>
>>>On November 08, 2005 at 16:23:22, Gerd Isenberg wrote:
>>>
>>>>On November 08, 2005 at 15:26:45, Vincent Diepeveen wrote:
>>>>
>>>>>On November 08, 2005 at 13:17:42, Yar wrote:
>>>>>
>>>>>>Hello,
>>>>>>
>>>>>>Here is a review (14 pages total) of Intel's upcoming Xeon 5000 (Dempsey). Sorry,
>>>>>>it's only in German. It seems it's faster than the Opteron 280.
>>>>>>http://www.tecchannel.de/server/hardware/432957/
>>>>>>
>>>>>>With best regards,
>>>>>>
>>>>>>Yar
>>>>>
>>>>>It should be a fast CPU, that Dempsey. However, that Xeon will be there in January
>>>>>2007 or so, and it will have a price of, I guess, around 5000 euro per CPU in the
>>>>>quad version, if you can get it for that, as you'll probably have to buy
>>>>>1000 at a time to get them for around 4500 dollars apiece.
>>>>>
>>>>>So effectively a dual-core quad Xeon will cost around $40k in January 2007.
>>>>>
>>>>>By that time, of course, a quad-core quad Opteron will be nearly 2 times faster
>>>>>and exactly 2 times cheaper.
>>>>>
>>>>>Please note that it's not certain whether the IPC of the Intel Pentium-M, at such
>>>>>high clock speeds and dual core, will be better than AMD's. I'm counting on it
>>>>>being a lot slower, because in order to clock the Pentium-M higher, Intel
>>>>>will need to make the pipeline longer and will probably move from a 2-cycle L1
>>>>>to a 3-cycle L1, in which case the processor is similar to the Opteron from a
>>>>>chess-programming viewpoint.
>>>>>
>>>>>Of course the Xeons have bigger L2 or even L3 caches on chip than AMD. That's
>>>>>nice for certain applications that appear in benchmarks, but in real life it's not
>>>>>a huge advantage.
>>>>>
>>>>>A few MB is plenty for computer chess at the moment.
>>>>>
>>>>>On the other hand, could you tell me whether this Xeon has an on-die memory
>>>>>controller or not?
>>>>>
>>>>>Because *that* matters a lot. Hash tables are a matter of TLB-thrashing memory
>>>>>latencies to a big hash table. With 64-bit CPUs and the clock that keeps
>>>>>ticking, RAM sizes will increase too, meaning that the latencies you lose to
>>>>>TLB thrashing (transposition table and eval table; not so much the pawn table,
>>>>>as that will be in L2 cache for the majority of accesses) are significant.
>>>>>
>>>>>If Intel plans to do that via some sort of off-chip chipset, then that is a huge
>>>>>drawback of this Xeon CPU for databases and chess. In database benchmarks using
>>>>>some small database they can get away with a big L2/L3, but in real life
>>>>>there is no escape. It's just dead slow.
>>>>>
>>>>>So I do look forward to the Pentium-M, but the price at which Intel usually sells
>>>>>good CPUs doesn't mean that we will see more quads online.
>>>>>
>>>>>Vincent
>>>>
>>>>
>>>>Yes, memory latency seems worse.
>>>>
>>>>OTOH Intel has more than two times better bandwidth using 128-bit SSE2/3
>>>>load/store instructions, which is of course not so important for computer chess.
>>>>
>>>>Cache/memory: 128-bit transfers
>>>>Bandwidth in MByte/s
>>>>
>>>>           Dempsey Paxville Opteron 280
>>>>L1          47340    41444    18360
>>>>L2          24928    22105     9448
>>>>Memory       3606     4127     3316
>>>
>>>That's of course just paper.
>>>
>>>First of all, on a quad machine, 8 Intel cores must share 3 GB/s of memory
>>>bandwidth, and that is *theoretical* bandwidth.
>>>
>>>Whereas 8 cores on a quad Opteron have 4 memory controllers. So that's a
>>>factor-of-4 advantage to the Opteron in memory bandwidth.
>>>
>>>I hadn't read the L1/L2 cache bandwidth specs of the Intel chips.
>>>
>>>May I remind you that they had similarly grand predictions for the P4 in the
>>>past. It would have a 2-cycle L1 cache, blah blah.
>>>
>>>Prescott actually has a 4-cycle L1 cache.
>>>
>>>The P4 would execute 4 instructions a cycle, because of having 2 double-clocked
>>>integer units.
>>>
>>>In practice it is limited to 3 instructions a cycle, and almost no one
>>>can reach even that, thanks to other limitations.
>>>
>>>We should all use CMOV constructs, says Intel, to avoid branch mispredictions.
>>>
>>>Actually, their own compiler doesn't generate them when using P4 switches,
>>>because on Prescott a CMOV costs at least a 7-cycle penalty, versus 2 for AMD.
>>>
>>>So you can quote anything on paper here. The reality can be expressed in money
>>>very easily.
>>>
>>>Namely, those Xeon chips can never compete on price against quad-core
>>>Opterons, which will be on the market long before the DUAL-core Xeon is
>>>there.
>>>
>>>How can 4 AMD cores ever be slower than 2 from Intel?
>>>
>>>If you plan to stream, for example, SSE2 to processors executing all kinds of
>>>code, then obviously 4 AMD cores will always beat 2 Intel cores.
>>>
>>>Especially if the AMD parts have already been running for months while the Intels
>>>are still a paper sheet in the factory.
>>>
>>>Please realize that in terms of bandwidth for GFLOP calculations, memory is the
>>>bottleneck. If 4 Intel CPUs (8 cores) can get at most 3 gigabytes a second,
>>>then obviously AMD will always win when it can stream 12 gigabytes a second
>>>to them.
>>>
>>>That on paper Intel can receive 3.6 GB/s and AMD 3.3 GB/s is not really
>>>relevant.
>>>
>>>It's 1 memory controller for Intel, versus 4 for AMD.
>>
>>
>>That is if you have four processors.  But the dual cores share one
>>controller, and the dual cores most definitely compete for that one
>>HyperTransport interface too, since the multiple cache controllers and processor
>>cores place a high demand on a single path (per chip, not per core) to memory.
>>And then there is the issue of NUMA memory, which also reduces that high
>>theoretical AMD bandwidth significantly...
>>
>>The dual-core chips are _not_ as good as two single-core chips, based on lots of
>>benchmarking.  They are very good, don't get me wrong, but the shared
>>HyperTransport means 2x the traffic through one external interface, which can
>>and does produce a bottleneck.
>>
>
>Thanks for clearing this up Robert.

It's dead wrong of course.

HyperTransport isn't the bottleneck at all, as it delivers a comfortable
14.4 GB/s per channel.

The latency of the dual cores is a lot worse, however. If TLB-thrashing latency
is a problem for your software, then dual cores will not give an exact 2-fold
speedup, but more like 1.90-1.99 or so.

In the scaling graph for Diep, based upon 200+ positions and 95% statistical
confidence, the scaling of Diep on a dual-core quad Opteron was 7.475.

That is a loss per core of less than 7%.

However, there is already a loss when simply running on a quad.

I argue that the loss on a dual-core quad is less than or equal to that of a
plain 8-way single-core machine.

Please keep in mind that a latency of 234 nanoseconds on a dual-core quad (when
all 8 cores simultaneously fetch 8 bytes from a random spot in a 2 GB buffer)
is still a lot better than on an 8-way Intel Xeon, which has a latency of well
over 700 ns, up to 900 ns, for 2 GB.

Vincent

>>
>>>
>>>Now, for games that are multithreaded and for SSE2 calculations as in all kinds
>>>of graphics and such, that memory performance is a big performance hit.
>>>
>>>Additionally, L1 cache bandwidth for chess will be dominated by the LATENCY
>>>of getting a single doubleword out of L1 and by the number of reads you can
>>>do there simultaneously.
>>>
>>>I remember Intel's optimistic specs from the past. They were not true.
>>>
>>>What will be the Achilles' heels this time?
>>>
>>>If there aren't any, it's a killer CPU for software that doesn't need
>>>RAM!
>>>
>>>If there are Achilles' heels again, then Intel has a major problem.
>>>
>>>But I do realize the price of those CPUs. Just look at the size of the L2
>>>cache!
>>>
>>>What was it, 16 MB or something?
>>>
>>>That's not going to be CHEAP.
>>
>>Would not speculate there.  As FAB sizes go down, transistor count goes up, with
>>no increase in cost at all.  It used to be "how can we squeeze all this stuff
>>(L1/L2/floating point/multiple pipes/etc) into this small number of
>>transistors?"  It is now more of "what on earth can we use all these transistors
>>for?"  At 6 transistors per bit for SRAM, a megabyte requires 48M transistors,
>>which is chickenfeed...
>>
>>
>>
>>>
>>>So whatever its performance, it won't be able to compete against AMD in that
>>>sense.
>>>
>>>You wonder about SSE2 here. Well, let me ask you: how many SSE2 execution units
>>>does it have?
>>>
>>>We know AMD has 2.
>>>The P4 has 1.
>>>
>>>AMD completely outperforms the P4 there.
>>>
>>>Why would this be different for a Pentium-M on steroids? Can you give some
>>>explanation?
>>>
>>>Let me give counter arguments.
>>>
>>>a) it will have an utterly TINY L1 cache
>>>b) it will SHARE the L2 cache, so it has a DEAD SLOW L2 cache in terms of
>>>latency.
>>
>>Shared L2 is not necessarily bad.  On AMD the MOESI traffic can get _very_ high
>>if the two processor cores are modifying data that is shared...
>>
>>
>>>c) Intel has the habit of trying to get away with a very cheap L1 cache too,
>>>with just 1 port. AMD already had 2 ports with the K7, and so does the K8.
>>>
>>>What will Intel do this time to keep this CPU cheap to produce, while
>>>asking golden coins when you buy it?
>>>
>>>Who knows, perhaps intel has some good cpu now?
>>>
>>>Let's hope so.
>>>
>>>>Also, general SSE performance is much better on the future Intels.
>>>>Hopefully some motivation for AMD to work on 128-bit ALUs ;-)
>>>>Gerd




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.