Computer Chess Club Archives



Subject: Re: Is Xeon 5000 (Dempsey) with FB-DIMMs faster than Opteron 280?

Author: Robert Hyatt

Date: 21:28:24 11/09/05



On November 09, 2005 at 21:14:46, Vincent Diepeveen wrote:

>On November 09, 2005 at 20:09:36, gerold daniels wrote:
>
>>On November 09, 2005 at 13:48:13, Robert Hyatt wrote:
>>
>>>On November 08, 2005 at 22:25:15, Vincent Diepeveen wrote:
>>>
>>>>On November 08, 2005 at 16:23:22, Gerd Isenberg wrote:
>>>>
>>>>>On November 08, 2005 at 15:26:45, Vincent Diepeveen wrote:
>>>>>
>>>>>>On November 08, 2005 at 13:17:42, Yar wrote:
>>>>>>
>>>>>>>Hello,
>>>>>>>
>>>>>>>Here is a review (14 pages in total) of Intel's upcoming Xeon 5000 (Dempsey).
>>>>>>>Sorry, it's only in German. It seems it's faster than the Opteron 280.
>>>>>>>http://www.tecchannel.de/server/hardware/432957/
>>>>>>>
>>>>>>>With best regards,
>>>>>>>
>>>>>>>Yar
>>>>>>
>>>>>>It should be a fast CPU, that Dempsey. However, that Xeon will arrive around
>>>>>>January 2007, and I'd guess it will be priced around 5000 euro per CPU in the
>>>>>>quad version, if you can get it for that at all, as you'll probably have to
>>>>>>buy 1000 at a time to get them for around 4500 dollars apiece.
>>>>>>
>>>>>>So effectively a quad Xeon dual-core will be around $40k in January 2007.
>>>>>>
>>>>>>By that time, of course, a quad Opteron quad-core will be nearly 2 times
>>>>>>faster and exactly 2 times cheaper.
>>>>>>
>>>>>>Please note that it's not certain whether the IPC of the Intel Pentium-M, at
>>>>>>such high clock speeds and as a dual core, will be better than AMD's. I'm
>>>>>>counting on it being a lot slower, because in order to clock the Pentium-M
>>>>>>higher, Intel will need to lengthen the pipeline and will probably move from a
>>>>>>2-cycle L1 to a 3-cycle L1. In that case the processor is similar to the
>>>>>>Opteron from a chess programming viewpoint.
>>>>>>
>>>>>>Of course the Xeons have bigger L2 or even L3 caches on chip than AMD. That's
>>>>>>nice for certain applications that show up in benchmarks, but in real life
>>>>>>it's not a huge advantage.
>>>>>>
>>>>>>A few MB is plenty for computer chess at the moment.
>>>>>>
>>>>>>On the other hand, could you tell me whether this Xeon has an on-die memory
>>>>>>controller or not?
>>>>>>
>>>>>>Because *that* matters a lot. Hash table access is a matter of TLB-thrashing
>>>>>>memory latencies to a big hash table. With 64-bit CPUs, and the clock that
>>>>>>keeps ticking, RAM sizes will increase too, meaning that the latencies you
>>>>>>lose to TLB thrashing (transposition table, eval table, not so much the pawn
>>>>>>table, as that will be in L2 cache for the majority of accesses) are
>>>>>>significant.
>>>>>>
>>>>>>If Intel plans to do that via some sort of off-chip chipset, then that is a
>>>>>>huge drawback of this Xeon CPU for databases and chess. In database benchmarks
>>>>>>using some small database they can get away with a big L2/L3, but in real life
>>>>>>there is no escape. It's just dead slow.
>>>>>>
>>>>>>So I do look forward to the Pentium-M, but given the price at which Intel
>>>>>>usually sells good CPUs, that doesn't mean we will see more quads online.
>>>>>>
>>>>>>Vincent
>>>>>
>>>>>
>>>>>Yes, memory latency seems worse.
>>>>>
>>>>>OTOH Intel has more than two times better bandwidth using 128-bit SSE2/3
>>>>>load/store instructions, which is of course not so important for computer chess.
>>>>>
>>>>>Cache/memory: 128-bit transfers
>>>>>Bandwidth in MByte/s
>>>>>
>>>>>           Dempsey Paxville Opteron 280
>>>>>L1          47340    41444    18360
>>>>>L2          24928    22105     9448
>>>>>Memory       3606     4127     3316
>>>>
>>>>That's of course just paper.
>>>>
>>>>First of all, on a quad machine, 8 Intel cores must share 3 GB/s of memory
>>>>bandwidth, and that is *theoretical* bandwidth.
>>>>
>>>>Whereas 8 cores on a quad Opteron have 4 memory controllers. So that's a factor
>>>>of 4 advantage to the Opteron in memory bandwidth.
>>>>
>>>>I didn't read the bandwidth specs for the L1 and L2 caches of the Intel chips.
>>>>
>>>>May I remind you that they had similarly grand predictions for the P4 in the
>>>>past. It would have a 2-cycle L1 cache, blah blah.
>>>>
>>>>Prescott actually has a 4-cycle L1 cache.
>>>>
>>>>The P4 would execute 4 instructions a cycle, because it has 2 double-clocked
>>>>integer units.
>>>>
>>>>In practice it is limited to 3 instructions a cycle, and nearly no one gets
>>>>even that, thanks to other limitations.
>>>>
>>>>We should all use CMOV constructs, says Intel, to avoid branch mispredictions.
>>>>
>>>>Actually their own compiler doesn't generate them when using P4 switches,
>>>>because on Prescott a CMOV costs at least 7 cycles, versus 2 on AMD.
>>>>
>>>>So you can quote anything on paper here. The reality can be expressed in money
>>>>very easily.
>>>>
>>>>Namely, that those Xeon chips can never compete in price against quad-core
>>>>Opterons, which will be on the market long before the DUAL-core Xeon is there.
>>>>
>>>>How can 4 cores from AMD ever be slower than 2 from Intel?
>>>>
>>>>If you plan to stream, for example, SSE2 data to processors executing all kinds
>>>>of code, then obviously 4 AMD cores will always beat 2 Intel cores.
>>>>
>>>>Especially if the AMD ones have already been running for months while the
>>>>Intels are still a sheet of paper at the factory.
>>>>
>>>>Please realize that for GFLOP calculations, memory bandwidth is the bottleneck.
>>>>If 4 CPUs (8 cores) from Intel can get at most 3 gigabytes a second, then
>>>>obviously AMD will always win when it can stream 12 gigabytes a second.
>>>>
>>>>That on paper Intel can receive 3.6 GB/s and AMD 3.3 GB/s per controller is not
>>>>really relevant.
>>>>
>>>>It's 1 memory controller for Intel, versus 4 for AMD.
>>>
>>>
>>>That is if you have four processors.  But the dual cores are sharing one
>>>controller, and the dual cores most definitely compete for that one
>>>HyperTransport interface as well, since the multiple cache controllers and
>>>processor cores place a high demand on a single path (per chip, not per core)
>>>to memory.  And then there is the issue of NUMA memory, which also reduces that
>>>high theoretical AMD bandwidth significantly...
>>>
>>>The dual-core chips are _not_ as good as two single-core chips, based on lots
>>>of benchmarking.  They are very good, don't get me wrong, but the shared
>>>HyperTransport means 2x the traffic through one external interface, which can
>>>and does produce a bottleneck...
>>>
>>
>>Thanks for clearing this up Robert.
>
>It's dead wrong of course.

It isn't "dead wrong".

AMD is aware of "issues" with dual cores with dual caches, with one HT
interface.  The quad-cores will have a larger issue.


>
>hypertransport isn't the bottleneck at all as that delivers hands down 14.4GB/s
>a channel.

There is more to this than "bandwidth".  There is "latency" and there are
"conflicts".

>
>The latency of the dual cores is a lot worse, however. If TLB-thrashing latency
>is a problem for your software, then dual cores will not give a precise 2-fold
>speedup but more like 1.90-1.99.
>
>In the scaling graph for Diep, based upon 200+ positions with 95% statistical
>significance, the scaling of Diep on a quad Opteron dual core was 7.475.

That was almost exactly what I got for Crafty with WCCC-level settings.  I
actually produced 8.0X (or very close to it) NPS numbers, but those settings
were internally less efficient in terms of parallel search overhead, and I
didn't play using them.
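
For concreteness, here is a small C check of the arithmetic behind those scaling
figures; the 7.475 and 8 are simply the numbers quoted above, not new
measurements:

#include <stdio.h>

/* Speedup of 7.475 on 8 cores -> parallel efficiency and per-core loss. */
int main(void) {
    double cores   = 8.0;
    double speedup = 7.475;              /* quad dual-core Opteron figure quoted above */
    double eff     = speedup / cores;    /* fraction of perfect linear scaling */
    printf("efficiency %.1f%%, loss per core %.1f%%\n",
           100.0 * eff, 100.0 * (1.0 - eff));   /* prints 93.4% and 6.6% */
    return 0;
}

That 6.6% is where the "less than 7% per CPU" figure below comes from.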

>
>That gives a loss per CPU of less than 7%.
>
>However, when simply running quad there is already a loss.

I got almost perfect scaling on 4-way and 8-way single-core systems.  I had some
interesting issues to work around for dual-core boxes...




>
>I argue that the loss on a quad dual-core is less than or equal to that of a
>plain 8-way single-core machine.

That I know is not true, as I have run on both.  The 8 single cores are
measurably better, but again, Crafty was my only benchmark.




>
>Please keep in mind that a latency of 234 nanoseconds on a quad dual-core (when
>all 8 cores simultaneously fetch 8 bytes from a random spot in a 2 GB buffer) is
>still a lot better than on an 8-way Intel Xeon, which has a latency of well over
>700 ns, up to 900 ns, for 2 GB.


I'm not going to debate Intel again.  AMD can be significantly worse than 234.
Their page tables require 4 memory reads when a TLB miss occurs.  Intel only
needs 2.  And if you use large pages, both trim one memory read off that, which
changes the numbers quite a bit.  But on AMD, with TLB thrashing, you are going
to do 5 reads every time: translate the virtual address to a real address, then
read the actual data.  With big pages, that becomes 4.  With Intel it becomes 2.
Those are not that far apart overall.
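
A minimal sketch of what "use large pages" means in practice, assuming a Linux
system with MAP_HUGETLB available and huge pages reserved by the administrator
(e.g. via vm.nr_hugepages); backing the hash table with 2 MB pages removes one
level of the page-table walk, which is exactly the cost being discussed here:

#include <stdio.h>
#include <sys/mman.h>

/* Allocate the transposition table on 2 MB huge pages when possible,
 * so a TLB miss costs one fewer memory read during the page walk. */
static void *alloc_hash(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)                 /* no huge pages configured: fall back */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

int main(void) {
    size_t bytes = (size_t)512 * 1024 * 1024;   /* e.g. a 512 MB hash table */
    void *tt = alloc_hash(bytes);
    if (!tt) { perror("mmap"); return 1; }
    printf("hash table at %p\n", tt);
    munmap(tt, bytes);
    return 0;
}

With 4 KB pages and a multi-gigabyte table, nearly every probe is a TLB miss, so
shortening the walk is where the gain comes from.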

The problem with Intel will be the non-NUMA memory that requires interleaving,
and I doubt they are going beyond 4-way interleaving, leaving 8 CPUs to starve
for data...

Also when you start sharing data, those remote caches become yet another
bottleneck, because they are all talking to each other very frequently to handle
the "owner state" in MOESI.  This can become another issue if great care is not
taken in sharing data responsibly...
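
A minimal sketch of the kind of care that helps here, assuming the usual 64-byte
x86 cache line; the struct and names are illustrative, not taken from Crafty.
The idea is to give each thread its own cache line for anything it writes often,
so that line never has to bounce between the cores' caches:

#include <stdint.h>

#define CACHE_LINE  64          /* typical x86 cache-line size */
#define MAX_THREADS 16

/* Pad each thread's counters out to a full cache line so that a write by
 * one core never invalidates a line another core is also writing. */
struct per_thread {
    uint64_t nodes;                               /* written only by its owner */
    char pad[CACHE_LINE - sizeof(uint64_t)];
};

static struct per_thread counters[MAX_THREADS]
        __attribute__((aligned(CACHE_LINE)));     /* gcc-style alignment */

void count_node(int thread_id) {
    counters[thread_id].nodes++;                  /* stays in that core's cache */
}

Read-mostly shared data is much less of a problem; it is the frequently written
shared lines that generate the coherency traffic described above.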



>
>Vincent
>
>>>
>>>>
>>>>Now for games that are multithreaded, and for SSE2 calculations like in all
>>>>kinds of graphics and such, that memory performance is a big hit.
>>>>
>>>>Additionally, L1 bandwidth for chess will be dominated by the LATENCY of
>>>>getting a single doubleword out of L1 and by the number of reads you can do
>>>>there simultaneously.
>>>>
>>>>I remember the optimistic specs from Intel in the past. They were not true.
>>>>
>>>>What will be the Achilles' heel this time?
>>>>
>>>>If there isn't one, it's a killer CPU for software that doesn't need RAM!
>>>>
>>>>If there is again an Achilles' heel, then Intel has a major problem.
>>>>
>>>>But I do realize what those CPUs will cost. Just look at the size of the L2
>>>>cache!
>>>>
>>>>What was it, 16 MB or something?
>>>>
>>>>That's not gonna be CHEAP.
>>>
>>>I would not speculate there.  As fab feature sizes go down, transistor counts
>>>go up, with no increase in cost at all.  It used to be "how can we squeeze all
>>>this stuff (L1/L2/floating point/multiple pipes/etc.) into this small number of
>>>transistors?"  It is now more "what on earth can we use all these transistors
>>>for?"  At 6 transistors per bit for SRAM, a megabit requires 6M transistors,
>>>which is chickenfeed...
>>>
>>>
>>>
>>>>
>>>>So whatever its performance, it won't be able to compete against AMD in that
>>>>sense.
>>>>
>>>>You wonder about SSE2 here. Well, let me ask you: how many SSE2 execution
>>>>units does it have?
>>>>
>>>>We know AMD has 2.
>>>>P4 has 1.
>>>>
>>>>AMD completely outperforms P4 there.
>>>>
>>>>Why would this be different for a Pentium-M on steroids? Can you give some
>>>>explanation?
>>>>
>>>>Let me give counter arguments.
>>>>
>>>>a) it will have an extremely TINY L1 cache
>>>>b) it will SHARE the L2 cache, so it has a DEAD SLOW L2 cache in terms of
>>>>latency.
>>>
>>>Shared L2 is not necessarily bad.  On AMD the MOESI traffic can get _very_ high
>>>if the two processor cores are modifying data that is shared...
>>>
>>>
>>>>c) Intel has the habit of trying to get away with a very cheap L1 cache too,
>>>>giving it just 1 port. AMD already had 2 ports on the K7, and so does the K8.
>>>>
>>>>What will Intel do this time to keep this CPU cheap to produce, while asking
>>>>gold coins for it when you buy it?
>>>>
>>>>Who knows, perhaps intel has some good cpu now?
>>>>
>>>>Let's hope so.
>>>>
>>>>>Also, general SSE performance is much better on the future Intels.
>>>>>Hopefully that's some motivation for AMD to work on 128-bit ALUs ;-)
>>>>>Gerd


