Computer Chess Club Archives


Subject: Re: Is Xeon 5000 (Dempsey) with FB-DIMMs faster then Opteron 280 ?

Author: enrico carrisco

Date: 17:36:22 11/10/05


On November 10, 2005 at 00:28:24, Robert Hyatt wrote:

>On November 09, 2005 at 21:14:46, Vincent Diepeveen wrote:
>
>>On November 09, 2005 at 20:09:36, gerold daniels wrote:
>>
>>>On November 09, 2005 at 13:48:13, Robert Hyatt wrote:
>>>
>>>>On November 08, 2005 at 22:25:15, Vincent Diepeveen wrote:
>>>>
>>>>>On November 08, 2005 at 16:23:22, Gerd Isenberg wrote:
>>>>>
>>>>>>On November 08, 2005 at 15:26:45, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On November 08, 2005 at 13:17:42, Yar wrote:
>>>>>>>
>>>>>>>>Hello,
>>>>>>>>
>>>>>>>>Here is a review (14 pages in total) of Intel's upcoming Xeon 5000 (Dempsey).
>>>>>>>>Sorry, it's only in German. It seems it's faster than the Opteron 280.
>>>>>>>>http://www.tecchannel.de/server/hardware/432957/
>>>>>>>>
>>>>>>>>With best regards,
>>>>>>>>
>>>>>>>>Yar
>>>>>>>
>>>>>>>That Dempsey should be a fast CPU. However, that Xeon will arrive around January
>>>>>>>2007, and I guess it will cost around 5000 euro per CPU in the
>>>>>>>quad version, if you can get it for that, as you'll probably have to buy
>>>>>>>1000 at a time to get them for around 4500 dollars apiece.
>>>>>>>
>>>>>>>So effectively a quad dual-core Xeon will cost around $40k in January 2007.
>>>>>>>
>>>>>>>By that time, of course, a quad quad-core Opteron will be nearly 2 times faster
>>>>>>>and exactly 2 times cheaper.
>>>>>>>
>>>>>>>Please note that it's not certain that the IPC of the Intel Pentium-M, at such
>>>>>>>high clock speeds and as a dual core, will be better than AMD's. I'm counting on
>>>>>>>it being a lot slower, because in order to clock the Pentium-M higher, Intel
>>>>>>>will need to make the pipeline longer and will probably move from a 2-cycle L1
>>>>>>>to a 3-cycle L1. In that case the processor is similar to the Opteron from a
>>>>>>>chess programming viewpoint.
>>>>>>>
>>>>>>>Of course the Xeons have bigger on-chip L2 or even L3 caches than AMD. That's
>>>>>>>nice for certain applications that are in benchmarks, but in real life it's not a
>>>>>>>huge advantage.
>>>>>>>
>>>>>>>A few MB's is plenty for computerchess at the moment.
>>>>>>>
>>>>>>>On the other hand, could you tell me whether this Xeon has an on die memory
>>>>>>>controller or doesn't it have one?
>>>>>>>
>>>>>>>Because *that* matters a lot. Hash tables are a matter of TLB-thrashing memory
>>>>>>>latencies to a big hash table. With 64-bit CPUs and the clock that keeps
>>>>>>>ticking, RAM sizes will increase too, meaning that the latencies you lose to
>>>>>>>TLB thrashing (transposition table, eval table, not so much the pawn table as
>>>>>>>that'll be in L2 cache for the majority of accesses) are significant.
>>>>>>>
>>>>>>>If intel plans to do that via some sort of chipset off chip, then that is a huge
>>>>>>>drawback of this Xeon cpu for databases and chess. At database benchmarks, using
>>>>>>>some small database they can get away with a big L2/L3 then, but in real life
>>>>>>>there is no escape there. It's just dead slow.
>>>>>>>
>>>>>>>So i do look forward to pentium-m, but the price at which intel usually sells
>>>>>>>good cpu's doesn't mean that we will see more quads online.
>>>>>>>
>>>>>>>Vincent
>>>>>>
>>>>>>
>>>>>>Yes, memory latency seems worse.
>>>>>>
>>>>>>OTOH Intel has more than two times better bandwidth using 128-bit SSE2/3
>>>>>>load/store instructions, which is of course not so important for computer chess.
>>>>>>
>>>>>>Cache/Memory: 128-bit transfer
>>>>>>Bandwidth in MByte/s
>>>>>>
>>>>>>           Dempsey Paxville Opteron 280
>>>>>>L1          47340    41444    18360
>>>>>>L2          24928    22105     9448
>>>>>>Memory       3606     4127     3316
>>>>>
>>>>>That's of course just paper.
>>>>>
>>>>>First of all, on a quad machine, Intel's 8 cores must share 3GB/s of memory
>>>>>bandwidth, which is *theoretical* bandwidth.
>>>>>
>>>>>Meanwhile the 8 cores of a quad Opteron have 4 memory controllers. So that's a
>>>>>factor 4 advantage to Opteron there in memory bandwidth.
>>>>>
>>>>>I didn't read bandwidth specs from L1&L2 cache of the intel chips.
>>>>>
>>>>>May I remind you that there were similarly heavenly predictions for the P4 in the
>>>>>past. It would have a 2-cycle L1 cache, bla bla.
>>>>>
>>>>>Prescott actually has a 4-cycle L1 cache.
>>>>>
>>>>>P4 would execute 4 instructions a cycle, because of having 2 double-clocked
>>>>>integer units.
>>>>>
>>>>>In practice its limitations hold it to 3 instructions a cycle, and
>>>>>nearly no one can reach that, thanks to other limitations.
>>>>>
>>>>>We should all use CMOV constructs says intel, to avoid branch mispredictions.
>>>>>
>>>>>Actually their own compiler doesn't generate them when using P4 switches,
>>>>>because at prescott a CMOV is at least 7 cycles penalty, versus 2 for AMD.
>>>>>
>>>>>So you can quote anything on paper here. The reality can be expressed in money
>>>>>very easily.
>>>>>
>>>>>Namely, those Xeon chips can never compete on price with quad-core
>>>>>Opterons, which will be on the market long before the DUAL-core Xeon is
>>>>>there.
>>>>>
>>>>>How can 4 cores from AMD ever be slower than 2 from Intel?
>>>>>
>>>>>If you plan to stream for example SSE2 to processors executing all kind of code,
>>>>>then obviously 4 cores of AMD always will win from 2 cores of intel.
>>>>>
>>>>>Especially if the AMD ones can run already for months when the intels still are
>>>>>in the factory on a paper sheet.
>>>>>
>>>>>Please realize in terms of bandwidth for gflop calculations that memory is the
>>>>>bottleneck. If 4 cpu's (8 cores) from intel can get at most 3 gigabyte a second,
>>>>>then obviously AMD will always win when they can stream 12 gigabyte a second to
>>>>>it.
>>>>>
>>>>>When on paper Intel can receive 3.6GB and AMD on paper can receive 3.3GB a
>>>>>second, that's not really relevant.
>>>>>
>>>>>It's 1 memory controller for Intel, versus 4 for AMD.
>>>>
>>>>
>>>>That is if you have four processors.  But the dual cores are sharing one
>>>>controller, and the dual cores most definitely compete for that one
>>>>hypertransport interface also since the multiple cache controllers and processor
>>>>cores place a high demand on a single path (per chip, not per core) to memory.
>>>>And then there is the issue of NUMA memory, which also reduces that high
>>>>theoretical AMD bandwidth significantly...
>>>>
>>>>The dual-core chips are _not_ as good as two single-core chips, based on lots of
>>>>benchmarking.  They are very good, don't get me wrong, but the shared
>>>>hypertransport means 2x the traffic thru one external interface, which can and
>>>>does produce a bottleneck..
>>>>
>>>
>>>Thanks for clearing this up Robert.
>>
>>It's dead wrong of course.
>
>It isn't "dead wrong".
>
>AMD is aware of "issues" with dual cores with dual caches, with one HT
>interface.  The quad-cores will have a larger issue.

Explain the "issues" -- how does this affect computer chess, whatsoever?  Crafty
(and any other program) would be lucky to use even 1/3 of the HT pipe.
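
For scale, here is a rough per-core comparison using the figures quoted above: the table's measured memory bandwidth and the one-versus-four memory controller argument. It is only a sketch. It assumes the table figure applies per memory controller, that bandwidth splits evenly among cores, and that it adds up linearly across controllers, which the NUMA point above suggests is optimistic.

```python
# Per-core share of memory bandwidth, from the figures quoted in the thread.
# Assumptions (not from the review): the table figure holds per controller,
# bandwidth is split evenly across cores, and controllers add up linearly.

def per_core_bandwidth(mb_per_s_per_controller, controllers, cores):
    """Return the MB/s available to each core under an even split."""
    return mb_per_s_per_controller * controllers / cores

CORES = 8  # four sockets, two cores each

# Dempsey: all cores behind one shared memory controller.
intel = per_core_bandwidth(3606, controllers=1, cores=CORES)
# Opteron 280: one memory controller per socket.
opteron = per_core_bandwidth(3316, controllers=4, cores=CORES)

print(f"Intel:   {intel:.1f} MB/s per core")
print(f"Opteron: {opteron:.1f} MB/s per core")
print(f"Ratio:   {opteron / intel:.2f}x")
```

Whether a chess program ever asks for that much per core is a separate question; the sketch only shows how the shares divide.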

>
>
>>
>>hypertransport isn't the bottleneck at all as that delivers hands down 14.4GB/s
>>a channel.
>
>There is more to this than "bandwidth".  There is "latency" and there are
>"conflicts"
>
>>
>>The latency of the dual cores is a lot worse, however. If TLB thrashing latency
>>is a problem for your software, then dual cores will not give a precise 2-fold
>>speedup but more like 1.90-1.99 or so.
>>
>>In the scaling graph from diep, based upon 200+ positions and a statistical
>>significance (95%) the scaling of diep at a quad opteron dual core was 7.475
>
>That was almost exactly what I got for Crafty for WCCC-level settings.  I
>actually produced 8.0X (or very close to it) NPS numbers, but those settings
>internally were less efficient from the parallel search overhead and I didn't
>play using them.
>
>>
>>That gives a loss a cpu of less than 7%.
>>
>>However when simply running quad there is already a loss.
>
>I got almost perfect scaling on 4-way and 8-way single core systems.  I had some
>interesting issues to work around for dual core boxes..

Again, what "issues"?  I agree that sharing the HT could become an issue with
"certain applications" -- but not computer chess.  The only other application I
could think of is some of the intense gaming software but, to my knowledge, none
of those are even SMP capable yet.
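
The scaling numbers quoted above can be turned into a per-core loss as follows. This is a minimal sketch that assumes "loss a cpu" means one minus speedup divided by core count; the function names are only illustrative.

```python
# Parallel efficiency and per-core loss from a measured speedup.
# Assumption: "loss per cpu" is defined as 1 - speedup / cores.

def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

def loss_per_core(speedup, cores):
    """Fraction of each core's capacity lost to overhead."""
    return 1.0 - parallel_efficiency(speedup, cores)

# Diep on a quad dual-core Opteron, as quoted: 7.475 on 8 cores.
quad_dual = loss_per_core(7.475, 8)
# Lower end of the quoted dual-core range: 1.90 on 2 cores.
dual_low = loss_per_core(1.90, 2)

print(f"quad dual-core loss per core: {quad_dual:.2%}")
print(f"dual-core (1.90x) loss per core: {dual_low:.2%}")
```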

-elc.

>
>
>
>
>>
>>I argue that the loss of a quad dual core is less than or equal to that of a
>>plain 8-way single-core machine.
>
>That I know is not true as I have run on both.  The 8 single cores is measurably
>better, but again, Crafty was my only benchmark.
>
>
>
>
>>
>>Please keep in mind that a latency of 234 nanoseconds at a quad dual core (when
>>all 8 cores simultaneously ask 8 bytes from a random spot out of a 2GB buffer)
>>is still a lot better than at a 8 fold intel Xeon which has a latency of far
>>over 700 ns up to 900 ns for 2GB.
>
>
>I'm not going to debate intel again.  AMD can be significantly worse than 234.
>Their page tables require 4 memory reads when a TLB miss occurs.  Intel only
>needs 2.  And if you use large pages, both trim one memory read off that which
>changes the numbers quite a bit.  But on AMD, with TLB thrashing, you are going
>to do 5 reads every time to translate the virtual to a real address and then
>read the actual data.  With big pages, that becomes 4.  With Intel it becomes 2.
> Those are not that far apart overall.
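
The read counts in the paragraph above can be tabulated with a small sketch. It takes the quoted walk depths (4 page-table reads for AMD, 2 for Intel) as given, and assumes every access misses the TLB and that large pages remove exactly one level of the walk.

```python
# Memory reads per data access when the TLB misses, per the quoted figures.
# Assumptions: every access misses the TLB, nothing in the walk is cached,
# and large pages shorten the page-table walk by exactly one read.

def reads_per_access(walk_reads, large_pages=False):
    """Page-table reads for the translation plus one read for the data."""
    walk = walk_reads - 1 if large_pages else walk_reads
    return walk + 1

AMD_WALK = 4    # quoted: AMD needs 4 memory reads on a TLB miss
INTEL_WALK = 2  # quoted: Intel needs 2

for name, walk in (("AMD", AMD_WALK), ("Intel", INTEL_WALK)):
    small = reads_per_access(walk)
    large = reads_per_access(walk, large_pages=True)
    print(f"{name}: {small} reads with small pages, {large} with large pages")
```
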
>
>Problem with Intel will be the non-NUMA memory that requires interleaving, and I
>doubt they are going beyond 4-way interleaving, leaving 8 cpus to starve for
>data...
>
>Also when you start sharing data, those remote caches become yet another
>bottleneck, because they are all talking to each other very frequently to handle
>the "owner state" in MOESI.  This can become another issue if great care is not
>taken in sharing data responsibly...
>
>
>
>>
>>Vincent
>>
>>>>
>>>>>
>>>>>Now for games that are multithreaded and do SSE2 calculations, as in all kinds of
>>>>>graphics and such, that memory performance is a big performance hit.
>>>>>
>>>>>Additionally, bandwidth in L1 cache for chess will be dominated by the LATENCY
>>>>>that getting a single doubleword out of L1 costs and the number of reads you can
>>>>>do simultaneously there.
>>>>>
>>>>>I remember the optimistic specs from the past from intel. They were not true.
>>>>>
>>>>>What will be the Achilles' heel this time?
>>>>>
>>>>>If there isn't one, it's a killer CPU for software that doesn't need
>>>>>RAM!
>>>>>
>>>>>If there is an Achilles' heel again, then Intel has a major problem.
>>>>>
>>>>>But i do realize the price of those cpu's. Just look to the size of the L2
>>>>>cache!
>>>>>
>>>>>What was it 16MB or something?
>>>>>
>>>>>That's not gonna be CHEAP.
>>>>
>>>>Would not speculate there.  As FAB sizes go down, transistor count goes up, with
>>>>no increase in cost at all.  It used to be "how can we squeeze all this stuff
>>>>(L1/L2/floating point/multiple pipes/etc) into this small number of
>>>>transistors?"  It is now more of "what on earth can we use all these transistors
>>>>for?"  At 6 transistors per bit for SRAM, a megabyte requires about 50M transistors,
>>>>which is chickenfeed...
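
A quick check of the 6-transistors-per-bit arithmetic, counting only the storage cells. Tags, decoders and other control logic are ignored, and the 16 MB size is just the guess made earlier in the thread, not a confirmed specification.

```python
# Transistor count for an SRAM array at 6 transistors per bit (6T cell).
# Assumption: only storage cells are counted; tag and control logic are not.

TRANSISTORS_PER_BIT = 6
BITS_PER_BYTE = 8
MIB = 1 << 20  # bytes in one binary megabyte

def sram_cell_transistors(num_bytes):
    """Transistors needed for the storage cells of num_bytes of SRAM."""
    return num_bytes * BITS_PER_BYTE * TRANSISTORS_PER_BIT

per_megabit = sram_cell_transistors(MIB // BITS_PER_BYTE)
per_megabyte = sram_cell_transistors(MIB)
guessed_cache = sram_cell_transistors(16 * MIB)  # hypothetical 16 MB cache

print(f"1 megabit:  {per_megabit:,} transistors")
print(f"1 megabyte: {per_megabyte:,} transistors")
print(f"16 MB:      {guessed_cache:,} transistors")
```
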
>>>>
>>>>
>>>>
>>>>>
>>>>>So whatever its performance, it won't be able to compete against AMD in that
>>>>>sense.
>>>>>
>>>>>You wonder about SSE2 here. Well let me ask you, how many SSE2 execution units
>>>>>does it have?
>>>>>
>>>>>We know AMD has 2.
>>>>>P4 has 1.
>>>>>
>>>>>AMD completely outperforms P4 there.
>>>>>
>>>>>Why would this be different for a Pentium-M on steroids? Can you give some
>>>>>explanations?
>>>>>
>>>>>Let me give counter arguments.
>>>>>
>>>>>a) it will have an utmost TINY L1 cache
>>>>>b) it will SHARE the L2 cache, so it has a DEAD SLOW L2 cache in terms of
>>>>>latency.
>>>>
>>>>Shared L2 is not necessarily bad.  On AMD the MOESI traffic can get _very_ high
>>>>if the two processor cores are modifying data that is shared...
>>>>
>>>>
>>>>>c) Intel has the habit of trying to get away with a very cheap L1 cache too, and
>>>>>giving it just 1 port. AMD already had 2 ports in the K7, and so has the K8.
>>>>>
>>>>>What will intel do this time to keep this cpu a cheap cpu to produce, meanwhile
>>>>>asking golden coins when you buy it?
>>>>>
>>>>>Who knows, perhaps intel has some good cpu now?
>>>>>
>>>>>Let's hope so.
>>>>>
>>>>>>Also general SSE-performance is much better for the future intels.
>>>>>>Hopefully some motivation for amd to work on 128-bit alus ;-)
>>>>>>Gerd



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.