Computer Chess Club Archives



Subject: Re: Is Xeon 5000 (Dempsey) with FB-DIMMs faster then Opteron 280 ?

Author: Robert Hyatt

Date: 20:10:43 11/10/05



On November 10, 2005 at 20:36:22, enrico carrisco wrote:

>
>Explain the "issues" -- how does this affect computer chess, whatsoever?  Crafty
>(and any other program) would be lucky to use even 1/3 of the HT pipe.
>

Very optimistic guess.  Here's why.  Take my quad-cpu dual core WCCC event
machine.  You get 4 HT gateways, one per processor (not one per core).  I ended
up with 4 nodes, each node having one HT and dual CPU cores and dual L1/L2
caches.

The nodes are connected in a "box".  Node 0 is connected to nodes 1 and 2, and
nodes 1 and 2 are each connected to nodes 0 and 3.  Memory accesses from node 0
go through node 1 or 2 before getting delivered to node 3.  So does cache
activity.

I can explain MOESI if you like, but its idea is to avoid the classic
"invalidate-write" Intel used to use.  When CPU 1 reads data that is cached in
CPU 2, and CPU 2 has already modified the data but has not written it back to
memory, the old approach would cause CPU 2 to mark the cache block as invalid
and write it back to memory.  Then CPU 1 would read the entire block from
memory, which resulted in 2X memory traffic for that one memory read.  MOESI
lets the cache for CPU 2 send the block directly to CPU 1 without doing the
memory write.  But now we have a serious coherency issue, and MOESI adds the
"owner" state to make sure that everyone knows who has the most recently
modified block of cache for a given memory address, so that it gets updated in
real memory when appropriate.  And that kind of protocol results in lots of
"cache-to-cache" message passing.  Over, you guessed it, the HT bus.  So if you
modify shared data, you can cause a _lot_ of cache-to-cache traffic that gets
interspersed with the normal node-to-remote-memory traffic.  It is quite easy
to totally bog the HT system down.  That was what my two weeks of frantic
tuning prior to the WCCC was all about.  Initially 8 cpus were no faster
NPS-wise than 2 or 3 real processors...
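
To make the box and the two read paths concrete, here is a rough Python sketch.
It is only a toy model of what is described above, not a simulation of real
Opteron hardware.  The names LINKS, hops and dirty_read_cost are made up for
illustration, and the cost tuple simply counts memory transfers and HT links
crossed by the cache-to-cache copy.

```python
from collections import deque

# Links of the 4-node "box" described above: 0-1, 0-2, 1-3, 2-3.
LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}

def hops(src, dst):
    """Breadth-first search: number of HT links crossed from src to dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in LINKS[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("unreachable node")

def dirty_read_cost(protocol, reader, owner):
    """Toy cost of reading a block that 'owner' modified but has not written back.

    Returns (memory_transfers, cache_to_cache_hops).
    """
    if protocol == "write-back-then-read":
        # Old approach: the owner writes the block to memory, then the
        # reader fetches the whole block from memory (2X memory traffic).
        return (2, 0)
    if protocol == "moesi":
        # The owner forwards the block directly; no memory write yet.
        return (0, hops(owner, reader))
    raise ValueError("unknown protocol: " + protocol)
```

For example, hops(0, 3) is 2 because node 0 has no direct link to node 3, and
dirty_read_cost("moesi", 0, 3) trades the two memory transfers for a two-hop
cache-to-cache copy over HT.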





>>
>>
>>>
>>>hypertransport isn't the bottleneck at all as that delivers hands down 14.4GB/s
>>>a channel.
>>
>>There is more to this than "bandwidth".  There is "latency" and there are
>>"conflicts"
>>
>>>
>>>The latency from the dual cores is a lot worse however. If TLB trashing latency
>>>is a problem for your software, then dual cores will not give a precise 2 fold
>>>speedup but more like 1.90-1.99 or so.
>>>
>>>In the scaling graph from diep, based upon 200+ positions and a statistical
>>>significance (95%) the scaling of diep at a quad opteron dual core was 7.475
>>
>>That was almost exactly what I got for Crafty for WCCC-level settings.  I
>>actually produced 8.0X (or very close to it) NPS numbers, but those settings
>>internally were less efficient from the parallel search overhead and I didn't
>>play using them.
>>
>>>
>>>That gives a loss a cpu of less than 7%.
>>>
>>>However when simply running quad there is already a loss.
>>
>>I got almost perfect scaling on 4-way and 8-way single core systems.  I had some
>>interesting issues to work around for dual core boxes..
>
>Again, what "issues"?  I agree that sharing the HT could become an issue with
>"certain applications" -- but not computer chess.  The only other application I
>could think of is some of the intense gaming software but, to my knowledge, none
>of those are even SMP capable yet.


Computer chess can produce a _lot_ of memory accesses, which turn into a lot of
cache accesses, which turn into a lot of cache-to-cache traffic on an SMP box.
A _lot_ of traffic.  It can be reduced with some programming effort.  I did it.
But it is easy to "heat it up" nicely...  And performance suffers.
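
One common way to cut this kind of traffic is to keep data written by different
CPUs out of the same cache line.  The Python sketch below is a toy model of
that idea only; it is not a description of what was actually changed in Crafty.
The 64-byte line size and the names ownership_transfers, packed and padded are
assumptions made for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

def ownership_transfers(writes, address_of):
    """Count how often a cache line has to move to a different CPU.

    writes: sequence of CPU ids, each doing one write to its own counter.
    address_of: maps a CPU id to the byte address of that CPU's counter.
    """
    owner = {}       # cache line number -> CPU that wrote it last
    transfers = 0
    for cpu in writes:
        line = address_of(cpu) // LINE_SIZE
        if line in owner and owner[line] != cpu:
            transfers += 1   # line must come from another CPU's cache
        owner[line] = cpu
    return transfers

def packed(cpu):
    return cpu * 8           # 8-byte counters side by side in one line

def padded(cpu):
    return cpu * LINE_SIZE   # each counter gets a cache line to itself

# 8 CPUs take turns bumping their own counter, 1000 rounds.
writes = [cpu for _ in range(1000) for cpu in range(8)]
```

In this model ownership_transfers(writes, packed) counts a transfer on nearly
every write, since all eight counters share one line, while
ownership_transfers(writes, padded) counts none.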

Note that this is simply a result of NUMA, because the design is intended to
provide better price scaling at the expense of performance.  Otherwise it would
be truly SMP with uniform memory accesses, but for 8 or 16 cpus it would be
prohibitively expensive, if based on a non-blocking crossbar as in the Cray
vector machines...







Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.