Computer Chess Club Archives

Subject: Re: Intel four-way 2.8 Ghz system is just Amazing ! - Not hardly

Author: Robert Hyatt
Date: 12:26:23 11/12/03
On November 12, 2003 at 13:46:21, Matthew Hull wrote:

>On November 12, 2003 at 13:34:09, Robert Hyatt wrote:
>
>>On November 12, 2003 at 13:18:48, Matthew Hull wrote:
>>
>>>On November 12, 2003 at 12:18:22, Anthony Cozzie wrote:
>>>
>>>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>>>
>>>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>>>
>>>>>>My point is: it's possible that due to the fact that quad Opteron is NUMA --
>>>>>>not SMP -- system, for SMP-only program performance on quad Opteron can be
>>>>>>worse than on *real* quad SMP system, even when for one CPU Opteron
>>>>>>performance is much better. Itanium was used only as an example of such
>>>>>>system, I never recommended rewriting any program for it.
>>>>>
>>>>>I don't understand how. The NUMA part is RAM. Even worst case on the Opteron
>>>>>RAM is faster than Xeon SMP. So how could it ever be worse?
>>>>>
>>>>>--
>>>>>GCP
>>>>
>>>>Aaron's argument is: if a 1x opteron is faster than a 1x Xeon, a 4x opteron will
>>>>be faster than a 4x Xeon.
>>>>
>>>>Nalimov is saying that Fritz may scale worse on the opteron due to NUMA issues.
>>>>In other words, this is comparing latency with 1x opteron and NUMA opteron
>>>>relative to 1x Xeon vs SMP Xeon.
>>>>
>>>>Off hand this seems logical to me . . .
>>>
>>>Perhaps Eugene can tell us if SMP crafty was slower on 2x opteron than Bob's 2x
>>>Xeon, before the NUMA mods were made?
>>>
>>>MH
>>
>>Yes.  It was _really_ bad on the opteron.  But then again it was also not
>>real good on my xeon.  Even though the NPS scaled _perfectly_ on my older
>>quad xeons.  The PIV went to a longer cache line, which caused some coherency
>>overhead that hurt.  This has been addressed in the current code.  But the
>>problem was worse on the opteron due to the NUMA delays, compared to the
>>PIV xeons which simply have a longer cache line to aggravate the problem.
>
>
>Thanks for the clarification.  I had wondered since GCP has been implying good
>speedup on Opteron without bothering to design to NUMA.

I think he uses the "heavyweight process" approach, where the 4 processes
don't share anything except for stuff explicitly shared through a block
of "system V shared memory".  That solves one problem, although doing a
single shmget() will result in all of that shared memory being stuck on
a single processor's local memory, which is not going to be very good.
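
[Editor's sketch, not code from Crafty or any other engine.] One way to avoid
the single-segment problem described above is to create one System V segment
per process and let the process that will use it touch it first. This assumes
the operating system places pages by "first touch", which is OS-dependent and
not guaranteed. The helper names make_segment, attach_and_touch and
release_segment are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create one private System V segment of `bytes` bytes.
   Returns the segment id, or -1 on failure. */
static int make_segment(size_t bytes)
{
    assert(bytes > 0);
    return shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
}

/* Attach a segment and write every byte from the calling process.
   ASSUMPTION: under a first-touch placement policy the pages end up
   in memory local to the CPU that runs this call. */
static void *attach_and_touch(int id, size_t bytes)
{
    void *p = shmat(id, NULL, 0);
    if (p == (void *)-1)
        return NULL;
    memset(p, 0, bytes);
    return p;
}

/* Detach the segment and mark it for removal. Returns 0 on success. */
static int release_segment(int id, void *p)
{
    int rc = shmdt(p);
    if (shmctl(id, IPC_RMID, NULL) != 0)
        rc = -1;
    return rc;
}
```

Each search process would call make_segment() and attach_and_touch() for its
own data after it has been scheduled on its processor, and attach the other
processes' segments only for the data that really must be shared.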

And with heavyweight processes, you lose the ability to use Eugene's EGTB
probe code's LRU buffers for all threads.  You end up with 4 sets of EGTB
buffers, and each process could be reading the same blocks over and over
where with threads, one read would share the results with all threads
needing it.
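
[Editor's sketch, not Eugene's probe code.] The point about one read serving
all threads can be illustrated with a tiny mutex-protected LRU block cache.
Everything here is an assumption made for illustration: the slot count, the
block size, the fake_disk_read() stand-in, and the choice to hold the lock
during the read, which a real implementation would likely avoid.

```c
#include <pthread.h>
#include <string.h>

#define CACHE_SLOTS 8
#define BLOCK_BYTES 64

struct block_cache {
    pthread_mutex_t lock;
    long tag[CACHE_SLOTS];            /* block number per slot, -1 = empty */
    unsigned long stamp[CACHE_SLOTS]; /* last-use time, for LRU eviction */
    unsigned char data[CACHE_SLOTS][BLOCK_BYTES];
    unsigned long clock;
    int disk_reads;                   /* how often we had to go to "disk" */
};

static void cache_init(struct block_cache *c)
{
    int i;
    pthread_mutex_init(&c->lock, NULL);
    for (i = 0; i < CACHE_SLOTS; i++) {
        c->tag[i] = -1;
        c->stamp[i] = 0;
    }
    c->clock = 0;
    c->disk_reads = 0;
}

/* Stand-in for a real tablebase read (hypothetical). */
static void fake_disk_read(long block, unsigned char *out)
{
    memset(out, (int)(block & 0xff), BLOCK_BYTES);
}

/* Return the first byte of `block`, reading it only on a cache miss. */
static unsigned char cache_get(struct block_cache *c, long block)
{
    int i, slot = -1, victim = 0;
    unsigned char value;

    pthread_mutex_lock(&c->lock);
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (c->tag[i] == block) { slot = i; break; }
        if (c->stamp[i] < c->stamp[victim]) victim = i;
    }
    if (slot < 0) {                   /* miss: evict least recently used */
        slot = victim;
        fake_disk_read(block, c->data[slot]);
        c->tag[slot] = block;
        c->disk_reads++;
    }
    c->stamp[slot] = ++c->clock;
    value = c->data[slot][0];
    pthread_mutex_unlock(&c->lock);
    return value;
}

struct probe_arg { struct block_cache *c; long block; };

static void *probe_thread(void *p)
{
    struct probe_arg *a = p;
    cache_get(a->c, a->block);
    return NULL;
}

/* Probe the same block from up to 16 threads; returns threads started. */
static int probe_from_threads(struct block_cache *c, long block, int nthreads)
{
    pthread_t tid[16];
    struct probe_arg arg = { c, block };
    int i, started = 0;

    if (nthreads > 16)
        nthreads = 16;
    for (i = 0; i < nthreads; i++)
        if (pthread_create(&tid[started], NULL, probe_thread, &arg) == 0)
            started++;
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started;
}
```

With four threads probing the same block, disk_reads ends up at 1. Four
separate processes, each with a private copy of this cache, would each do
their own read.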

I prefer threads (lightweight processes).  GCP/Vincent prefer heavyweight
processes.  There's give and take in both approaches.  Threads are more
complex due to sharing lots of stuff.  But, IMHO, they are better overall
from a performance perspective, when you consider the EGTB stuff.  On a good
system, both share instructions due to the way processes are created, but
at least for EGTB stuff, threads are definitely better.  Using fork() on a
NUMA box is really no better than threads, in that the instructions will still
sit on a single processor's memory.  But Eugene thinks the cache is big enough
to make this unimportant.  The best way (on NUMA) is probably to execute
completely different executable file names, so that there is no sharing and
instructions get stuck in memory on the processor that runs the process.  That
is a bit more complicated and until I see that it is really necessary, I'm not
going that way...
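
[Editor's sketch, not something any engine is known to do.] The "different
executable file names" idea could look roughly like this: keep one copy of
the binary per processor and fork/exec the matching copy for each process.
The "<base>.<node>" naming scheme and both helper names are hypothetical, and
whether the instructions really end up in local memory depends on the
operating system's page placement.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Build "<base>.<node>", e.g. "engine.2" (hypothetical naming scheme).
   Returns 0 if the name fit in `out`, -1 if it was truncated. */
static int node_image_path(char *out, size_t len, const char *base, int node)
{
    int n;
    assert(out != NULL && base != NULL);
    n = snprintf(out, len, "%s.%d", base, node);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Fork and exec the given image. Returns the child's pid, or -1 if
   fork() failed. A child whose exec fails exits with status 127. */
static pid_t spawn_image(const char *path)
{
    pid_t pid = fork();
    if (pid == 0) {
        char *const argv[] = { (char *)path, NULL };
        execv(path, argv);
        _exit(127);               /* only reached if execv() failed */
    }
    return pid;
}
```

A launcher would call node_image_path() once per processor and pass each
result to spawn_image(), then waitpid() on the returned pids.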

>
>While on the subject, were the older Mac G4 duals SMP or NUMA?  I think the new
>G5 duals must be SMP, since they are based on IBM's MCM (multi-chip module)
>technology (I think).

I really don't know, but I suspect that for duals, they are SMP.  A dual NUMA
doesn't make a lot of sense, unless you are like AMD and have a single chip
you want to scale from 1 to 2 to 4 to 8 to ....  NUMA scales better, and if you
have a NUMA 4-way box, a NUMA 2-way is cheap to do.

>
>Also, are Teradata machines NUMA or just clusters?  They have had massively
>parallel machines since 286 days, which makes me think they are cluster
>technology.
>


I really don't know.  But if you talk about 256 and up CPUs, they are
almost always a cluster, although they might have something much better
than 100 Mbit Ethernet to connect them.  But they are surely message-passing
rather than shared memory.





>Thanks,
>Matt
>
>
>>
>>
>>
>>
>>>
>>>>
>>>>anthony
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.