Author: Robert Hyatt
Date: 12:26:23 11/12/03
On November 12, 2003 at 13:46:21, Matthew Hull wrote:

>On November 12, 2003 at 13:34:09, Robert Hyatt wrote:
>
>>On November 12, 2003 at 13:18:48, Matthew Hull wrote:
>>
>>>On November 12, 2003 at 12:18:22, Anthony Cozzie wrote:
>>>
>>>>On November 12, 2003 at 11:55:20, Gian-Carlo Pascutto wrote:
>>>>
>>>>>On November 11, 2003 at 23:42:45, Eugene Nalimov wrote:
>>>>>
>>>>>>My point is: it's possible that due to the fact that quad Opteron is NUMA --
>>>>>>not SMP -- system, for SMP-only program performance on quad Opteron can be
>>>>>>worse than on *real* quad SMP system, even when for one CPU Opteron
>>>>>>performance is much better. Itanium was used only as an example of such
>>>>>>system, I never recommended rewriting any program for it.
>>>>>
>>>>>I don't understand how. The NUMA part is RAM. Even worst case on the Opteron
>>>>>RAM is faster than Xeon SMP. So how could it ever be worse?
>>>>>
>>>>>--
>>>>>GCP
>>>>
>>>>Aaron's argument is: if a 1x opteron is faster than a 1x Xeon, a 4x opteron will
>>>>be faster than a 4x Xeon.
>>>>
>>>>Nalimov is saying that Fritz may scale worse on the opteron due to NUMA issues.
>>>>In other words, this is comparing latency with 1x opteron and NUMA opteron
>>>>relative to 1x Xeon vs SMP Xeon.
>>>>
>>>>Off hand this seems logical to me . . .
>>>
>>>Perhaps Eugene can tell us if SMP crafty was slower on 2x opteron than Bob's 2x
>>>Xeon, before the NUMA mods were made?
>>>
>>>MH
>>
>>Yes. It was _really_ bad on the opteron. But then again it was also not
>>real good on my xeon. Even though the NPS scaled _perfectly_ on my older
>>quad xeons. The PIV went to a longer cache line, which caused some coherency
>>overhead that hurt. This has been addressed in the current code. But the
>>problem was worse on the opteron due to the NUMA delays, compared to the
>>PIV xeons which simply have a longer cache line to aggravate the problem.
>
>
>Thanks for the clarification. I had wondered since GCP has been implying good
>speedup on Opteron without bothering to design to NUMA.

I think he uses the "heavyweight process" approach, where the 4 processes don't
share anything except for stuff explicitly shared through a block of "System V
shared memory". That solves one problem, although doing a single shmget() will
result in all of that shared memory being stuck in a single processor's local
memory, which is not going to be very good.

And with heavyweight processes, you lose the ability to use the LRU buffers in
Eugene's EGTB probe code for all threads. You end up with 4 sets of EGTB
buffers, and each process could be reading the same blocks over and over,
whereas with threads, one read would share the results with all threads needing
it.

I prefer threads (lightweight processes). GCP/Vincent prefer heavyweight
processes. There's give and take in both approaches. Threads are more complex
due to sharing lots of stuff. But, IMHO, they are better overall from a
performance perspective, when you consider the EGTB stuff. On a good system,
both share instructions due to the way processes are created, but at least for
EGTB stuff, threads are definitely better.

Using fork() on a NUMA box is really no better than threads, in that the
instructions will still sit in a single processor's memory. But Eugene thinks
the cache is big enough to make this unimportant. The best way (on NUMA) is
probably to execute completely different executable file names, so that there is
no sharing and instructions get stuck in memory on the processor that runs the
process. That is a bit more complicated and until I see that it is really
necessary, I'm not going that way...

>
>While on the subject, were the older Mac G4 duals SMP or NUMA? I think the new
>G5 duals must be SMP, since they are based on IBM's MCM (multi-chip module)
>technology (I think).

I really don't know, but I suspect that for duals, they are SMP. A dual NUMA
doesn't make a lot of sense, unless you are like AMD and have a single chip you
want to scale from 1 to 2 to 4 to 8 to .... NUMA scales better, and if you have
a NUMA 4-way box, a NUMA 2-way is cheap to do.

>
>Also, are Teradata machines NUMA or just clusters? They have had massively
>parallel machines since 286 days, which makes me think they are cluster
>technology.
>

I really don't know. But if you talk about 256 and up CPUs, they are almost
always a cluster, although they might have something much better than 100 Mbit
Ethernet to connect them. But they are surely message-passing rather than
shared memory.

>Thanks,
>Matt
>
>>>>anthony
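
To illustrate the EGTB buffer point above (one shared LRU cache for all threads versus one private cache per heavyweight process), here is a minimal Python sketch. It is not Crafty's or Eugene's actual probe code: `BlockCache`, `probe_all`, `run_searchers`, the block IDs and the cache size are all hypothetical names and numbers chosen for illustration, and the "private" case is simulated with threads that each own a separate cache rather than with real processes.

```python
import threading
from collections import OrderedDict


class BlockCache:
    """Tiny LRU cache for tablebase-style blocks (hypothetical stand-in)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.lock = threading.Lock()
        self.disk_reads = 0  # how many times we had to go to "disk"

    def _read_from_disk(self, block_id):
        # Placeholder for an expensive file read.
        self.disk_reads += 1
        return f"data-for-block-{block_id}"

    def get(self, block_id):
        # Holding the lock across the read keeps the count deterministic;
        # a real cache would want finer-grained locking.
        with self.lock:
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)  # mark as most recently used
                return self.blocks[block_id]
            data = self._read_from_disk(block_id)
            self.blocks[block_id] = data
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
            return data


def probe_all(cache, block_ids):
    for block_id in block_ids:
        cache.get(block_id)


def run_searchers(caches, block_ids):
    """Run one searcher thread per entry in caches; return total disk reads."""
    workers = [
        threading.Thread(target=probe_all, args=(cache, block_ids))
        for cache in caches
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # Count each distinct cache once, even if several searchers share it.
    distinct = {id(cache): cache for cache in caches}
    return sum(cache.disk_reads for cache in distinct.values())


BLOCKS = list(range(8))
NUM_SEARCHERS = 4

# Thread model: all searchers probe through one shared cache.
shared = BlockCache(capacity=16)
shared_reads = run_searchers([shared] * NUM_SEARCHERS, BLOCKS)

# Heavyweight-process model (simulated): each searcher has its own cache.
private_reads = run_searchers(
    [BlockCache(capacity=16) for _ in range(NUM_SEARCHERS)], BLOCKS
)

print("shared cache reads: ", shared_reads)   # 8
print("private cache reads:", private_reads)  # 32
```

With 8 distinct blocks and 4 searchers, the shared cache reads each block once, while the private caches read every block once per searcher. Real probe code would also have to deal with lock contention, which this sketch ignores.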
Last modified: Thu, 15 Apr 21 08:11:13 -0700