Author: Robert Hyatt
Date: 18:45:49 08/25/02
On August 25, 2002 at 12:36:52, Vincent Diepeveen wrote:

>On August 25, 2002 at 10:09:33, Dan Andersson wrote:
>
>>>
>>>The multiprocessor versions of the Hammer will be what most guess it is.
>>>It is more like a NUMA system. Getting data from local memory is very
>>>fast (faster than RAM is now), but on the other hand getting data from
>>>other nodes is a lot slower.
>>>
>>Not a lot slower actually. And factoring in the faster memory speeds you will
>
>Factor 4 or so at 8 processors i guess.
>
>If you don't find that a lot slower, discussion is ended.
>
>>not be hit by it at all. Since the memory controllers and the communication
>>ports are connected to the CPU core by the same type of bus. And the latency
>>will be lower the faster the CPU goes. Down to an asymptotic constant limit. So
>
>I didn't pick up enough from hardware to understand why AMD is claiming
>something like this for hammer. Sounds big BS to me personally unless
>they invented a new wheel.
>
>>you need *no* complete redesign to run Crafty fast. No redesign at all,
>
>Fast is a flexible word Mr Andersson.
>
>For me the important thing is: "how many nodes a second
>is a single cpu version getting" versus "what nps
>does the 8 processor version get".
>
>if the difference is not a factor 8, then the software needs
>a redesign obviously.
>
>In case of crafty bob already answered in normal words what he feels
>he must do. I would add to that implementation specific thought:
>getting rid of smp_lock variable
>as well, which prevents all the cpu's from splitting or doing a stopthread,
>and if copying from P0 to Pn is real slow (of course also the lock()
>is a lot slower on a NUMA design) then obviously that is exponentially
>giving a problem if the number of cpu's rises.
>
>Also the issue of getting from multithreading to multiprocessing bob already
>answered, because it is of course not so smart to spend 2000 clocks just
>to get a cache line from another cpu which holds the move generation
>tables.
>All that must get done local of course (so easiest thing is
>to fork() crafty a few times and then share the hashtable and the 'tree'
>datastructure).

This may or may not be a "win". On most NUMA machines, you can't even do a
"fork()" operation. You have to actually start the process on each node in
a different way, because they don't really have a fully-shared memory
architecture... But I don't particularly consider that a problem.

I _really_ don't think you want a large shared hash table either. I think
it will be better performance-wise to have many smaller local hash tables
with some sort of message-passing lookup facility...

>
>Technical spoken of course you can rewrite multithreading to a point that
>it is more looking like multiprocessing and vice versa. There is a big
>gray area there and the important thing to realize is that in case
>of crafty it is really no big problem to do that.
>
>Getting rid of smp_lock is however, because if 4 processors at the same
>time call the function StopThread() then you add zillions of race conditions
>to the program which are not there now (at least not provable).

There are solutions to this, although with 8 and 16 CPUs the current lock
approach in Crafty is not a performance bottleneck, based on a lot of GPSS
simulations several years back. Even for 32, the overhead is bearable.
Beyond that, it does become more of an issue...

>
>I hope you realize that rewriting crafty from what it is now to something
>that is 8 times faster at 8 NUMA cpu's is a lot more work than just a
>recompile. I estimate it at 2 months of fulltime work from a skilled
>person, at least that's what i needed to port diep from a multiprocessor
>version (also having like crafty a smp_lock) to something that works
>great on a SGI NUMA now.
>
>NUMA is obviously the software design to keep in mind for the future.
>You don't need to get slowed down at all by making use of numa a look
>like architecture, it's simply a different way of design.
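
To make the "many smaller local hash tables with some sort of message-passing lookup" idea above concrete, here is a minimal single-process sketch in Python. It is only an illustration under stated assumptions, not how Crafty or any real NUMA system does it: the names `Node`, `make_cluster`, `owner_of` and `drain` are hypothetical, the owner of an entry is assumed to be chosen by `key % number_of_nodes`, and the "remote node" is simulated by draining its inbox synchronously instead of running on another processor.

```python
from collections import deque


class Node:
    """One node (hypothetical model): a private hash table plus a message inbox."""

    def __init__(self, node_id, cluster):
        self.node_id = node_id
        self.cluster = cluster    # shared list of all nodes, used only to address inboxes
        self.table = {}           # local hash table: key -> entry
        self.inbox = deque()      # pending (kind, key, payload, reply_to) messages
        self.remote_msgs = 0      # number of messages this node has sent

    def owner_of(self, key):
        # Assumption: ownership is decided by the hash key modulo the node count.
        return key % len(self.cluster)

    def store(self, key, entry):
        owner = self.owner_of(key)
        if owner == self.node_id:
            self.table[key] = entry          # local store, no message needed
        else:
            self.remote_msgs += 1            # fire-and-forget store message
            self.cluster[owner].inbox.append(("store", key, entry, None))

    def probe(self, key):
        owner = self.owner_of(key)
        if owner == self.node_id:
            return self.table.get(key)       # local probe, no message needed
        reply = deque()                      # one-shot reply channel
        self.remote_msgs += 1
        self.cluster[owner].inbox.append(("probe", key, None, reply))
        # Stand-in for the owning node servicing its inbox on its own CPU.
        self.cluster[owner].drain()
        return reply.popleft()

    def drain(self):
        # Handle queued messages in arrival order, touching only the local table.
        while self.inbox:
            kind, key, payload, reply_to = self.inbox.popleft()
            if kind == "store":
                self.table[key] = payload
            else:
                reply_to.append(self.table.get(key))


def make_cluster(n):
    """Build n nodes that can address each other's inboxes."""
    cluster = []
    for i in range(n):
        cluster.append(Node(i, cluster))
    return cluster
```

The point of the design is that no node ever reads another node's table directly; the only cross-node traffic is an explicit message, so the cost of going remote is visible and countable (`remote_msgs`) rather than hidden in a slow memory fetch.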
>
>Doesn't take away that getting memory from a remote node is very slow.
>
>>actually. But to get the absolute maximum you need to factor in the hardware.
>
>
>
>>MvH Dan Andersson
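
The two yardsticks argued over in this thread can be written down as back-of-envelope arithmetic. The sketch below is only that: `effective_latency` and `scaling_efficiency` are hypothetical helper names, the default remote penalty of 4 is simply the "factor 4 or so at 8 processors" guess quoted above (not a measurement), and local cost is an arbitrary unit of 1.

```python
def effective_latency(local_fraction, local_cost=1.0, remote_factor=4.0):
    """Average cost per memory access when only part of the traffic is local.

    local_fraction: share of accesses served from the node's own memory (0..1).
    remote_factor:  how many times slower a remote access is (assumed, not measured).
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_cost = local_cost * remote_factor
    return local_fraction * local_cost + (1.0 - local_fraction) * remote_cost


def scaling_efficiency(nps_single, nps_parallel, cpus):
    """Fraction of ideal speedup achieved: 1.0 means 'a factor 8 on 8 CPUs'."""
    return nps_parallel / (nps_single * cpus)
```

With these definitions, a program that keeps nearly all of its accesses local pays close to the local cost no matter how large the remote penalty is, which is why the argument comes down to how much data really has to be shared between nodes.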
Last modified: Thu, 15 Apr 21 08:11:13 -0700