Computer Chess Club Archives

Subject: Re: Who can update about new 64 bits chip?

Author: Robert Hyatt
Date: 18:45:49 08/25/02
On August 25, 2002 at 12:36:52, Vincent Diepeveen wrote:

>On August 25, 2002 at 10:09:33, Dan Andersson wrote:
>
>>>
>>>The multiprocessor versions of the Hammer will be what most guess it is.
>>>It is more like a NUMA system. Getting data from local memory is very
>>>fast (faster than RAM is now), but on the other hand getting data from
>>>other nodes is a lot slower.
>>>
>>Not a lot slower actually. And factoring in the faster memory speeds you will
>
>A factor of 4 or so at 8 processors, I guess.
>
>If you don't find that a lot slower, discussion is ended.
>
>>not be hit by it at all. Since the memory controllers and the communication
>>ports are connected to the CPU core by the same type of bus. And the latency
>>will be lower the faster the CPU goes. Down to an asymptotic constant limit. So
>
>I don't know enough about hardware to understand why AMD is claiming
>something like this for Hammer. It sounds like big BS to me personally unless
>they invented a new wheel.
>
>>you need *no* complete redesign to run Crafty fast. No redesign at all,
>
>Fast is a flexible word, Mr Andersson.
>
>For me the important thing is: "how many nodes a second
>is a single cpu version getting" versus "what nps
>does the 8 processor version get".
>
>if the difference is not a factor 8, then the software needs
>a redesign obviously.
>
>In the case of Crafty, Bob already answered in plain words what he feels
>he must do. I would add an implementation-specific point to that:
>getting rid of the smp_lock variable
>as well, which prevents all the CPUs from splitting or doing a StopThread(),
>and if copying from P0 to Pn is really slow (of course the lock()
>is also a lot slower on a NUMA design) then obviously that becomes
>an exponentially bigger problem as the number of CPUs rises.
>
>Bob has also already answered the issue of getting from multithreading to
>multiprocessing, because it is of course not so smart to spend 2000 clocks just
>to get a cache line from another CPU which holds the move generation
>tables. All that must be done locally of course (so the easiest thing is
>to fork() Crafty a few times and then share the hashtable and the 'tree'
>datastructure).

This may or may not be a "win".  On most NUMA machines, you can't even do
a "fork()" operation.  You have to actually start the process on each node
in a different way, because they don't really have a fully-shared memory
architecture...  But I don't particularly consider that a problem.  I
_really_ don't think you want a large shared hash table either.  I think
it will be better performance-wise to have many smaller local hash tables
with some sort of message-passing lookup facility...
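
A minimal sketch of the "many smaller local hash tables" idea, in C. This is not Crafty's code: the node count, table size, entry layout, replacement rule and every function name here are hypothetical, and the "message" is simulated by a plain function call that only counts how many lookups would have had to cross to another node. Keys are assumed to be nonzero 64-bit position signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names and sizes are hypothetical, chosen only for illustration. */
#define NUM_NODES      4     /* assumed number of NUMA nodes */
#define SLOTS_PER_NODE 1024  /* assumed local table size, power of two */

typedef struct {
    uint64_t key;    /* full position signature, 0 means empty */
    int16_t  score;
    uint8_t  depth;
} tt_entry;

typedef struct {
    tt_entry slots[SLOTS_PER_NODE];  /* would live in the node's local memory */
    unsigned long remote_requests;   /* lookups served on behalf of other nodes */
} node_table;

static node_table tables[NUM_NODES];

/* High bits of the key pick the owning node, low bits pick the slot. */
static int tt_owner(uint64_t key) { return (int)((key >> 48) % NUM_NODES); }
static size_t tt_slot(uint64_t key) { return (size_t)(key & (SLOTS_PER_NODE - 1)); }

/* Stand-in for a message to the owning node. On real hardware this would
   be a request/reply exchange, not a direct pointer into remote memory. */
static tt_entry *tt_request(int from_node, uint64_t key) {
    int owner = tt_owner(key);
    assert(from_node >= 0 && from_node < NUM_NODES);
    if (owner != from_node)
        tables[owner].remote_requests++;
    return &tables[owner].slots[tt_slot(key)];
}

/* Depth-preferred replacement; empty slots have depth 0, so they always lose. */
void tt_store(int from_node, uint64_t key, int16_t score, uint8_t depth) {
    tt_entry *e = tt_request(from_node, key);
    if (depth >= e->depth) {
        e->key = key;
        e->score = score;
        e->depth = depth;
    }
}

/* Returns 1 and fills *score on a hit of sufficient depth, otherwise 0. */
int tt_probe(int from_node, uint64_t key, uint8_t min_depth, int16_t *score) {
    const tt_entry *e = tt_request(from_node, key);
    if (key == 0 || e->key != key || e->depth < min_depth)
        return 0;
    *score = e->score;
    return 1;
}
```

The remote_requests counter is the interesting number in a sketch like this: it shows how much traffic a given key-to-node mapping would push across the interconnect, which is the cost being weighed against one large shared table.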



>
>Technically speaking, of course, you can rewrite multithreading to the point that
>it looks more like multiprocessing, and vice versa. There is a big
>gray area there, and the important thing to realize is that in the case
>of Crafty it is really no big problem to do that.
>
>Getting rid of smp_lock is a big problem, however, because if 4 processors call
>the function StopThread() at the same time then you add zillions of race conditions
>to the program which are not there now (at least not provably).

There are solutions to this.  Although with 8 and 16 cpus the current lock
approach in crafty is not a performance bottleneck, based on a lot of GPSS
simulations several years back.  Even for 32, the overhead is bearable.  Beyond
that, it does become more of an issue...
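
For a rough feel of why a single lock can be harmless at 8 or 16 CPUs and start to hurt beyond 32, here is a small C sketch of Amdahl's law. It is only a crude upper bound: it treats time spent holding the lock as a fixed serialized fraction and ignores queueing for the lock and NUMA latency, so it is not a substitute for a real simulation. The function names are hypothetical and no value passed to them is a measurement of Crafty.

```c
#include <assert.h>

/* Amdahl's law: if a fraction `serial` of the work is serialized
   (here: time spent holding one global lock), the speedup on n
   processors is bounded by 1 / (serial + (1 - serial) / n). */
double speedup_bound(double serial, int n) {
    assert(serial >= 0.0 && serial <= 1.0 && n >= 1);
    return 1.0 / (serial + (1.0 - serial) / (double)n);
}

/* Parallel efficiency: 1.0 means n processors give n times the nps. */
double efficiency(double serial, int n) {
    return speedup_bound(serial, n) / (double)n;
}
```

As an illustration only, a purely hypothetical serialized fraction of 1% bounds the speedup at about 7.5 on 8 CPUs and about 24 on 32 CPUs, so the same lock costs proportionally more as the CPU count grows.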


>
>I hope you realize that rewriting Crafty from what it is now to something
>that is 8 times faster on 8 NUMA CPUs is a lot more work than just a
>recompile. I estimate it at 2 months of full-time work for a skilled
>person; at least that's what I needed to port Diep from a multiprocessor
>version (which, like Crafty, also had an smp_lock) to something that works
>great on an SGI NUMA machine now.
>
>NUMA is obviously the software design to keep in mind for the future.
>You don't need to get slowed down at all by making use of a NUMA-like
>architecture; it's simply a different way of designing.
>
>That doesn't take away that getting memory from a remote node is very slow.
>
>>actually. But to get the absolute maximum you need to factor in the hardware.
>
>
>
>>MvH Dan Andersson
Last modified: Thu, 15 Apr 21 08:11:13 -0700