Author: Robert Hyatt
Date: 07:57:31 09/04/03
On September 04, 2003 at 05:55:08, Mridul Muralidharan wrote:

>Hi,
>
>  My comments inline.
>
>Regards
>Mridul
>
>On September 03, 2003 at 12:04:01, Robert Hyatt wrote:
>
>>On September 03, 2003 at 05:13:07, Mridul Muralidharan wrote:
>>
><snip>
>>>If it was just a bunch of tweaks that you mention here - I would love to see how
>>>much performance it will give on a 64/128 proc NUMA box :)
>>>I can make a guess - it will suck a**. (No offence to anyone here)
>>
>>You are mixing apples and oranges.  How will it do on a 128 node SMP
>>box?  How will it do on a 128 node NUMA box?  _both_ will not do very well
>>since things are not tuned for that many processors.  However, the original
>>NUMA port did pretty well on a 32 CPU box.  Not as well as it would have done
>>on a 32 CPU SMP box however.  But then NUMA won't _ever_ produce the same
>>level of performance as pure SMP boxes will.  They are just much more
>>affordable.
>>
>
> If you properly design and implement a version of crafty _for_ NUMA and just
>"tweak" a current version for NUMA (that was originally written for SMP ?!) -
>then do you expect both to give out same or even comparable performance ?!!

Yes, that's the way I write code.  Here's a list of things I did for the
Compaq port.  They address all the NUMA issues:

1. No threads.  Can't afford to share code or global data that doesn't
change (i.e. attack tables for bitmaps, etc.)  This could be done on the
current version easily by changing to fork() rather than clone() on Linux.
Or by simply starting separate processes manually or via something other
than fork() (rsh or whatever).

2. Local split blocks.  This is the critical problem.  A split block
contains all data needed for a search, including chess board, hash stuff,
repetition list, etc.  That _must_ be local to the processor that is using
it to do the search.  And this was the biggest change I had to make.
It wasn't a difficult change, as I used a global array of pointers to split
blocks, but each processor was responsible for malloc()'ing its own split
blocks so that they were local.  The global array of pointers let other
processors copy data to a split block that was local to the processor that
was going to help.  This was not a major project, but there were some
"issues", from allocating a split block on the right processor to trying to
keep "groups" of processors working together if they were "close" in terms
of NUMA access latency.

3. Transposition table.  Locks are expensive.  Tim and I ended up doing what
Harry Nelson had done for Cray Blitz, the "lockless approach", but only
after the Compaq project was over.  This was an "issue" that needed
attention, and today it has been solved, whereas on the NUMA Compaq box it
was not so clean.  Sharing the thing invites trouble.  Compaq had some
library stuff that allows you to pre-load things to avoid the access
latency, and this helped some.  But with a lock, there is nothing you can do
to avoid the problems, since the lock is shared across a set of routers.

4. Sharing important stuff.  This has to be handled differently if you use
separate processes, but it is mainly some messy programming rather than
innovative development.

>Of course - I never said crafty won't work - it will - but the performance will be
>pathetic as compared to the NUMA version.

The modified version _would_ be the NUMA version in my case.

>Most likely the performance on a 64 proc NUMA box will turn out to be better
>than a 128 proc NUMA box !!

Depends.  This might be a non-NUMA issue however.  I.e. using a large number
of processors has its own set of issues, whether the underlying platform is
NUMA or not.  But remember, I am _not_ new to NUMA.  I ran programs on the
first Connection Machine 20 years ago (not chess).

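The post does not spell out the "lockless approach" from item 3.  One widely
described form of lockless hashing stores the key XORed with the data, so a
probe can detect an entry that two processors wrote at the same time; the
sketch below assumes that is the scheme meant.  HashEntry, TABLE_SIZE,
hash_store and hash_probe are hypothetical names, not Crafty's, and the
plain 64-bit loads and stores would be relaxed atomics in modern C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024   /* hypothetical size; must be a power of two */

/* Hypothetical two-word entry: check = key XOR data. */
typedef struct {
    uint64_t check;
    uint64_t data;
} HashEntry;

static HashEntry table[TABLE_SIZE];

/* Store without a lock.  If another processor overwrites only one of the
   two words, check ^ data no longer equals either writer's key. */
void hash_store(uint64_t key, uint64_t data) {
    HashEntry *e = &table[key & (TABLE_SIZE - 1)];
    e->check = key ^ data;
    e->data = data;
}

/* Probe without a lock.  Returns 1 and fills *data_out on a consistent hit,
   0 on a miss or a torn entry.  Note: an all-zero (empty) slot would match
   key 0, so a real table needs a key or data word that is never zero. */
int hash_probe(uint64_t key, uint64_t *data_out) {
    assert(data_out != NULL);
    const HashEntry *e = &table[key & (TABLE_SIZE - 1)];
    uint64_t data = e->data;
    uint64_t check = e->check;
    if ((check ^ data) != key)
        return 0;
    *data_out = data;
    return 1;
}
```

A torn entry simply looks like a miss, so the cost of a collision between
two writers is one lost hash hit rather than a lock acquired across the
routers.
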
>Hence - to get this working properly on NUMA boxes (not itsy bitsy 8 or 16 proc
>machines - though 16 proc box would be pretty cool to own ;) ) then you _do_
>need a redesign and reimplementation - not a bunch of tweaks.
>When I say redesign/reimplementation - I'm only referring to search, and mem
>management : that is what is usually required.

I agree.  I just disagree on the size of the change.  Crafty has some things
_already_ in it for such architectures in general.
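
The search and memory management change described in item 2 (a global array
of pointers, with each processor allocating its own split blocks) can be
sketched roughly as follows.  This is an illustration under assumptions, not
Crafty's code: SplitBlock, split_blocks, init_local_blocks, give_work and
the sizes are hypothetical, the struct is heavily trimmed, and there is no
synchronization.  It also assumes the OS places memory on the node of the
processor that allocates and first touches it; with separate processes the
pointer table and the blocks would have to live in shared memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 4         /* hypothetical */
#define BLOCKS_PER_CPU 2   /* hypothetical */

/* Hypothetical, trimmed stand-in for a split block. */
typedef struct {
    int owner;                 /* CPU that allocated and searches with it */
    int in_use;
    unsigned char board[64];
    int ply;
} SplitBlock;

/* Global table of pointers, visible to every CPU.  Each pointee is
   allocated by its owner so that it can be local to that CPU. */
static SplitBlock *split_blocks[MAX_CPUS][BLOCKS_PER_CPU];

/* Each CPU calls this for itself, so the allocation and the first write
   happen on the CPU that will use the blocks.  Returns 0 on success. */
int init_local_blocks(int cpu) {
    assert(cpu >= 0 && cpu < MAX_CPUS);
    for (int i = 0; i < BLOCKS_PER_CPU; i++) {
        SplitBlock *b = calloc(1, sizeof *b);
        if (b == NULL)
            return -1;
        b->owner = cpu;
        split_blocks[cpu][i] = b;
    }
    return 0;
}

/* The CPU that owns `parent` copies its state into a free block owned by
   `helper`, found through the global pointer table.  Returns the helper's
   block, or NULL if the helper has no free block. */
SplitBlock *give_work(const SplitBlock *parent, int helper) {
    assert(helper >= 0 && helper < MAX_CPUS);
    for (int i = 0; i < BLOCKS_PER_CPU; i++) {
        SplitBlock *b = split_blocks[helper][i];
        if (b != NULL && !b->in_use) {
            memcpy(b->board, parent->board, sizeof b->board);
            b->ply = parent->ply;
            b->in_use = 1;
            return b;
        }
    }
    return NULL;
}
```

The one remote access is the copy into the helper's block; after that the
helper searches entirely out of its own local memory.
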
Last modified: Thu, 15 Apr 21 08:11:13 -0700