Computer Chess Club Archives

Subject: Re: Dual Opteron 248 - recommended or not?

Author: Robert Hyatt
Date: 12:34:17 11/19/03
On November 19, 2003 at 12:54:30, Eugene Nalimov wrote:

>On November 19, 2003 at 12:15:23, Robert Hyatt wrote:
>
>>On November 18, 2003 at 16:06:43, Gian-Carlo Pascutto wrote:
>>
>>>On November 18, 2003 at 15:58:39, Russell Reagan wrote:
>>>
>>>>On November 18, 2003 at 15:22:51, Gian-Carlo Pascutto wrote:
>>>>
>>>>>Don't forget the Opteron is WAY faster running old 32 bit code too. You don't
>>>>>need 64 bit applications to take advantage of them.
>>>>
>>>>I don't know about that. I've seen a lot of benchmarks where the Opteron just
>>>>edges out the P4 (faster, but not WAY faster). I've also seen some where the P4
>>>>edges out the Opteron, but they are almost always in the same ballpark.
>>>>
>>>>The specint scores for a 2GHz Opteron and a 2GHz Athlon are not too far off
>>>>either. The Opteron scores were about 18-20% faster than the equivalently
>>>>clocked Athlon running Crafty, but the Opteron scores for 32-bit code were
>>>>compiled with the latest Intel C++ compiler (7 something), while the only
>>>>Athlon 2GHz scores they have were using Intel C++ 5 something. I suspect if
>>>>both used the newer compiler, the difference would be less than 18%, which is
>>>>not WAY faster.
>>>
>>>IIRC for Athlons the Intel compilers are equivalent (They're INTEL compilers
>>>after all ;-)
>>>
>>>>How about Deep Sjeng? You posted your 64-bit numbers. Do you have any numbers
>>>>that would compare an equivalently clocked Opteron and 32-bit Athlon both
>>>>running 32-bit code? You still haven't told us if Deep Sjeng uses bitboards,
>>>>which makes it difficult to extract meaning from your 64-bit numbers. A 70%
>>>>speedup for a bitboard program is very nice, but a 70% speedup for a
>>>>non-bitboard program would really say something, considering Crafty only gets
>>>>about a 60% boost.
>>>
>>>Considering you know that Crafty is the archetypal bitboard program,
>>>and that I haven't exactly kept my opinions about bitboards secret,
>>>perhaps the answer to that question isn't _that_ hard to figure out.
>>>
>>>>That is something I've been very curious about lately, whether a chess program
>>>>that doesn't use 64-bit values heavily (0x88, any array based program, etc.)
>>>>will get much of a speed boost on the Opteron compared to the fastest 32-bit
>>>>processors. Crafty is already faster than a lot of non-bitboard programs on
>>>>32-bit hardware. If it gets a 60-70% boost, while others get a 10-20% boost,
>>>>that's a significant blow to the non-bitboarders.
>>>
>>>I'll have more Opteron data 'soon'.
>>>
>>>But really, the chip is fast 32 bit and BLAZING 64 bit. I don't understand
>>>why people still have questions. I don't. And I've noticed Bob and Eugene
>>>don't have any more either these days ;)
>>>
>>>--
>>>GCP
>>
>>
>>I _still_ have questions.  The NUMA issues are non-trivial.  Memory hot-spots
>>kill performance.  Etc.  But I agree that done right, a program can really zip
>>right along.
>>
>>After studying it quite a bit, I would not yet suggest Linux as the platform
>>for a NUMA box.  I'm looking at it closely and may fool around with it
>>some myself once we get an opteron in here, if the problems are not solved
>>before then.  The issues are "interesting" to say the least.  The hash table
>>is just _one_ example of what is "interesting".
>
>Use Windows :-)
>
>I believe we resolved the majority of issues (though yes, the Windows NUMA API is
>minimalist and sometimes ugly). If you hit some non-driver-related problems,
>AMD or NT people will help you :-)
>
>Thanks,
>Eugene


The issues that have my interest on Linux are the following:

1.  I can start a new thread using clone() and I can (must) create a stack
for that thread before it begins execution.  That's easy.  But what is not
easy is figuring out which local memory I need and then allocating memory
there.  The only viable solution I have thought of so far is to add a
new Nmalloc() sort of function to Linux which lets me supply either a processor
ID or a real memory range (the physical memory of the processor I want this
thread to run on).  That's not there yet in Linux.  It is not hard to add
a new system call to do this, however, and it is likely to be mandatory to make
NUMA work well.

2.  I can't bind a thread to a processor, so even if I get the stack in the
right area of memory, I can't guarantee that the thread runs on that processor
and therefore runs efficiently.  This is also needed, and this part is not that
hard to add to the scheduler.  I would _prefer_ a better methodology where the
system keeps track of the real memory where stuff is malloc()'ed and tries to
schedule a thread on the cpu where that memory is local.  Maybe the stack would
be the critical thing to match up to, since that is hit all the time.

3.  Then there is the issue of "code".  Maybe sharing code is good, as you
mentioned, due to cache utilization.  But then again, maybe it should be
replicated so that it appears local to the processor using it.  This could
be an automatic thing.  (I.e., at the moment we share everything, using
copy-on-write to make exclusive copies of pages only when they are modified,
when using a non-thread-based approach such as heavyweight processes with
fork().)  Maybe we could do a "copy on use" for executable or non-writable
pages so that they migrate to the local memory of the cpu running that process.
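
To make points 1 and 2 concrete, here is a minimal sketch of what "bind the thread, then get its memory locally" could look like.  It is not the Nmalloc() call proposed above, and it assumes a later Linux/glibc than the one under discussion: one that provides pthread_setaffinity_np() and sched_getcpu(), and whose default page placement is "first touch" (a page is backed by memory near the CPU that first writes to it).  It uses pthreads rather than raw clone() for brevity.  The struct and function names are hypothetical, chosen only for illustration.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread context; all names here are illustrative. */
struct worker {
    int    cpu;     /* CPU the thread should be bound to            */
    size_t bytes;   /* size of the thread-private buffer            */
    void  *local;   /* buffer allocated and touched by the thread   */
    int    ran_on;  /* CPU observed after binding, -1 on failure    */
};

/* Return the lowest-numbered CPU the caller may run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &set))
            return c;
    return -1;
}

/* Restrict the calling thread to a single CPU.  Returns 0 on success. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    if (pin_self_to_cpu(w->cpu) != 0)
        return NULL;

    /* Allocate and write the buffer only AFTER binding.  Assumption:
       with first-touch placement the freshly mapped pages end up in
       memory local to this CPU. */
    w->local = malloc(w->bytes);
    if (w->local != NULL)
        memset(w->local, 0, w->bytes);

    w->ran_on = sched_getcpu();
    return NULL;
}

/* Start one bound worker and wait for it.  Returns 0 on success. */
static int run_pinned_worker(struct worker *w)
{
    pthread_t t;

    w->local  = NULL;
    w->ran_on = -1;
    if (pthread_create(&t, NULL, worker_main, w) != 0)
        return -1;
    pthread_join(t, NULL);
    return (w->ran_on == w->cpu && w->local != NULL) ? 0 : -1;
}
```

A caller would fill in `cpu` (for example from first_allowed_cpu()) and `bytes`, call run_pinned_worker(), and free `local` afterwards.  Note what the sketch does not do: it never asks which memory belongs to which processor, so it only approximates point 1, and it does nothing about the thread's own stack or about code pages (point 3).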

Lots of ideas.  Lots of things to consider.  That's just scratching the
surface...
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.