Computer Chess Club Archives


Subject: Re: AMD A64 X2 DUALCORE

Author: Robert Hyatt

Date: 13:40:11 08/08/05



On August 04, 2005 at 05:18:43, Vincent Diepeveen wrote:

>On August 03, 2005 at 13:34:09, Robert Hyatt wrote:
>
>>On August 03, 2005 at 10:21:20, Gian-Carlo Pascutto wrote:
>>
>>>On August 03, 2005 at 10:13:45, Sedat wrote:
>>>
>>>>Hi there,
>>>>
>>>>Does anybody has any information about this processor ?
>>>>
>>>>-Can i run engine-matches with ponder on ?
>>>>
>>>>I mean:
>>>>-Does the kns of the engines will fall down ?
>>>>
>>>>And if its possible to run ponder on  matches :
>>>>-is it enough just one  processor or i need to buy two  processors ?
>>>
>>>A single dualcore processor behaves almost exactly like a 2 processor machine.
>>>
>>>--
>>>GCP
>>
>>This needs a _lot_ more testing before saying that so positively.  I've been
>>testing on a quad dual-core box, and there are most definitely "issues" to deal
>>with that I/we have not yet solved.  There are some memory issues that I am
>>working on quantifying, probably related to two cores sharing a memory bank and
>>the associated bus contention.  First cut on the quad 875 box produced some
>>really ugly SMP results for me, with the NPS "scalability" only reaching 4X
>>generally, where on the quad 850 I last tested on, it scaled perfectly for 1-4
>>processors...
>>
>>More as I work out the glitches (I hope)..
>
>This is because crafty doesn't scale.



Here is some NPS data from a quad 875.  Perhaps this will forever put to rest
your crap about "Crafty doesn't scale".  You said it for two cpus.  When proved
wrong, you said "OK, it scales for 2 cpus, but it won't scale for 4."  Last
year's quad 850 put that to rest, so your battle-cry became "OK, it scales for
4, but it won't scale for 8."  The only problem with that bit of nonsense is the
data that follows:

              time=3:20  mat=-3  n=371229148  fh=93%  nps=1.86M
              time=3:20  mat=-3  n=387456032  fh=94%  nps=1.94M
              time=3:20  mat=-2  n=393026421  fh=94%  nps=1.97M
              time=3:20  mat=-2  n=438914927  fh=96%  nps=2.19M
              time=3:20  mat=-1  n=446588116  fh=95%  nps=2.23M
              time=3:20  mat=-2  n=431924227  fh=93%  nps=2.16M
              time=3:20  mat=-2  n=438675907  fh=94%  nps=2.19M
              time=3:20  mat=1  n=418326816  fh=92%  nps=2.09M
              time=3:20  mat=-1  n=423806907  fh=92%  nps=2.12M
              time=3:20  mat=-1  n=412172447  fh=91%  nps=2.06M
              time=3:20  mat=0  n=452391068  fh=93%  nps=2.26M
              time=3:20  mat=0  n=452926645  fh=93%  nps=2.26M
              time=3:20  mat=0  n=452392284  fh=93%  nps=2.26M
              time=3:20  mat=0  n=463384147  fh=92%  nps=2.32M
              time=3:20  mat=0  n=458280168  fh=92%  nps=2.29M
              time=3:20  mat=5  n=450563188  fh=93%  nps=2.25M
              time=3:20  mat=0  n=448600113  fh=92%  nps=2.24M
              time=3:20  mat=0  n=430064921  fh=92%  nps=2.15M
              time=3:20  mat=-1  n=389967561  fh=91%  nps=1.95M
              time=3:20  mat=-1  n=404756618  fh=90%  nps=2.02M
              time=3:20  mat=-1  n=393679306  fh=90%  nps=1.97M
              time=3:20  mat=-1  n=391818850  fh=90%  nps=1.96M
              time=3:20  mat=-1  n=397601175  fh=90%  nps=1.99M
              time=3:20  mat=-1  n=400104638  fh=90%  nps=2.00M
opteron% grep nps cbrun.log.mt=8b.200
              time=3:20  mat=-3  n=2806042190  fh=93%  nps=14.03M
              time=3:20  mat=-3  n=2916788431  fh=93%  nps=14.58M
              time=3:20  mat=-2  n=2873784299  fh=93%  nps=14.37M
              time=3:20  mat=-2  n=3103862678  fh=95%  nps=15.52M
              time=3:20  mat=-1  n=3337271703  fh=94%  nps=16.69M
              time=3:20  mat=-2  n=3260770604  fh=93%  nps=16.30M
              time=3:20  mat=-2  n=3354633576  fh=93%  nps=16.77M
              time=3:20  mat=1  n=3226460542  fh=92%  nps=16.13M
              time=3:20  mat=-1  n=3139153516  fh=92%  nps=15.70M
              time=3:20  mat=-1  n=3244312269  fh=92%  nps=16.22M
              time=3:20  mat=0  n=3297166602  fh=92%  nps=16.49M
              time=3:20  mat=0  n=3413063738  fh=92%  nps=17.07M
              time=3:20  mat=0  n=3376699324  fh=92%  nps=16.88M
              time=3:20  mat=0  n=3311204158  fh=91%  nps=16.56M
              time=3:20  mat=0  n=3374495270  fh=92%  nps=16.87M
              time=3:20  mat=5  n=3298033933  fh=92%  nps=16.49M
              time=3:20  mat=0  n=3224183497  fh=91%  nps=16.12M
              time=3:20  mat=0  n=3056891697  fh=91%  nps=15.28M
              time=3:20  mat=-1  n=2870594426  fh=91%  nps=14.35M
              time=3:20  mat=-1  n=3023374513  fh=90%  nps=15.12M
              time=3:20  mat=-1  n=2977525743  fh=90%  nps=14.89M
              time=3:20  mat=-1  n=2993177265  fh=89%  nps=14.97M
              time=3:20  mat=-1  n=3003140555  fh=89%  nps=15.02M
              time=3:20  mat=-1  n=2996154121  fh=89%  nps=14.98M
opteron%


The first set of numbers is from a single-cpu run on the quad 875, over the old
Cray Blitz test positions we were using in last year's argument.  The second set
of numbers is with 8 cpus.  The time limit for both runs was 200 seconds per
position.  The numbers are not "perfect" yet, but for the first position the NPS
scaling is 14.03M / 1.86M = 7.54X, which is already close to 8.0.

There is something fishy with POSIX threads on this SUSE system, so I
resurrected a non-threads version of Crafty that uses fork() and System V shared
memory to take POSIX threads out of the equation.  Before you start salivating,
the parallel search is _identical_.  Shared hash.  Shared "tree" structure
blocks.  I just put all the other global shared data (locks, anything modified
inside the search) into yet another shared struct and address it via another
SYSV shared memory block.  So the only changes needed were to use fork() to
spawn processes, and, prior to that, the shmget/shmat system calls to allocate
memory that stays shared across fork().  The old version (threads) will soon be
tested on Windows to see if it still scales perfectly as it did last year...
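
For anyone who wants the flavor of that change, here is a minimal sketch (not
Crafty's actual code; the sizes, names, and cpu count are made-up placeholders)
of allocating memory with shmget/shmat so that it stays shared across fork():

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define HASH_BYTES   (64UL * 1024 * 1024)   /* hypothetical hash size */
  #define GLOBAL_BYTES ( 1UL * 1024 * 1024)   /* locks + shared globals */

  /* allocate an anonymous SysV segment; children still see it after fork() */
  static void *grab_shared(size_t bytes) {
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); exit(1); }
    void *p = shmat(id, NULL, 0);
    if (p == (void *) -1) { perror("shmat"); exit(1); }
    shmctl(id, IPC_RMID, NULL);   /* freed when the last process detaches */
    return p;
  }

  int main(void) {
    void *hash    = grab_shared(HASH_BYTES);    /* shared hash table      */
    void *globals = grab_shared(GLOBAL_BYTES);  /* locks, split blocks... */
    memset(hash, 0, HASH_BYTES);
    memset(globals, 0, GLOBAL_BYTES);
    for (int i = 1; i < 8; i++)                 /* one helper per cpu     */
      if (fork() == 0)
        _exit(0);              /* child: would search via hash/globals    */
    while (wait(NULL) > 0)                      /* parent reaps helpers   */
      ;
    return 0;
  }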

Hopefully, this will put your bunk claims to bed once and for all, at least for
NUMA boxes up to 8 processors.  When I find time, I'll dig up some more NUMA
results for larger numbers of processors before you get a chance to start "OK,
it will scale for 8, but it will never scale for 16."  Sorry, but it has
_already_ scaled to 16.  This non-thread version was something I worked on a
couple of years ago for a NUMA Alpha box that didn't support threads across
nodes but did support shared memory.  It took about two days to make the changes
the first time, about a day to make them to the current version, plus a half-day
to debug a stupid error I made.  There is still a little work left to get that
last .5X factor, related to making the SYSV memory allocation grab memory on the
correct node rather than on the node that initially shmget()s the segment.  That
is not hard to do, so by the weekend we ought to be at 8X, just as we were at 4X
last year.
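
That node-placement fix can be expressed with the libnuma calls mentioned
further down; a rough sketch (an assumed placement policy, not Crafty's actual
code), run in each process after shmat(), would look like this:

  #include <numa.h>          /* link with -lnuma */
  #include <string.h>

  /* sketch only: bind this process's slice of a segment to its own node */
  static void place_on_node(void *base, size_t bytes, int node) {
    if (numa_available() < 0)              /* no NUMA support in kernel   */
      return;
    numa_run_on_node(node);                /* keep this process on node   */
    numa_tonode_memory(base, bytes, node); /* ask for pages from that node */
    memset(base, 0, bytes);                /* touch pages so policy sticks */
  }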

Sorry to burst your bubble yet again.  But it is just so damned easy to do when
you make statements that you don't know anything at all about...









>
>Not a hardware issue.
>
>Memory latency is 234 ns to get 8 bytes of TLB trashing memory from 250MB
>buffers (in total 2GB ram for total testblock).
>
>Compare with 400 ns that your own dual Xeon needs to deliver the same
>and compare with 700 ns that 8 processor Xeon needs.

My dual Xeon doesn't need 400ns to access a random word of memory.  We've
already been through this.  2MB memory pages solve that instantly.  And since
most normal memory references are somewhat sequential, the TLB gets hit more
than 99% of the time anyway...
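
One way to get those 2MB pages on Linux is a SysV segment created with
SHM_HUGETLB.  A minimal sketch, assuming huge pages have been reserved via
vm.nr_hugepages, and not necessarily how Crafty actually sets it up:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  #ifndef SHM_HUGETLB
  #define SHM_HUGETLB 04000            /* older headers may not define it */
  #endif

  /* request 2MB pages for the hash; fall back to normal pages if denied  */
  void *alloc_hash(size_t bytes) {     /* bytes: a multiple of 2MB        */
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | SHM_HUGETLB | 0600);
    if (id < 0)
      id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); exit(1); }
    void *p = shmat(id, NULL, 0);
    if (p == (void *) -1) { perror("shmat"); exit(1); }
    shmctl(id, IPC_RMID, NULL);
    return p;
  }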



>
>I guess the central lock structure in crafty breaks it at 8 cpu's.

You guess wrong.  It has no effect whatsoever for 8 cpus.  Nor for 16.  By the
time it gets to 64, it might begin to have an impact, but there are solutions...
The central lock is a convenience, _not_ a necessity.
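
To make "not a necessity" concrete: a small spin lock can live in each split
block instead of funneling everything through one global lock.  Sketch only,
using gcc atomic builtins rather than Crafty's real lock.h:

  typedef volatile int lock_t;

  static void acquire(lock_t *l) {
    while (__sync_lock_test_and_set(l, 1))   /* atomic exchange           */
      while (*l)                             /* spin on a plain read      */
        ;
  }
  static void release(lock_t *l) { __sync_lock_release(l); }

  typedef struct split_block {
    lock_t lock;            /* protects only this block's move list, etc. */
    /* ... per-split search state ... */
  } SPLIT_BLOCK;            /* one lock per block, no central bottleneck  */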



>
>Diep is not central locking, of course tested to work at ugly latencies
>until 500 cpu's and has zero problems with quad opteron dual core 1.8Ghz
>at which i play at.

I didn't have this problem on the first dual-core I ran on either.  There is
something different about the 875 that is unknown at present (to me).  But it
will be understood...



>
>Please note the latency for 2.2Ghz dual cores is far better because the
>latency of each memory controller is somewhat dependant upon the speed of the
>processor.

Please note that each memory controller has to handle 2x the requests.  The
MOESI traffic goes through a single HyperTransport interface as well, so it has
to handle 2x the traffic that it would normally see on a single-core processor.
There _are_ a few bottlenecks when you add a second core and a second L2 but
keep everything else "as is".


>
>So the problem is not the hardware at all, but software issues within crafty.

Based on what?  Crafty is doing just fine after getting rid of POSIX threads
and slightly reducing the depth at which parallel splits can be done, which
limits the memory bandwidth consumed by splitting too deeply in the tree.
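
That depth restriction amounts to a one-line test before a split is attempted;
something like this (the names and threshold are placeholders, not Crafty's
actual parameters):

  #define MIN_SPLIT_DEPTH 5          /* raise this to cut split overhead  */

  static int ok_to_split(int remaining_depth, int idle_threads) {
    if (remaining_depth < MIN_SPLIT_DEPTH)  /* too close to the leaves;   */
      return 0;                             /* split blocks churn memory  */
    if (idle_threads == 0)                  /* nobody to help anyway      */
      return 0;
    return 1;
  }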



>
>Any default x86-64 core 2.6.10 or later by default already is NUMA and works
>perfectly. No need to compile your own core.

Core?  You mean "kernel"?

If so, try the "default RHE core" and force "numa=on" as a kernel argument.  The
result is a kernel PANIC.  Every time.  SUSE seems to work perfectly, however,
and the libnuma stuff is functioning correctly now...


>
>I installed Ubuntu at quad, upgraded to x86-64 kernel (thanks to Mridul
>Muralidharan for his big help!) and it worked fine.
>
>Ubuntu is the superior distribution nowadays.
>
>Vincent

Don't know anything about it.  AMD has folks working with SUSE on their release,
and it has always worked perfectly for me.  In fact, the new gcc 3.3.3 actually
works for profile-guided optimization of Crafty, where in the past it always
crashed badly (the compiler, not Crafty).



