Computer Chess Club Archives

Subject: Re: Conclusion

Author: Robert Hyatt
Date: 17:10:31 12/28/03

On December 28, 2003 at 15:51:20, Vincent Diepeveen wrote:

>On December 27, 2003 at 13:59:12, Robert Hyatt wrote:
>
>Cut the nonsense, Bob,
>the Opteron's local latency is 3.5 times lower than your quad Xeon's,

It absolutely _does_ not.  Measured latency is on the order of 80ns.  If
you trash the TLB, multiply by 3 just like any X86 memory architecture...


>and main memory access latency is one of the bigger problems of Crafty,
>so of course anything works well at that hardware SMP.
>
>Your thing is SMP, not NUMA. Don't confuse the 2 things. Your thing won't run
>on a 512-processor SGI Origin 3800 with its 5.8 us latency, of course.

It won't _now_.  It might one day _if_ the origin was an interesting machine.

However, it is _not_ that interesting.  There are many better boxes.  I'd
much rather have a 64-way Itanium or Alpha than a 512-way MIPS machine.

>
>Note this is a very good latency.
>
>Clusters that are in top5 of top500.org have around 10-20 us latency.
>
>Only T3E is a few microseconds less than this which is of course $$$$$$$$$. Fill
>in digits 0..9 but don't start with a 0 :)
>
>Your dual Xeon has about 400ns latency, my dual K7 of course too (both 133MHz
>RAM with a chipset in between, so no big surprise there that it is the same).

My dual xeon has a 150ns latency.  If you blow the TLB, memory access time
goes to 450ns because of the two extra memory accesses to translate a virtual
to a real address.
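
The arithmetic behind those two figures can be sketched as follows. This is
only a simplified model, not a measurement: it assumes every page-table level
costs one full memory access at the same latency as the data access. The
function name and the `page_table_levels` parameter are illustrative; the
default of 2 comes from the "two extra memory accesses" mentioned above.

```python
def access_latency_ns(base_ns, tlb_hit, page_table_levels=2):
    """Model the latency of a single memory access.

    On a TLB miss the hardware first reads one entry per page-table
    level (each counted as a full memory access in this simplified
    model) before it can read the data itself.
    """
    extra_accesses = 0 if tlb_hit else page_table_levels
    return base_ns * (1 + extra_accesses)


print(access_latency_ns(150, tlb_hit=True))   # 150
print(access_latency_ns(150, tlb_hit=False))  # 450
```

The same factor of 3 is what "multiply by 3" refers to earlier in the post.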

>
>However, the Opteron has around 120 ns latency to local memory and only very
>slightly more on dual and quad systems.

Wrong.  Again.  Remember I have actually _run_ on one.  Local memory has
a latency of X.  Memory one hop away has a latency of roughly 2X.  If you
have 4 cpus, each CPU has local memory, two processors one hop away, and one
that is 2 hops away.  The latency for 2 hops is about 3x.  And _those_
numbers are not made up.  They were measured by me, and confirmed by AMD.
I'm not at liberty to exactly specify what X is, but it is not as low as you
think it is.
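
A rough way to see what those ratios mean for a 4-CPU box is sketched below.
Everything here is an assumption for illustration: the four nodes are modelled
as a ring (which gives each node two neighbours one hop away and one node two
hops away, as described above), latency is taken as exactly (hops + 1) times
X, and the function names are made up. X itself is left as a unit, since the
actual value is not given.

```python
def hops(src, dst, nodes=4):
    """Ring distance between two nodes (assumed ring topology)."""
    d = abs(src - dst) % nodes
    return min(d, nodes - d)


def relative_latency(src, dst, nodes=4):
    """Latency in units of X: local = 1, one hop = 2, two hops = 3."""
    return 1 + hops(src, dst, nodes)


def average_interleaved_latency(src, nodes=4):
    """Mean latency in units of X if accesses are spread evenly over all nodes."""
    return sum(relative_latency(src, n, nodes) for n in range(nodes)) / nodes


print([relative_latency(0, n) for n in range(4)])  # [1, 2, 3, 2]
print(average_interleaved_latency(0))              # 2.0
```

Under these assumptions, data spread evenly over all four nodes costs about 2X
per access on average, which is why keeping per-thread data local matters.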

>
>So your old 'smp' code when the cache line length gets taken into account will
>perform also a lot better than your current code. Of course some code that binds
>threads to a certain CPU is nice to have. That's as far as i know all Nalimov
>did for you, perhaps he wants to comment on that here...


No, we did more.  We allocate "split blocks" in local memory for each CPU and
then bind the thread using that memory to that processor.  We interleave the
hash memory over all CPUs.  Eugene wrote the Windows API code to actually
do the local and interleaved malloc()s; I made the changes to Crafty itself.
There were a few other changes as well, but mostly to Crafty and mainly having
to do with some cache efficiency issues.
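
The placement policy described above can be modelled in a few lines. This is
not Crafty's code and does not call any Windows API; it is a hypothetical
sketch of the two rules only. The page size, the entry size, the round-robin
page interleaving, and both function names are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def hash_entry_node(entry_index, entry_bytes, nodes, page_size=PAGE_SIZE):
    """Node holding a hash entry when table pages are interleaved round-robin."""
    page = (entry_index * entry_bytes) // page_size
    return page % nodes


def split_block_node(thread_cpu, cpu_to_node):
    """A thread's split blocks live on the node of the CPU it is bound to."""
    return cpu_to_node[thread_cpu]


# Four single-CPU nodes, 16-byte hash entries (both assumed for illustration).
cpu_to_node = {0: 0, 1: 1, 2: 2, 3: 3}
print(split_block_node(2, cpu_to_node))                             # 2
print([hash_entry_node(i, 16, 4) for i in (0, 255, 256, 512, 1024)])  # [0, 0, 1, 2, 0]
```

The point of the split is that per-thread search state is always a local
access, while the shared hash table, which every thread probes more or less at
random, is spread so that no single node's memory takes all the traffic.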


>
>Calling crafty NUMA is the biggest nonsense i have ever heard.

Simply because _you_ haven't looked at the code?  That's what I thought.
And no, it is _not_ finished yet.  But yes, it _does_ work well on the
AMD machine which is where we started.




>
>Here is what happens if you run on a big machine with 16+ CPUs and no
>shared memory bus.
>
>Note that a 2080 processor IBM opteron 2.0 Ghz is $10 million or in those
>ranges, a 1000 processor itanium2 is around $10 million and a 16 processor
>shared memory bus alpha is also $10 million.
>
>So taking shared memory bus machines here as example is not a good idea.
>
>If your thing would be numa it would run easily 100+ processors.
>
>Crafty will hammer with all cpu's onto the same datastructure of course
>allocated at cpu0, globally locking everything.

I don't do that.  I never did do that.  You can keep saying it, but that
won't ever make it true.

>
>So the more processors you add the more it will die.
>
>At the origin3800 or altix3000 you can easily see this with crafty when running
>interactive, because when too much gets locked, the total system time eaten
>isn't close to X when running at X processors.
>
>It dies simply :)


So, it has _never_ been modified to use such crummy machines.  At present,
it is working reasonably on Windows machines _only_.  I am working on the
Linux changes that match what we did for Windows.  But I'm not working on
an IRIX version, I have not worked on an IRIX version, so you can keep
talking about IRIX all you want but it simply doesn't matter...

And when you _finally_ let it sink in that Crafty has not been tweaked for
IRIX NUMA, you will _finally_ understand why your test is meaningless.  It
doesn't have any IRIX code in it, get it?  BTW did I mention that I have
_never_ tried to tweak this for IRIX NUMA?  Did I mention that I have no plans
at the present to support IRIX NUMA?  Oh yes, don't forget that Crafty does
_not_ work well on IRIX NUMA boxes.

Did I repeat that enough?  Probably not...