Computer Chess Club Archives



Subject: Re: New intel 64 bit ?

Author: Robert Hyatt

Date: 07:45:27 07/07/03



On July 06, 2003 at 16:14:43, Vincent Diepeveen wrote:

>On July 04, 2003 at 23:33:46, Robert Hyatt wrote:
>
>>On July 03, 2003 at 20:13:06, Vincent Diepeveen wrote:
>>
>>>On July 03, 2003 at 18:15:01, Robert Hyatt wrote:
>>>
>>>>On July 03, 2003 at 16:50:29, Chris Hull wrote:
>>>>
>>>>>On July 03, 2003 at 13:03:13, Robert Hyatt wrote:
>>>>>
>>>>>>On July 03, 2003 at 05:51:51, Russell Reagan wrote:
>>>>>>
>>>>>>>On July 03, 2003 at 05:31:15, Tony Werten wrote:
>>>>>>>
>>>>>>>>http://www.digitimes.com/NewsShow/Article.asp?datePublish=2003/07/01&pages=02&seq=3
>>>>>>>>
>>>>>>>>Tony
>>>>>>>
>>>>>>>Interesting news. Some things the article says make me think this is
>>>>>>>nothing to get excited about.
>>>>>>>
>>>>>>>"targeting the high-priced, back-end server market" - This makes me think
>>>>>>>"nothing new here, the Itanium has been out of the price range of everyone for
>>>>>>>years anyway." I can't imagine them competing with the Opteron (much less
>>>>>>>Athlon64) if they can't come way down in price.
>>>>>>>
>>>>>>>It says something about a lower-end CPU for workstations, but the way they put
>>>>>>>it (maybe it's just the writer) makes it sound (to me) like the high-end
>>>>>>>Itanium will still be significantly more than the Opteron, and the low-end
>>>>>>>Itanium will still be significantly more than the Athlon64, and that the
>>>>>>>really-low-end Xeon might be in the price range of the Opteron.
>>>>>>>
>>>>>>>
>>>>>>>"Intel servers containing eight to 128 Itanium processors..."
>>>>>>>
>>>>>>>So Bob, what is the expected speedup of Crafty on a 128-Itanium machine? :)
>>>>>>
>>>>>>
>>>>>>Hard to say since it is a NUMA type machine.  There are lots of issues
>>>>>>there.
>>>>>
>>>>>Ok, this begs the question: "Can Crafty be made to work on a NUMA-type cluster?
>>>>>How about on a message-passing cluster using PVM or MPI?" Not just made to
>>>>>work, but to actually see SMP-like speedups for 4/8/16/32/64-node clusters.
>>>>>
>>>>>Chris
>>>>
>>>>
>>>>The answer to both is "yes".
>>>>
>>>>NUMA is a problem, but it is solvable.  The problem is that the current
>>>>way of allocating "split blocks" is not good for NUMA machines.  A NUMA
>>>>machine _really_ wants its often-accessed data to be in its local memory,
>>>>and I don't have any way of forcing that at the moment.  It would not be
>>>>terribly difficult to change it, by allocating a bunch of split blocks on
>>>>each CPU/local-memory, and then ensuring that the right split block is used
>>>>for the right processor.  But on an SMP box, this is moot, so it was not done
>>>>in the original design.
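
A minimal sketch of that per-node idea, assuming the libnuma interface for
illustration (this is not Crafty's actual code):

  /* Illustrative sketch only, not Crafty's code: give each NUMA node
     its own pool of split blocks, so a processor pinned to node n
     always works out of local memory.  Link with -lnuma. */
  #include <numa.h>
  #include <stdio.h>

  #define BLOCKS_PER_NODE 16
  #define BLOCK_SIZE      4096    /* stand-in for a real split block */

  static void *pool[64];          /* one pool per NUMA node */

  int main(void) {
    int n, nodes;
    if (numa_available() < 0) {
      fprintf(stderr, "NUMA not supported here\n");
      return 1;
    }
    nodes = numa_num_configured_nodes();
    for (n = 0; n < nodes; n++)   /* allocate each pool on its own node */
      pool[n] = numa_alloc_onnode((size_t) BLOCKS_PER_NODE * BLOCK_SIZE, n);
    /* a searcher pinned to node n would now take split blocks only
       from pool[n], keeping its hot data in local memory */
    for (n = 0; n < nodes; n++)
      numa_free(pool[n], (size_t) BLOCKS_PER_NODE * BLOCK_SIZE);
    return 0;
  }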
>>>>
>>>>Clustering is harder, since suddenly there is no shared memory at all,
>>>
>>>He asked about the MPI library, which *means* not using shared memory.
>>>
>>
>>I believe I said that.
>>
>>
>>>Also, all those Itanium things are sold as 'clusters'.
>>>
>>>Latency on the good old Origin3800 with MPI is even way better than on the new
>>>SGI Altix3000 with Madisons and MPI.
>>>
>>>That's weird, because the design looks OK to me.
>>>
>>>Then please consider that this Altix is kicking butt compared to other Itanium
>>>clusters on that over-praised TPC benchmark.
>>>
>>>>which changes both the overall structure of the program as well as the
>>>>underlying assumption that "it is easy to do a quick parallel search and
>>>>get a result back" because network latency suddenly turns something quick
>>>>into something with a significant latency.
>>>>
>>>>SMP-like speedups are likely not possible for chess, because of the way the
>>>>alpha/beta algorithm is built around sequential searching.  But reasonable
>>>>speedup is definitely possible.  Who cares if 1000 processors is only 100X
>>>>faster?  100X is _way_ faster.
>>>
>>>How to get 100x faster out of 1000 cpu's with MPI?
>>>
>>>Please tell me. It is not so trivial.
>>
>>I don't know whether it is trivial or not; I have not yet tried it with
>>chess.  But getting 10% efficiency seems to be doable, based on results
>>over the years by Schaeffer.  10% is lousy for SMP, particularly since
>>SMP doesn't offer many CPUs.  But 10% would be acceptable for a cluster,
>>since it can be arbitrarily large.
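
To make the arithmetic explicit: speedup = efficiency x processor count, so
10% efficiency on a 1000-node cluster is still an effective 100 processors,
a speedup no SMP box can approach.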
>
>Let me address a number of points:
> - Schaeffer never used APHID with more than 16 processors

So?  _I_ didn't mention APHID at all.  I was thinking of "Sun Phoenix", which
ran on at least 20 machines in a few events, and it did not use APHID.  You
_can_ do distributed computing with sockets and TCP/IP.
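
As a rough sketch of what that takes (hypothetical code, not Sun Phoenix's),
a master can ship a work unit to a slave over a plain TCP socket and block
for the score:

  /* Hypothetical sketch: a master hands one search job to a slave
     over TCP and waits for the score.  Assumes both ends share byte
     order and struct layout; error handling trimmed to the minimum. */
  #include <stdio.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>

  struct job {
    int  depth;                 /* requested search depth */
    int  alpha, beta;           /* window to search       */
    char fen[128];              /* position to search     */
  };

  int send_job(const char *host, int port, struct job *j, int *score) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { 0 };
    a.sin_family = AF_INET;
    a.sin_port   = htons(port);
    inet_pton(AF_INET, host, &a.sin_addr);
    if (connect(s, (struct sockaddr *) &a, sizeof a) < 0) return -1;
    write(s, j, sizeof *j);             /* ship the work unit...  */
    read(s, score, sizeof *score);      /* ...block on the result */
    close(s);
    return 0;
  }

The catch is exactly the latency issue discussed above: each of those round
trips costs real time on a network, so the jobs have to be big enough to
amortize it.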


> - his APHID showed a branching factor of around 10.0 or so in chess

Again, so?


> - my draughts program Napoleon (draughts is also called international 10x10
>checkers) plays on a 10x10 board with more possibilities and with 20 pieces, and
>is nearly outsearching Schaeffer's Chinook. Note that Chinook plays on a small
>8x8 board with simpler rules than draughts, namely checkers rules and 12 pieces.
> - the simple checkers game is trivially dominated by every search line ending
>in EGTBs, unlike chess and unlike draughts. 8x8 American checkers is not
>interesting at all. There is no money to earn there. There never was. There was
>just one good player in history, Marion Tinsley. Just read Schaeffer's book to
>know more about it.
>
>I will be the last to say that in its time Schaeffer did bad research. On the
>contrary: if we look at the moment in history when Schaeffer said these things,
>he definitely did very positive work.
>
>However, it would be an illusion to expect that an algorithm that has problems
>scaling, that certainly does not maintain the <= 3.0 branching factor which
>chess programs have today, and that was designed around little overhead while
>suffering big branching factor problems, is going to work, or even give a 10%
>speedup, at 480 or 500 processors.

The branching factor of today has nothing to do with chess programs of the
past, and their performance/scalability.  I'm doing OK with a branching factor
of 3.  I did OK with a branching factor of 6-10.  In fact, I don't see any real
difference at all in terms of parallel search vs. branching factor.
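
To put numbers on that: tree size grows roughly as b^d, so the branching
factor decides how deep you get for a fixed node budget, not how the tree
is split.  A throwaway illustration:

  /* Back-of-envelope only: nodes ~ b^d for branching factor b at
     depth d.  b = 3 vs. b = 8 changes reachable depth enormously,
     but a parallel split works the same way in either tree. */
  #include <stdio.h>
  #include <math.h>

  int main(void) {
    int d;
    for (d = 4; d <= 12; d += 4)
      printf("depth %2d:  b=3 -> %10.0f nodes   b=8 -> %14.0f nodes\n",
             d, pow(3.0, d), pow(8.0, d));
    return 0;
  }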

>
>Note that Schaeffer claims a way better speedup at 16 processors, somewhere
>above 10 if I remember well. This on a cluster.

A cluster using Ethernet...


>
>Schaeffer was not an idiot, in the sense that he was not cheating the people
>reading his papers. He explicitly writes that his algorithm only works well
>when each processor gets to do equal work. If you use a hashtable, then you
>get cutoffs here and there, and then the algorithm has problems balancing the
>tree. Not to mention when you use nullmove: then the algorithm will die in
>advance.

Better read some of his papers.  He discussed shared transposition tables
multiple times in papers on Sun Phoenix.
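
For flavor only, a generic scheme (not necessarily Phoenix's exact one):
distributed programs typically share just the deep entries, since shipping
shallow ones costs more in latency than they save in search:

  /* Generic sketch, not Phoenix's code: keep every hash entry
     locally, but broadcast only entries from deep searches, the
     ones worth a network round trip.  broadcast_to_cluster() is a
     placeholder for a real transport. */
  #define MIN_SHARE_DEPTH 6
  #define TABLE_BITS      20

  struct hash_entry {
    unsigned long long key;       /* Zobrist signature     */
    short score;
    unsigned char depth, flags;   /* draft and bound type  */
  };

  static struct hash_entry table[1 << TABLE_BITS];

  static void store_local(struct hash_entry *e) {
    table[e->key & ((1 << TABLE_BITS) - 1)] = *e;  /* always-replace */
  }

  static void broadcast_to_cluster(struct hash_entry *e) {
    (void) e;                     /* placeholder: send over the wire */
  }

  void store_hash(struct hash_entry *e) {
    store_local(e);
    if (e->depth >= MIN_SHARE_DEPTH)  /* deep = worth sharing */
      broadcast_to_cluster(e);
  }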


>
>The principle is very easy: one master creates jobs for slaves. Each slave
>then searches its job and gives the result back to the master. So the master
>is a central process that maintains the entire tree.

Sun Phoenix didn't search like that.  It was very dynamic, and similar to
Cray Blitz except that it used distributed stuff that hurt overall performance.

>
>That is trivially not going to work on a big supercomputer.
>
>Imagine maintaining, at a *central* point of a supercomputer, the search tree
>for *every* processor. This would be *sick*.


It would be "sick" for any large-scale parallel box, but in this context,
"so what"??  It isn't the only way to do parallel searches.



>
>Another problem is the reaction time.
>
>Let's use a practical example.
>
>Take Crafty on a 64-processor Itanium2 machine, or even a 64-processor Cray
>machine (which in the future will use AMD Opteron processors, in case you
>haven't read the latest statements from the Cray company very well; after all,
>Hypertransport gives a huge bandwidth, more than today's supercomputers
>PRACTICALLY have in fact). Crafty gets 1.5 million nodes a second per processor.

You are simply wrong about the Cray.  You are looking at the _wrong_ type
of box.  Hypertransport will _never_ beat the last true Cray T90 in terms
of bandwidth.  A single T90 processor can transfer 48 bytes per nanosecond,
reading four 64-bit values and writing two 64-bit values.  The Opteron can't
touch that.  And the Cray could do it simultaneously on 32 processors.
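
(The arithmetic: six 64-bit words is 6 x 8 = 48 bytes; at 48 bytes per
nanosecond that is roughly 48 GB/s of sustained memory bandwidth per
processor, or on the order of 1.5 TB/s aggregate across all 32.)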


>
>That is for 64 processors.
>
>Suppose 1 master gives out commands that steer each processor. What search depth
>do we take?

Who cares?  Nobody would write code like that for significant numbers of
processors.  Not me, not Jonathan.

>
>5 ply?
>
>12 ply?
>
>Now suppose that you give 40 processors 40 jobs of 12 ply each. After a while
>the first processor gives a cutoff.
>
>The other 39 processors then did their work for nothing.
>
>That's why YBW is superior to everything in parallel search nowadays.
>
>That's why Crafty uses YBW and not APHID. That's why DIEP uses YBW.
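
For the record, the YBW principle fits in a few lines.  An illustrative
skeleton only (not DIEP's or Crafty's code; the engine helpers are
hypothetical placeholders):

  /* Young Brothers Wait, illustrative skeleton.  Position, Move and
     the helper functions below are placeholders, not real engine code. */
  typedef struct position Position;
  typedef int Move;

  int  generate_moves(Position *p, Move *list);
  void do_move(Position *p, Move m);
  void undo_move(Position *p, Move m);
  int  evaluate(Position *p);
  int  idle_processors(void);
  int  split(Position *p, Move *rest, int n, int alpha, int beta,
             int depth, int best);

  int search(Position *p, int alpha, int beta, int depth) {
    Move list[256];
    int n, i, score, best = -32767;

    if (depth == 0) return evaluate(p);
    n = generate_moves(p, list);

    for (i = 0; i < n; i++) {
      do_move(p, list[i]);
      score = -search(p, -beta, -alpha, depth - 1);
      undo_move(p, list[i]);
      if (score >= beta) return score;  /* cutoff: siblings never searched */
      if (score > best)  best = score;
      if (score > alpha) alpha = score;

      /* YBW: only after the eldest brother has been searched without
         a cutoff may idle processors pick up the remaining moves.  */
      if (i == 0 && idle_processors())
        return split(p, list + 1, n - 1, alpha, beta, depth, best);
    }
    return best;
  }

The point Vincent makes holds in this sketch: siblings are handed out only
after the first move has proven the node needs full width, so a cutoff wastes
at most one processor's work rather than 39.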
>
>>
>>
>>
>>
>>>
>>>I got 500 processors with OpenMP, way better than MPI. But it is still very
>>>hard to get a good speedup with. I have been working on it for nearly a year
>>>already. It is *not* trivial.
>>
>>Better look at how "OpenMP" is implemented before saying that.  And it isn't
>>"way better than MPI".  Both use TCP/IP, just like PVM.  Except that MPI/OpenMP
>>is designed for homogeneous clusters while PVM works with heterogeneous mixes.
>>But for any of the above, the latency is caused by TCP/IP, _not_ the particular
>>library being used.
>>
>>There are alternatives, such as those used on NUMA clusters, where TCP/IP
>>is not used, and the backplane of the architecture is used to transport the
>>messages.  But if OpenMP beats MPI, it only means that someone hasn't ported
>>MPI very well to the platform you are using.  Or it may mean OpenMP is using
>>the hardware you have while MPI is still using TCP/IP, even if it goes over the
>>backplane.
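
One way to see where the latency really comes from: a standard MPI ping-pong
between two ranks times whatever transport the library actually sits on
(a minimal sketch, assuming a run with at least two ranks):

  /* Measure raw round-trip latency between two MPI ranks; whatever
     the MPI library sits on (TCP/IP, shared memory, or a NUMA
     backplane) is what you are really timing here. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, i, reps = 10000;
    char byte = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
      if (rank == 0) {
        MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
      } else if (rank == 1) {
        MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    if (rank == 0)
      printf("round trip: %.1f usec\n", (MPI_Wtime() - t0) / reps * 1e6);
    MPI_Finalize();
    return 0;
  }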
>>
>>
>>
>>
>>
>>>
>>>Best regards,
>>>Vincent


