Computer Chess Club Archives



Subject: Re: MP system info

Author: Vincent Diepeveen

Date: 13:34:12 05/28/02



On May 28, 2002 at 15:42:33, Robert Hyatt wrote:

>On May 28, 2002 at 14:36:36, Vincent Diepeveen wrote:
>
>>On May 28, 2002 at 12:53:44, Robert Hyatt wrote:
>>
>>>On May 28, 2002 at 10:06:42, Vincent Diepeveen wrote:
>>>
>>>>On May 28, 2002 at 09:06:36, K. Burcham wrote:
>>>>
>>>>for computerchess that is way too optimistic Kim.
>>>>
>>>>programs like Cray Blitz or DIEP might do pretty well at
>>>>8 processors, but crafty, fritz, sos, shredder, patzer,
>>>>junior and these programs
>>>>scale pretty bad at 8 processors.
>>>
>>>
>>>What on earth are you talking about when you mention Crafty?  I have
>>>run crafty on 16 cpu machines and it works just as well as it does on
>>>4...  From actual testing, not "speculation".
>>
>>i'm talking about worst case speedup.
>>
>>*not* average case or best case. We both know that some
>>testset positions you could get 100 times speedup with some luck
>>on a 16 processor.
>>
>
>happens _very_ rarely.  In fact, in my test set that I use I _never_
>get a speedup beyond 4.0 on my quad.   I weed the oddball positions out.
>I also am not aware of any positions where I get a "speedup < 2.0" on a
>quad either.  Perhaps one exists.  But one outlying data point is not
>very interesting.  It is the "usual" performance that I try to worry
>about.

Instead of that test set you should use a bunch of positions
from Crafty's world championship games, like the one against Junior.
I still do not understand how it managed to blow that game.

A pawn up!

Most likely the score went down there. Though it was only on 2 processors,
the same principle applies.

>
>
>
>>>
>>>>
>>>>bandwidth is not the issue here. speedup is the issue here.
>>>>
>>>
>>>
>>>For the 8-way machines, bandwidth _is_ the issue.  4-way boxes use
>>>4-way interleaving to provide enough memory bandwidth for 4 cpus.
>>>8-way boxes lose in two ways.  (1) they still use 4-way memory
>>>interleaving;  (2) the cache coherency hardware treats the machine as
>>>two "clusters" of 4 cpus, making "inter-cluster" cache coherency less
>>>efficient than on the 4-way clusters...
>>
>>i didn't know that about 8 processor machines. that makes 8 processor
>>machines even more interesting.
>
>Look for details on the "fusion" chip-set...  the docs explain how this
>kludge was accomplished.
>
>
>
>>
>>>
>>>
>>>>If you split at random like most of these programs do, then
>>>>you have simply major speedup problems soon.
>>>>
>>>>In case of patzer a big issue is that it is tactical extending
>>>>a lot, so the search space is not identical (i don't even
>>>>know whether it runs at 8 processors).
>>>>
>>>>So where crafty gets 1.7 speedup at 2 processors and like 2.5 speedup
>>>>at 4 processors at crucial moments (when score drops a little) in
>>>>the game, the speedup for these random
>>>>splitting programs is very horrible at 8 processors;
>>>
>>>Crafty runs consistently at > 3.0 speedup at 4 processors.  I have posted
>>>the data for several positional/tactical test positions that clearly proves
>>>this.
>>>
>>>Creating numbers out of the clear blue simply is not productive.
>>
>>you can see it even watching the whisper from crafty.
>>some critical positions it gets 11 ply, then all the other positions
>>it gets 14 to 15 ply search depth. Everyone who watches
>>the whispers can do the math already.
>>
>
>
>Depth has nothing to do with speedup.  I can run with 1 cpu and
>see 3 consecutive 12 ply searches followed by one 10 ply search.
>
>Wrong data to look at...
>
>Of course, in a game it is worse.  If it predicts correctly and you think
>for a long time, of course it will go deeper.  If it predicts incorrectly
>and starts from scratch, it will not go as deep.
>
>Too many variables to draw conclusions based only on search depth.  Better
>to pick the questionable positions and run 'em with 1 and 4 processors to
>see how things work.
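
The comparison suggested above (run the same positions with 1 and with N processors) can be sketched as follows. This is a minimal illustration, not code from Crafty or DIEP: the function name `speedup_summary` and its inputs are hypothetical, and the timings are whatever the tester measures for a fixed search depth per position.

```python
def speedup_summary(times_1cpu, times_ncpu):
    """Per-position speedup = time on 1 CPU / time on N CPUs.

    Both lists hold wall-clock seconds to reach the same depth
    on the same positions, in the same order (assumed inputs).
    Returns (per-position speedups, worst case, arithmetic mean).
    """
    if len(times_1cpu) != len(times_ncpu) or not times_1cpu:
        raise ValueError("need two non-empty lists of equal length")
    speedups = [t1 / tn for t1, tn in zip(times_1cpu, times_ncpu)]
    worst = min(speedups)                  # the "worst case" under dispute
    mean = sum(speedups) / len(speedups)   # the "usual" performance
    return speedups, worst, mean
```

Reporting both the minimum and the mean keeps the two sides of the argument apart: one is about the worst single position, the other about typical behaviour over the set.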
>
>
>
>
>>>
>>>>
>>>>in some positions you get 10 times speedup, in other positions a
>>>>2 times speedup. When you need the speedup you don't get it.
>>>
>>>Perhaps not all programs behave this badly?
>>
>>Crafty does, but that's logical. you split *at random*.
>
>I don't split "at random".  Young Brothers Wait is not "at random" at
>all.  A move (or more) has/have already been searched without a fail-high,
>which means that the rest will very likely have to be searched as well as
>this is probably a fail low node.
>
>No different than what I did in Cray Blitz; in actuality, the two approaches
>seem very close.
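
The split condition described in the quoted paragraph can be written down as a small predicate. This is a sketch under stated assumptions, not Crafty's or Cray Blitz's actual code: the `Node` fields and the `can_split` name are hypothetical, and real programs add further conditions (remaining depth, idle processors) that are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    moves_searched: int = 0    # moves fully searched at this node so far
    failed_high: bool = False  # True once any move caused a beta cutoff
    moves_remaining: int = 0   # moves not yet searched


def can_split(node, min_searched=1):
    """Young Brothers Wait style test (simplified).

    Allow a parallel split only after at least `min_searched` moves
    were searched serially without a fail-high, since the remaining
    moves then very likely all have to be searched anyway.
    """
    return (node.moves_searched >= min_searched
            and not node.failed_high
            and node.moves_remaining > 0)
```

The predicate only expresses "wait for the eldest brother"; it cannot guarantee that a later move will not fail high, which is the unavoidable wasted work both posters refer to.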
>
>
>> So
>>if a position goes bad, you split bad. And if you split bad,
>>chance is statistically higher you again split worse and worse.
>
>You are going to "split bad" no matter what you do, at times.  Because
>alpha/beta is purely serial in nature.
>
>
>
>>
>>If it goes great, then you have a statistic chance it goes even
>>better with the splitting.
>>
>>>
>>>>
>>>>Anyway this is all theoretic discussion. I am pretty sure chessbase
>>>>doesn't want to buy a 8 way Xeon system, even though they can afford
>>>>the $100k easily.
>>>>
>>>>With regard to memory i need to mention that memory is faster on
>>>>these systems than at our slow dual systems (with respect to memory),
>>>>memory goes in parallel at the big machines, it doesn't at dual
>>>>machines.
>>>
>>>However, the 4-way and 8-way boxes share the _same_ memory system.
>>
>>I can imagine that at a quad, where diep also has *zero* latency
>>problems with the memory, that the problems are way smaller than
>>at a dual 2 Ghz K7.
>
>
>
>You don't have _zero_ problems.  The multi-cpu interference is real
>and measurable.  A second CPU interferes with the first and causes a
>7% drop in performance for both cpus.  4 cpus result in a 20%+ drop in
>performance for all 4 cpus.  It is because of memory bandwidth.
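
As a worked check of the figures quoted above, the per-CPU slowdown translates into a ceiling on raw throughput. The helper name `effective_cpus` is hypothetical, and the calculation assumes every CPU slows down by the same fraction; it says nothing about search overhead, only about hardware throughput.

```python
def effective_cpus(n_cpus, per_cpu_drop):
    """Raw throughput in single-CPU equivalents when each of the
    n_cpus runs `per_cpu_drop` (a fraction, e.g. 0.07) slower."""
    return n_cpus * (1.0 - per_cpu_drop)


# Figures from the post: 7% drop with 2 CPUs, 20% (or more) with 4.
dual = effective_cpus(2, 0.07)   # about 1.86 CPUs worth
quad = effective_cpus(4, 0.20)   # about 3.2 CPUs worth, at best
```

Since the quoted drop for four CPUs is "20%+", the quad figure is an upper bound under these assumptions.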
>
>
>
>>
>>>
>>>>
>>>>Best regards,
>>>>Vincent
>>>>
>>>>>In absolute terms, the 8-way Pentium 3 Xeon systems are only 44% faster than the
>>>>>4-way ones, which means that with the 4 extra CPUs, the system only gets 1.76
>>>>>CPUs worth of extra performance, which is poor value for money. This level of
>>>>>scalability is not that surprising since each group of 4 CPUs share 0.8GByte/s
>>>>>of memory bandwidth. As a side note, it seems likely though that 252.eon fits
>>>>>almost perfectly into the 2MByte cache the Pentium 3 Xeons have as it gets
>>>>>nearly linear scalability - the higher the cache hit rate, the less main memory
>>>>>is needed, which leaves more for the other CPUs.
>>>>>
>>>>>Even worse, in some tests, the 8-way system actually does worse than the 4-way
>>>>>system, and this could possibly be due to differences in the chipsets or because
>>>>>the extra contention itself on the shared Pentium system bus causes efficiency
>>>>>to drop. It's unlikely that the compilers/OS would have made much difference as
>>>>>for each CPU type the tests were done at similar times with the same compilers.
>>>>>
>>>>>
>>>>>http://www.aceshardware.com/read.jsp?id=45000338
>>>>>
>>>>>kburcham
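
The 1.76 figure in the quoted article follows from simple arithmetic, assuming (as the article implicitly does) that the 4-way system counts as 4 CPUs worth of performance. The function name below is hypothetical.

```python
def extra_cpu_equivalents(base_cpus, relative_gain):
    """CPUs worth of performance gained when a larger system is
    `relative_gain` (a fraction, e.g. 0.44) faster than a system
    that is taken to deliver `base_cpus` worth of performance."""
    return base_cpus * relative_gain


gained = extra_cpu_equivalents(4, 0.44)  # about 1.76 CPUs from 4 extra CPUs
efficiency = gained / 4                  # about 0.44 per added CPU
```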




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.