Author: Vincent Diepeveen
Date: 13:34:12 05/28/02
On May 28, 2002 at 15:42:33, Robert Hyatt wrote:

>On May 28, 2002 at 14:36:36, Vincent Diepeveen wrote:
>
>>On May 28, 2002 at 12:53:44, Robert Hyatt wrote:
>>
>>>On May 28, 2002 at 10:06:42, Vincent Diepeveen wrote:
>>>
>>>>On May 28, 2002 at 09:06:36, K. Burcham wrote:
>>>>
>>>>for computerchess that is way too optimistic Kim.
>>>>
>>>>programs like Cray Blitz or DIEP might do pretty well at
>>>>8 processors, but crafty, fritz, sos, shredder, patzer,
>>>>junior and these programs
>>>>scale pretty bad at 8 processors.
>>>
>>>
>>>What on earth are you talking about when you mention Crafty? I have
>>>run crafty on 16 cpu machines and it works just as well as it does on
>>>4... From actual testing, not "speculation".
>>
>>i'm talking about worst case speedup.
>>
>>*not* average case or best case. We both know that some
>>testset positions you could get 100 times speedup with some luck
>>on a 16 processor.
>>
>
>happens _very_ rarely. In fact, in my test set that I use I _never_
>get a speedup beyond 4.0 on my quad. I weed the oddball positions out.
>I also am not aware of any positions where I get a "speedup < 2.0" on a
>quad either. Perhaps one exists. But one outlying data point is not
>very interesting. It is the "usual" performance that I try to worry
>about.

Instead of that test set you should use a bunch of positions from Crafty's
games in world championships, like the one against Junior. I still do not
understand how it managed to blow this. A pawn up! Most likely the score
went down there. Though it was only 2 processors, the same lemma applies.

>
>>>
>>>>
>>>>bandwidth is not the issue here. speedup is the issue here.
>>>>
>>>
>>>
>>>For the 8-way machines, bandwidth _is_ the issue. 4-way boxes use
>>>4-way interleaving to provide enough memory bandwidth for 4 cpus.
>>>8-way boxes lose in two ways. (1) they still use 4-way memory
>>>interleaving; (2) the cache coherency hardware treats the machine as
>>>two "clusters" of 4 cpus, making "inter-cluster" cache coherency less
>>>efficient than on the 4-way clusters...
>>
>>i didn't know that from 8 processor machines. that makes 8 processor
>>machines even more interesting.
>
>Look for details on the "fusion" chip-set... the docs explain how this
>kludge was accomplished.
>
>>
>>>
>>>
>>>>If you split at random like most of these programs do, then
>>>>you have simply major speedup problems soon.
>>>>
>>>>In case of patzer a big issue is that it is tactical extending
>>>>a lot, so the search space is not identical (i don't even
>>>>know whether it runs at 8 processors).
>>>>
>>>>So where crafty gets 1.7 speedup at 2 processors and like 2.5 speedup
>>>>at 4 processors at crucial moments (when score drops a little) in
>>>>the game, there the speedup at 8 processors for these random
>>>>splitting programs is very horrible at 8 processors;
>>>
>>>Crafty runs consistently at > 3.0 speedup at 4 processors. I have posted
>>>the data for several positional/tactical test positions that clearly proves
>>>this.
>>>
>>>Creating numbers out of the clear blue simply is not productive.
>>
>>you can see it even watching the whisper from crafty.
>>some critical positions it gets 11 ply, then all the other positions
>>it gets 14 to 15 ply search depth. Everyone who watches
>>the whispers can do the math already.
>>
>
>
>Depth has nothing to do with speedup. I can run with 1 cpu and
>see 3 consecutive 12 ply searches followed by one 10 ply search.
>
>Wrong data to look at...
>
>Of course, in a game it is worse. If it predicts correctly and you think
>for a long time, of course it will go deeper. If it predicts incorrectly
>and starts from scratch, it will not go as deep.
>
>Too many variables to draw conclusions based only on search depth. Better
>to pick the questionable positions and run 'em with 1 and 4 processors to
>see how things work.
>
>
>>>
>>>>
>>>>in some positions you get 10 times speedup, in other positions a
>>>>2 times speedup. When you need the speedup you don't get it.
>>>
>>>Perhaps not all programs behave this badly?
>>
>>Crafty does, but that's logical. you split *at random*.
>
>I don't split "at random". Young Brother's wait is not "at random" at
>all. A move (or more) has/have already been searched without a fail-high,
>which means that the rest will very likely have to be searched as well as
>this is probably a fail low node.
>
>No different that what I did in Cray Blitz, in actuality, the two approaches
>seem very close.
>
>
>> So
>>if a position goes bad, you split bad. And if you split bad,
>>chance is statistically higher you again split worse and worse.
>
>You are going to "split bad" no matter what you do, at times. Because
>alpha/beta is purely serial in nature.
>
>
>>
>>If it goes great, then you have a statistic chance it goes even
>>better with the splitting.
>>
>>>
>>>>
>>>>Anyway this is all theoretic discussion. I am pretty sure chessbase
>>>>doesn't want to buy a 8 way Xeon system, even though they can afford
>>>>the $100k easily.
>>>>
>>>>With regard to memory i need to mention that memory is faster on
>>>>these systems than at our slow dual systems (with respect to memory),
>>>>memory goes in parallel at the big machines, it doesn't at dual
>>>>machines.
>>>
>>>However, the 4-way and 8-way boxes share the _same_ memory system.
>>
>>I can imagine that at a quad, where diep also has *zero* latency
>>problems with the memory, that the problems are way smaller than
>>at a dual 2 Ghz K7.
>
>
>
>You don't have _zero_ problems. The multi-cpu interference is real
>and measurable. A second CPU interferes with the first and causes a
>7% drop in performance for both cpus. 4 cpus result in a 20%+ drop in
>performance for all 4 cpus. It is because of memory bandwidth.
>
>
>>
>>>
>>>>
>>>>Best regards,
>>>>Vincent
>>>>
>>>>>In absolute terms, the 8-way Pentium 3 Xeon systems are only 44% faster than the
>>>>>4-way ones, which means that with the 4 extra CPUs, the system only gets 1.76
>>>>>CPUs worth of extra performance, which is poor value for money. This level of
>>>>>scalability is not that surprising since each group of 4 CPUs share 0.8GByte/s
>>>>>of memory bandwidth. As a side note, it seems likely though that 252.eon fits
>>>>>almost perfectly into the 2MByte cache the Pentium 3 Xeons have as it gets
>>>>>nearly linear scalability - the higher the cache hit rate, the less main memory
>>>>>is needed, which leaves more for the other CPUs.
>>>>>
>>>>>Even worse, in some tests, the 8-way system actually does worse than the 4-way
>>>>>system, and this could possibly be due to differences in the chipsets or because
>>>>>the extra contention itself on the shared Pentium system bus causes efficiency
>>>>>to drop. It's unlikely that the compilers/OS would have made much difference as
>>>>>for each CPU type the tests were done at similar times with the same compilers.
>>>>>
>>>>>
>>>>>http://www.aceshardware.com/read.jsp?id=45000338
>>>>>
>>>>>kburcham
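
As a rough check on the hardware figures quoted in this thread, here is a small Python sketch. It assumes the quoted article counts the 4-way box as exactly 4 CPUs' worth of throughput, and it treats the 7% and 20% interference figures as a uniform slowdown on every CPU. The function names are made up for illustration and nothing here is measured data.

```python
def extra_cpu_equivalents(base_cpus, relative_gain):
    """CPU-equivalents gained when a bigger box is `relative_gain` faster
    than a box assumed to deliver exactly `base_cpus` worth of throughput."""
    return base_cpus * (1.0 + relative_gain) - base_cpus


def effective_cpus(cpus, per_cpu_drop):
    """Aggregate throughput in single-CPU units if every CPU slows down
    by the fraction `per_cpu_drop` because of memory contention."""
    return cpus * (1.0 - per_cpu_drop)


if __name__ == "__main__":
    # 8-way Xeon quoted as 44% faster than the 4-way box.
    print(round(extra_cpu_equivalents(4, 0.44), 2))
    # Interference figures quoted in the thread: 7% at 2 cpus, 20% at 4 cpus.
    print(round(effective_cpus(2, 0.07), 2))
    print(round(effective_cpus(4, 0.20), 2))
```

Under these assumptions the first value matches the 1.76 extra CPUs from the article, and the interference figures work out to roughly 1.86 and 3.2 effective CPUs for a dual and a quad. Those are hardware throughput ceilings only. The search speedup argued about above sits below them, because parallel alpha/beta adds its own search overhead.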
Last modified: Thu, 15 Apr 21 08:11:13 -0700