Author: Robert Hyatt
Date: 10:20:09 01/30/03
On January 30, 2003 at 05:05:44, Vincent Diepeveen wrote:

>On January 29, 2003 at 23:31:19, Matthew Hull wrote:
>
>>On January 29, 2003 at 23:20:11, Vincent Diepeveen wrote:
>>
>>>On January 29, 2003 at 12:06:50, Robert Hyatt wrote:
>>>
>>>Bob let me explain to you. DIEP is written for machines which have a bit slower
>>>latency for global memory accesses. whereas the world champs 2002 version wasn't
>>>like that and would probably act like crafty on that 8 processor, the end of
>>>august 2002 versions and further are using a new type of parallellism which
>>>doesn't need much locking. Each processor takes care of itself without hurting
>>>bandwidth while searching too much.
>>>
>>>There is no dead slow global locks which is killing the 8 processor thing of
>>>course.
>>>
>>>therefore it works great for example at cc-NUMA machines and all types of Xeon
>>>machines.
>>
>>
>>Wow dude. Impressive. Could you supply some time-to-ply benchmarks for Diep on
>>8-way Xeon vis-a-vis 4-way Xeon. That would refute the proffessor like nothing
>>else.
>
>right now i have no system time at a 8 way Xeon. i do have numbers for 1 .. n
>processors at cc-NUMA SGI Origin 3800. Where i currently tested basically a lot
>up to 16 cpu's and plan to extend that with short tests to > 32 cpu's.
>
>It is not easy to get a good speedup with a lot of processors. But a good
>speedup is something completely else from a near to lineair speedup in nodes a
>second.
>
>Bob is complaining about his crafty not getting good nps a second. Well we
>already know that from the dual K7. single cpu 1 million nps or something and
>dual only 1.5 MLN nps there or something (forgive me a 100k nps more or less
>here; it just doesn't get a good sequential speedup even at it).

Vincent, your ignorance never ceases to amaze me. What part of "I ran 8 different
applications, including crafty" don't you understand?
What part of "I ran an application once, and then I ran it up to 8 times in
parallel" don't you understand???

>
>So it is trivial that crafty *won't* run well on 8 cpu Xeon not to mention a
>cc-NUMA machine.
>
>Extending that claim then to other software as well i find a very bad taste.

You would. Except for the small flaw that I actually _ran_ on the box, with
several different applications. You've never touched an 8-way xeon in your life,
and you start handwaving about CC-NUMA which has _nothing_ to do with the 8-way
xeon.

>
>Saying it is very very hard to write software that runs well on them is a
>completely different statement from that.

But that is not the statement I wrote. I said "I found the 8-way box to be only
about 1.5X faster than my quad using _several_ applications. Only one of which
(Crafty) was actually a parallel algorithm."

>
>Speedup at the machines is a different matter of course. 8 processors for diep
>is the breaking number where the very well division of the x86 algorithm is not
>working that great anymore and where the well working division of the many
>processor splitting is still not working well.
>
>Special tuning would be needed how to split at 8 cpu's when compared to 2..4
>IMHO.
>
>the way inwhich i split at x86 is working pretty ok at 8 cpu's but can be
>optimized further to avoid several milliseconds of cpu's getting idled.
>
>From my viewpoint optimizing the way DIEP splits is done for :
> 2 processors (x86 also works well for 4 processors)
>
>and n processors (n > 16). speedup at 16 processors currently not very good.
>of course that is the *absolute* speedup. No cheated toying like the Feldmann
>group who is focussing upon number of nodes searched instead of searchtimes.
>
>Of course the ideal way to get a 100% speedup with the feldmann way of measuring
>how good their algorithms are is a very bugged program that by accident can only
>use 1 cpu, so that will get a 100% speedup then always whereas in my model it is
>n times slower :)
>
>>Sincerely,
>>Matt
>>
>>
>>>
>>>Now you have some examples of software written for fast latency shared memory
>>>machines and then claim the thing is slower because the software isn't written
>>>for such types of machines?
>>>
>>>That already should give you the answer. Writing parallel programs is 1 thing.
>>>Writing something that works well without inventing numbers yourself is another
>>>thing.
>>>
>>>
>>>>On January 29, 2003 at 11:38:37, Vincent Diepeveen wrote:
>>>>
>>>>>On January 28, 2003 at 10:33:15, Robert Hyatt wrote:
>>>>>
>>>>>>On January 28, 2003 at 09:07:35, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On January 28, 2003 at 03:33:44, Mig Greengard wrote:
>>>>>>>
>>>>>>>>According to the tech I talked with, Amir and Shay were testing both machines
>>>>>>>>before the match to see which one they would use. To my knowledge it wasn't
>>>>>>>>decided until a day or two before the match. Obviously there isn't a big
>>>>>>>>difference in performance.
>>>>>>>>
>>>>>>>>Saludos, Mig
>>>>>>>>http://www.chessninja.com
>>>>>>>
>>>>>>>thanks.
>>>>>>>
>>>>>>>DIEP onto the 8 processor 1.6 would be running 16 processes and speed would
>>>>>>>be about expressed in K7:
>>>>>>> 8 x 1.6 Ghz / 1.4 = 9 Ghz
>>>>>>
>>>>>>
>>>>>>No it wouldn't. You haven't tried an 8-way intel box yet. It doesn't scale
>>>>>>nearly as well as the 2-way and 4-way intel boxes do. The chipset for
>>>>>>supporting 8 cpus is simply not very good...
>>>>>
>>>>>DIEP isn't demanding much bandwidth Bob in case you missed it, it works
>>>>>great on a cc-NUMA machine too.
>>>>
>>>>It demands _enough_ bandwidth. My comment wasn't only about "crafty" It was
>>>>about the 8-way boxes in general.
>>>>I ran on a dell 8450, with 8 700mhz xeon
>>>>processors, and it was about 1.5X faster than my box. And again, _not_ with
>>>>Crafty. I ran 8 copies of the same thing on the 8450, 4 copies on the quad,
>>>>and compared the total run times. The 8450 was only about 50% faster when it
>>>>should be 100% based on clock...
>>>>
>>>>
>>>>
>>>>>
>>>>>>The 8-way box using the same clock speed for the processors will only be about
>>>>>>1.5X faster than the 4-way box, and that doesn't count parallel search overhead
>>>>>>at all.
>>>>>
>>>>>That's not true. It's 8 times faster for good software. Of course there is
>>>>>algorithmic loss but there is no sequential loss unless the software sucks,
>>>>>to say it rude.
>>>>
>>>>Have you ever run on one? Of course not. I have. So your "that's not true"
>>>>is simply nonsense... There are _plenty_ of good benchmarks that can be used
>>>>to draw conclusions about the 8-way memory bottleneck problem.
>>>>
>>>>It _might_ be 8x faster if you can fit in the L2 cache (this machine had
>>>>2mb of L2 per processor compared to my 1mb on my quad 700). But if you have
>>>>any memory bandwidth at all, it has a problem. And a 8-probe hash table is
>>>>more than enough to highlight the problem.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>Doesn't say that it is easy to make software that can handle the latencies.
>>>>>
>>>>>It sure isn't easy to make a chessprogram that is having a good speedup
>>>>>(without a too big sequential loss first like Zugzwang which was slowed down
>>>>>first like 100 times or so in order to then have a decent speedup at like
>>>>>256 processors; 50% speedup even incredible much i would be *very* happy with
>>>>>around 15% already).
>>>>>
>>>>>But it is possible to make.
>>>>>
>>>>>DIEP is such a program that shows it can. DIEP runs like the sun on 8 cpu's
>>>>>(2 nodes quad SGI), even at the slowest partitions (slowest latency speeds
>>>>>are of course at the biggest partitions: 512 cpu partition).
>>>>>
>>>>>A 8 processor Xeon is hell for pc software like Fritz, Junior, Crafty, but it
>>>>>is very good for DIEP.
>>>>>
>>>>>Best regards,
>>>>>Vincent
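
The throughput comparison described in the quoted text above (8 copies of the same program on the 8-way box, 4 copies on the quad, total run times compared) can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the timings are hypothetical placeholders chosen to reproduce the reported ratio of about 1.5X, not measurements from the Dell 8450 or the quad, and the function names are invented for this sketch.

```python
def throughput(copies: int, wall_seconds: float) -> float:
    """Jobs finished per second when `copies` identical jobs run side by side."""
    if copies <= 0 or wall_seconds <= 0:
        raise ValueError("copies and wall_seconds must be positive")
    return copies / wall_seconds


def scaling_ratio(small_copies: int, small_seconds: float,
                  big_copies: int, big_seconds: float) -> float:
    """How many times more work per second the bigger box delivers."""
    return throughput(big_copies, big_seconds) / throughput(small_copies, small_seconds)


# HYPOTHETICAL timings, not taken from the thread: 4 copies on the quad
# finish in 300 s, 8 copies on the 8-way box finish in 400 s.
quad_seconds = 300.0
eight_way_seconds = 400.0

ratio = scaling_ratio(4, quad_seconds, 8, eight_way_seconds)
ideal = 8 / 4                 # same clock speed, twice the processors
efficiency = ratio / ideal    # fraction of the ideal scaling actually achieved

print(f"8-way vs quad throughput: {ratio:.2f}x (ideal {ideal:.1f}x)")
print(f"scaling efficiency: {efficiency:.0%}")
```

Because every copy is an independent sequential job, a ratio below the ideal 2.0 in a test like this reflects the machine (for example, memory bandwidth shared by 8 processors) rather than parallel search overhead.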
Last modified: Thu, 15 Apr 21 08:11:13 -0700