Author: Vincent Diepeveen
Date: 02:05:44 01/30/03
On January 29, 2003 at 23:31:19, Matthew Hull wrote:

>On January 29, 2003 at 23:20:11, Vincent Diepeveen wrote:
>
>>On January 29, 2003 at 12:06:50, Robert Hyatt wrote:
>>
>>Bob let me explain to you. DIEP is written for machines which have a bit slower
>>latency for global memory accesses. whereas the world champs 2002 version wasn't
>>like that and would probably act like crafty on that 8 processor, the end of
>>august 2002 versions and further are using a new type of parallellism which
>>doesn't need much locking. Each processor takes care of itself without hurting
>>bandwidth while searching too much.
>>
>>There is no dead slow global locks which is killing the 8 processor thing of
>>course.
>>
>>therefore it works great for example at cc-NUMA machines and all types of Xeon
>>machines.
>
>
>Wow dude. Impressive. Could you supply some time-to-ply benchmarks for Diep on
>8-way Xeon vis-a-vis 4-way Xeon. That would refute the proffessor like nothing
>else.

Right now I have no system time on an 8-way Xeon. I do have numbers for 1..n processors on a cc-NUMA SGI Origin 3800, where so far I have tested mostly up to 16 CPUs, and I plan to extend that with short tests to more than 32 CPUs.

It is not easy to get a good speedup with a lot of processors. But a good speedup is something completely different from a near-linear scaling in nodes per second.

Bob is complaining about his Crafty not getting good nps. Well, we already know that from the dual K7: a single CPU gets 1 million nps or so, and the dual only about 1.5 million nps (forgive me 100k nps more or less here); it just doesn't get a good sequential speedup even there.

So it is trivial that Crafty *won't* run well on an 8-CPU Xeon, not to mention a cc-NUMA machine. Extending that claim to other software as well I find in very bad taste. Saying that it is very, very hard to write software that runs well on them is a completely different statement.
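To make the difference between nps scaling and speedup concrete, here is a minimal Python sketch. It is not code from DIEP or Crafty; the function names are hypothetical, the nps figures are the rough dual K7 numbers mentioned above, and the search times are invented purely for illustration.

```python
def nps_scaling(nps_single: float, nps_parallel: float) -> float:
    """Ratio of nodes per second: how much raw search throughput grew."""
    if nps_single <= 0:
        raise ValueError("nps_single must be positive")
    return nps_parallel / nps_single


def time_speedup(seconds_single: float, seconds_parallel: float) -> float:
    """Absolute speedup: how much sooner the same depth is reached."""
    if seconds_parallel <= 0:
        raise ValueError("seconds_parallel must be positive")
    return seconds_single / seconds_parallel


def efficiency(ratio: float, cpus: int) -> float:
    """Fraction of the ideal n-fold gain that a ratio represents."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return ratio / cpus


# Rough dual K7 figures from the post: ~1.0M nps on one cpu, ~1.5M on two.
scaling = nps_scaling(1.0e6, 1.5e6)
print(f"nps scaling on 2 cpus: {scaling:.2f}x "
      f"({efficiency(scaling, 2):.0%} of ideal)")

# Invented timings (assumption, not measured): same position, same depth.
speedup = time_speedup(120.0, 100.0)
print(f"time-to-depth speedup on 2 cpus: {speedup:.2f}x "
      f"({efficiency(speedup, 2):.0%} of ideal)")
```

The two numbers answer different questions: the first says how many more nodes per second the machine delivers, the second says how much earlier the search finishes. A program can score well on one and poorly on the other.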
Speedup on the machines is a different matter, of course. For DIEP, 8 processors is the breaking point where the splitting algorithm that works very well on x86 no longer works that great, and where the splitting that works well for many processors does not work well yet. Special tuning would be needed for how to split at 8 CPUs compared to 2..4, IMHO. The way in which I split on x86 works pretty OK at 8 CPUs, but it can be optimized further to avoid CPUs sitting idle for several milliseconds.

From my viewpoint, the way DIEP splits has been optimized for 2 processors (the x86 version also works well for 4 processors) and for n processors (n > 16). The speedup at 16 processors is currently not very good.

Of course, that is the *absolute* speedup. No cheating games like the Feldmann group, who focus on the number of nodes searched instead of on search times. Of course, the ideal way to get a 100% speedup with the Feldmann way of measuring how good their algorithms are is a very buggy program that by accident can only use 1 CPU; that will then always get a 100% speedup, whereas in my model it is n times slower :)

>Sincerely,
>Matt
>
>
>>
>>Now you have some examples of software written for fast latency shared memory
>>machines and then claim the thing is slower because the software isn't written
>>for such types of machines?
>>
>>That already should give you the answer. Writing parallel programs is 1 thing.
>>Writing something that works well without inventing numbers yourself is another
>>thing.
>>
>>
>>>On January 29, 2003 at 11:38:37, Vincent Diepeveen wrote:
>>>
>>>>On January 28, 2003 at 10:33:15, Robert Hyatt wrote:
>>>>
>>>>>On January 28, 2003 at 09:07:35, Vincent Diepeveen wrote:
>>>>>
>>>>>>On January 28, 2003 at 03:33:44, Mig Greengard wrote:
>>>>>>
>>>>>>>According to the tech I talked with, Amir and Shay were testing both machines
>>>>>>>before the match to see which one they would use. To my knowledge it wasn't
>>>>>>>decided until a day or two before the match.
>>>>>>>Obviously there isn't a big
>>>>>>>difference in performance.
>>>>>>>
>>>>>>>Saludos, Mig
>>>>>>>http://www.chessninja.com
>>>>>>
>>>>>>thanks.
>>>>>>
>>>>>>DIEP onto the 8 processor 1.6 would be running 16 processes and speed would
>>>>>>be about expressed in K7:
>>>>>> 8 x 1.6 Ghz / 1.4 = 9 Ghz
>>>>>
>>>>>
>>>>>No it wouldn't. You haven't tried an 8-way intel box yet. It doesn't scale
>>>>>nearly as well as the 2-way and 4-way intel boxes do. The chipset for
>>>>>supporting 8 cpus is simply not very good...
>>>>
>>>>DIEP isn't demanding much bandwidth Bob in case you missed it, it works
>>>>great on a cc-NUMA machine too.
>>>
>>>It demands _enough_ bandwidth. My comment wasn't only about "crafty" It was
>>>about the 8-way boxes in general. I ran on a dell 8450, with 8 700mhz xeon
>>>processors, and it was about 1.5X faster than my box. And again, _not_ with
>>>Crafty. I ran 8 copies of the same thing on the 8450, 4 copies on the quad,
>>>and compared the total run times. The 8450 was only about 50% faster when it
>>>should be 100% based on clock...
>>>
>>>
>>>
>>>>
>>>>>The 8-way box using the same clock speed for the processors will only be about
>>>>>1.5X faster than the 4-way box, and that doesn't count parallel search overhead
>>>>>at all.
>>>>
>>>>That's not true. It's 8 times faster for good software. Of course there is
>>>>algorithmic loss but there is no sequential loss unless the software sucks,
>>>>to say it rude.
>>>
>>>Have you ever run on one? Of course not. I have. So your "that's not true"
>>>is simply nonsense... There are _plenty_ of good benchmarks that can be used
>>>to draw conclusions about the 8-way memory bottleneck problem.
>>>
>>>It _might_ be 8x faster if you can fit in the L2 cache (this machine had
>>>2mb of L2 per processor compared to my 1mb on my quad 700). But if you have
>>>any memory bandwidth at all, it has a problem. And a 8-probe hash table is
>>>more than enough to highlight the problem.
>>>
>>>
>>>
>>>
>>>>
>>>>Doesn't say that it is easy to make software that can handle the latencies.
>>>>
>>>>It sure isn't easy to make a chessprogram that is having a good speedup
>>>>(without a too big sequential loss first like Zugzwang which was slowed down
>>>>first like 100 times or so in order to then have a decent speedup at like
>>>>256 processors; 50% speedup even incredible much i would be *very* happy with
>>>>around 15% already).
>>>>
>>>>But it is possible to make.
>>>>
>>>>DIEP is such a program that shows it can. DIEP runs like the sun on 8 cpu's
>>>>(2 nodes quad SGI), even at the slowest partitions (slowest latency speeds
>>>>are of course at the biggest partitions: 512 cpu partition).
>>>>
>>>>A 8 processor Xeon is hell for pc software like Fritz, Junior, Crafty, but it
>>>>is very good for DIEP.
>>>>
>>>>Best regards,
>>>>Vincent
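The two back-of-the-envelope numbers in the quoted exchange (the "8 x 1.6 Ghz / 1.4" estimate and the 8-copies versus 4-copies throughput test) can be reproduced with a small Python sketch. The function names are hypothetical, the run times are invented so that the ratio matches the roughly 1.5X reported, and reading the 1.4 as a per-clock conversion factor to K7 is an assumption, since the post does not say what it stands for.

```python
def throughput_ratio(jobs_a: int, seconds_a: float,
                     jobs_b: int, seconds_b: float) -> float:
    """Jobs per second on box B divided by jobs per second on box A."""
    if min(jobs_a, jobs_b) <= 0 or min(seconds_a, seconds_b) <= 0:
        raise ValueError("job counts and run times must be positive")
    return (jobs_b / seconds_b) / (jobs_a / seconds_a)


def k7_equivalent_ghz(cpus: int, clock_ghz: float,
                      per_clock_divisor: float) -> float:
    """Clock-based estimate: cpus * clock / divisor (divisor is an assumption)."""
    if per_clock_divisor <= 0:
        raise ValueError("per_clock_divisor must be positive")
    return cpus * clock_ghz / per_clock_divisor


# Invented run times (not measured): 4 copies on the quad in 90 s,
# 8 copies on the 8-way in 120 s, both at the same clock speed.
measured = throughput_ratio(4, 90.0, 8, 120.0)
ideal = 8 / 4  # twice the cpus at the same clock
print(f"8-way vs 4-way throughput: {measured:.2f}x "
      f"(ideal {ideal:.2f}x, {measured / ideal:.0%} of ideal)")

# The estimate from the quoted post, which rounds the result to 9 GHz.
print(f"clock-based estimate: {k7_equivalent_ghz(8, 1.6, 1.4):.2f} GHz")
```

The clock-based estimate assumes perfect scaling across all 8 processors, while the throughput test measures how much of that scaling a given workload actually gets; the disagreement in the thread is about which workloads fall short.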
Last modified: Thu, 15 Apr 21 08:11:13 -0700