Author: Robert Hyatt
Date: 10:15:38 01/30/03
On January 29, 2003 at 23:20:11, Vincent Diepeveen wrote:

>On January 29, 2003 at 12:06:50, Robert Hyatt wrote:
>
>Bob let me explain to you. DIEP is written for machines which have a bit slower
>latency for global memory accesses. whereas the world champs 2002 version wasn't
>like that and would probably act like crafty on that 8 processor,

Before you ramble on and on, re-read what I posted. I did _not_ just test "crafty" on that 8450. I tested _several_ different kinds of programs, including one hand-coded number-cruncher that would fit entirely in L2. If your program needs _any_ significant memory bandwidth, the 8-way box is simply bad. The 4-way boxes use 4-way memory interleaving to offset the 4 CPUs wanting something from memory at the same time. But the 8-way box does _not_ do 8-way interleaving, so the scaling drops off badly.

> the end of
>august 2002 versions and further are using a new type of parallellism which
>doesn't need much locking. Each processor takes care of itself without hurting
>bandwidth while searching too much.

Locking is _not_ a performance issue in Crafty. I have told you this many times. Perhaps one day you will actually listen.

>
>There is no dead slow global locks which is killing the 8 processor thing of
>course.

In Crafty, I typically set/test that "dead slow global lock" about 2000 times in a 3 minute move. I don't believe that can even be measured in the minuscule amount of time it burns. Try another tack.

>
>therefore it works great for example at cc-NUMA machines and all types of Xeon
>machines.
>
>Now you have some examples of software written for fast latency shared memory
>machines and then claim the thing is slower because the software isn't written
>for such types of machines?

No. I simply ran _several_ classic programs, and most were _not_ threaded. As I said before, I ran one copy for 3 minutes. I then ran two separate copies for three minutes. I repeated until I got to 8 copies running, with _zero_ SMP locks.
And by the time I got to 8, the test was taking about 4.5 minutes to run. You do the math. With 8 compute-bound processes, it should have run in 3 minutes every time. The conclusion was memory scaling. You should visit the linux-smp mailing list sometime. This kind of testing is pretty common, to see what the _hardware_ can do without factoring in any threading overhead at all.

>
>That already should give you the answer. Writing parallel programs is 1 thing.
>Writing something that works well without inventing numbers yourself is another
>thing.

Since you seem to be the one always "inventing numbers", I suppose you should know more about that statement than me. But I'm not talking about _parallel_ programs. I am talking about 8-way hardware efficiency, as I mentioned. Try again, once you read and understand what is being discussed. Or is this going the same way as "the PIV can only cache 512MB of RAM" or "the PIV is much worse at branch prediction because it has fewer entries in the BTB" or some other such completely nonsensical statement that you have made??? One day you are actually going to say something that is _right_. I anxiously await that day.

>
>
>>On January 29, 2003 at 11:38:37, Vincent Diepeveen wrote:
>>
>>>On January 28, 2003 at 10:33:15, Robert Hyatt wrote:
>>>
>>>>On January 28, 2003 at 09:07:35, Vincent Diepeveen wrote:
>>>>
>>>>>On January 28, 2003 at 03:33:44, Mig Greengard wrote:
>>>>>
>>>>>>According to the tech I talked with, Amir and Shay were testing both machines
>>>>>>before the match to see which one they would use. To my knowledge it wasn't
>>>>>>decided until a day or two before the match. Obviously there isn't a big
>>>>>>difference in performance.
>>>>>>
>>>>>>Saludos, Mig
>>>>>>http://www.chessninja.com
>>>>>
>>>>>thanks.
>>>>>
>>>>>DIEP onto the 8 processor 1.6 would be running 16 processes and speed would
>>>>>be about expressed in K7:
>>>>> 8 x 1.6 Ghz / 1.4 = 9 Ghz
>>>>
>>>>
>>>>No it wouldn't.
>>>>You haven't tried an 8-way Intel box yet. It doesn't scale
>>>>nearly as well as the 2-way and 4-way Intel boxes do. The chipset for
>>>>supporting 8 CPUs is simply not very good...
>>>
>>>DIEP isn't demanding much bandwidth Bob in case you missed it, it works
>>>great on a cc-NUMA machine too.
>>
>>It demands _enough_ bandwidth. My comment wasn't only about "crafty". It was
>>about the 8-way boxes in general. I ran on a Dell 8450, with 8 700MHz Xeon
>>processors, and it was about 1.5X faster than my box. And again, _not_ with
>>Crafty. I ran 8 copies of the same thing on the 8450, 4 copies on the quad,
>>and compared the total run times. The 8450 was only about 50% faster when it
>>should be 100% based on clock...
>>
>>
>>
>>>
>>>>The 8-way box using the same clock speed for the processors will only be about
>>>>1.5X faster than the 4-way box, and that doesn't count parallel search overhead
>>>>at all.
>>>
>>>That's not true. It's 8 times faster for good software. Of course there is
>>>algorithmic loss but there is no sequential loss unless the software sucks,
>>>to say it rude.
>>
>>Have you ever run on one? Of course not. I have. So your "that's not true"
>>is simply nonsense... There are _plenty_ of good benchmarks that can be used
>>to draw conclusions about the 8-way memory bottleneck problem.
>>
>>It _might_ be 8x faster if you can fit in the L2 cache (this machine had
>>2MB of L2 per processor compared to my 1MB on my quad 700). But if you need
>>any memory bandwidth at all, it has a problem. And an 8-probe hash table is
>>more than enough to highlight the problem.
>>
>>
>>
>>
>>>
>>>Doesn't say that it is easy to make software that can handle the latencies.
>>>
>>>It sure isn't easy to make a chessprogram that is having a good speedup
>>>(without a too big sequential loss first like Zugzwang which was slowed down
>>>first like 100 times or so in order to then have a decent speedup at like
>>>256 processors; 50% speedup even incredible much i would be *very* happy with
>>>around 15% already).
>>>
>>>But it is possible to make.
>>>
>>>DIEP is such a program that shows it can. DIEP runs like the sun on 8 cpu's
>>>(2 nodes quad SGI), even at the slowest partitions (slowest latency speeds
>>>are of course at the biggest partitions: 512 cpu partition).
>>>
>>>A 8 processor Xeon is hell for pc software like Fritz, Junior, Crafty, but it
>>>is very good for DIEP.
>>>
>>>Best regards,
>>>Vincent
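For anyone who wants to "do the math" on the numbers in this thread, here is a small Python sketch of the arithmetic. It takes the figures quoted above (one copy alone in about 3 minutes, 8 concurrent copies in about 4.5 minutes, the 8450 at about 1.5X the quad at the same 700MHz clock, about 2000 lock operations per 3-minute move). The per-lock cost LOCK_COST_S is a hypothetical number picked only for illustration; nobody measured it here. The function names are made up for this sketch and do not come from Crafty or any benchmark suite.

```python
"""Back-of-the-envelope arithmetic for the throughput tests described in the thread.

All inputs are the figures quoted in the posts, except LOCK_COST_S, which is a
HYPOTHETICAL per-lock cost used only to illustrate the order of magnitude.
"""

BASE_RUN_S = 3 * 60            # one copy running alone: about 3 minutes
EIGHT_COPY_RUN_S = 4.5 * 60    # eight independent copies at once: about 4.5 minutes
LOCK_COUNT = 2000              # lock set/test operations per 3-minute move
LOCK_COST_S = 1e-6             # HYPOTHETICAL cost of one lock operation (assumption)


def per_copy_efficiency(base_s: float, observed_s: float) -> float:
    """Fraction of its stand-alone speed each copy keeps when run concurrently.

    Independent compute-bound copies share no data and take no locks, so on
    perfectly scaling hardware each would still finish in base_s. Anything
    slower points at a shared resource such as the memory system.
    """
    return base_s / observed_s


def effective_cpus(copies: int, base_s: float, observed_s: float) -> float:
    """Aggregate throughput of `copies` concurrent runs, in stand-alone-CPU units."""
    return copies * per_copy_efficiency(base_s, observed_s)


def fraction_of_ideal(measured_speedup: float, cpus_new: int, cpus_old: int) -> float:
    """Measured speedup of one box over another, relative to linear scaling
    in CPU count at the same clock speed."""
    return measured_speedup / (cpus_new / cpus_old)


def lock_overhead(lock_count: int, cost_s: float, total_s: float) -> float:
    """Fraction of total search time spent on lock operations."""
    return lock_count * cost_s / total_s


if __name__ == "__main__":
    eff = per_copy_efficiency(BASE_RUN_S, EIGHT_COPY_RUN_S)
    print(f"per-copy efficiency at 8 copies: {eff:.3f}")
    print(f"effective CPUs out of 8: {effective_cpus(8, BASE_RUN_S, EIGHT_COPY_RUN_S):.2f}")
    print(f"8-way vs 4-way, fraction of ideal: {fraction_of_ideal(1.5, 8, 4):.2f}")
    print(f"lock overhead per move: {lock_overhead(LOCK_COUNT, LOCK_COST_S, BASE_RUN_S):.2e}")
```

With those inputs, each of the 8 copies keeps about 0.67 of its stand-alone speed, so the box delivers roughly 5.3 CPUs' worth of throughput out of 8, and the 8450-vs-quad result sits at 75% of linear scaling. Even with the hypothetical 1 microsecond per lock, 2000 locks come to about 0.001% of a 3-minute move, which is the point being made about locking not being the bottleneck.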
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.