Author: Vincent Diepeveen
Date: 08:56:48 09/03/02
We all know how many failures parallel programs developed by scientists have been in the past years. This year's DIEP show at the TERAS was no exception. I had only 3 days of preparation time on the machine (and until 5 days before the tournament I wasn't sure whether I would get system time *at all*). However, sponsors want to hear how well your thing did. This was a 1024-processor machine (maximum allocation 512 processors within 1 partition of shared memory), of which you get 60, with memory bandwidth 2 times slower than local RAM, and let's not even *start* to discuss the latency, otherwise you will never start to fear DIEP using that machine. All I can say about it is that the Zugzwang that was slowed down 20 times ran in 1999 on a machine with lower latency...

I'm working hard now to get a DIEP DTS NUMA version ready. It is DTS because it splits dynamically wherever it wants to. Over a month of full-time work has gone into it now. Tests on a dual K7 as well as on two supercomputer processors have been very positive. Nevertheless I worried about how to report on it, so I checked out the article from Robert Hyatt again.

Already in 1999, when I had implemented a PC DTS version, I wondered why I never got near Bob's speedups when I was not forward pruning other than nullmove. With the 1999 world championship version I had great speedups, but I could explain all of them by the forward pruning I was using at the time. Never did I get close, even on a dual Xeon or quad Xeon, to the speedups Bob reported for his DTS version described in 1997. I concluded that it had to do with a number of things, encouraged by Bob's statements.

In 1999 Bob explained that splitting was very cheap on the Cray. He copied a block with all data, 64KB, from processor 0 to P1 within 1 clock on the Cray. I didn't know much about Crays or supercomputers at the time, except that they were out of my budget, so I believed it. However, I have a good memory for certain numbers, so I have remembered his statement very well.
In 2002 Bob explained that the Cray could copy 16 bytes each clock. A BIG contradiction to his 1999 statement. No one here will wonder about that, because regarding Deep Blue we have already seen hundreds of contradicting statements from Bob. Anyway, that of course makes splitting on the Cray very expensive, considering Bob copied 64KB of data for each split. Crafty is no exception here.

I never believed the 2.0 speedup for 2 processors in his table on page 16, because if I do a similar test I sometimes also get > 2.0, but usually less. Singular extensions hurt DIEP's speedup incredibly, but even today I cannot get, within a few minutes of search, to the speedup Bob achieved in his 1997 article.

In 1999 I wondered why his speedup was so good. When I asked, Bob concluded that he split in a smarter way. So obviously I then asked how he split in Cray Blitz, because what Bob is doing in Crafty is too horrible for DIEP to get a speedup much above 1.5 anyway. The answer was: "do some statistical analysis yourself on game trees to find a way to split well it can't be hard, i could do it too in cray blitz but my source code is gone. No one has it anymore".

So you can feel my surprise when, after 1999, he suddenly had data of Crafty versus Cray Blitz, which Bob quotes in CCC to this day to prove how good his thing was. Anyway, I can analyze games as an FM, so I already knew a bit about how good this Cray Blitz was. I never paid much attention to Bob's lies here. I thought he was doing this to save himself the time of digging up old source code.

Now, after a month of full-time work on DIEP at the supercomputer, and having it working great on a dual (with very little overhead) but still with a bad speedup, I started worrying about my speedup and the future article to write about it. A possible explanation for the bad speedup of today's software, when compared to Bob's thing from 1993 that he wrote about in 1997, is perhaps nullmove.
Bob still denies this, despite a lot of statistical data from loads of positions (150 positions tried in total), even with CRAFTY. Bob doesn't find those results significant. He also says that not a single one of MY tests is valid because I have a stupid PC with 2 processors and bad RAM; a dual would hurt Crafty's performance too much. This is because I also concluded that the speedup Crafty gets here is between 1.01 and 1.6, and not 1.7. Data suggests that Crafty's speedup on his own quad is about 2.8, where he claims 3.1.

Then Bob referred back to his 1997 thesis to say that the test method wasn't good, because to get that 2.8 we used cleared hashtables, and in his thesis he cheats a little by not clearing the tables at all. To simulate a game-playing environment that's OK, of course.

However, there is a small problem with his article. The search times and speedup numbers are complete fraud. If I divide the times of 1 cpu by the speedup Bob claims he has, I get nearly perfect numbers. Below is the result for the first 10 positions, based upon Bob's article of March 1997 in ICCA issue #1 of that year; the tables with the results are on page 16.

When DIEP searches a position, the result is always a weird number. If I claim a speedup of 1.8, then it is usually 1.7653 or 1.7920 or 1.8402 and so on. Not with Bob. Bob knows nothing about statistical analysis of data (I must plead innocent here too, but at least I am not STUPID like Bob here):

pos   2 cpus   4 cpus   8 cpus   16 cpus
 1    2.0000   3.40     6.50     9.09
 2    2.00     3.60     6.50     10.39
 3    2.0000   3.70     7.01     13.69
 4    2.0000   3.90     6.61     11.09
 5    2.0000   3.6000   6.51     8.98876
 6    2.0000   3.70     6.40     9.50000
 7    1.90     3.60     6.91     10.096
 8    2.000    3.700    7.00     10.6985
 9    2.0000   3.60     6.20     9.8994975 = 9.90
10    2.000    3.80     7.300    13.000000000000000

This clearly PROVES that he has cheated completely on all search times from 1 processor to 8 processors. Of course, now that I am running on supercomputers myself, I know what the problem is.
A month ago I only needed a 30-minute look to see what the problem is in Crafty, and most likely that was also the problem in Cray Blitz. The problem is that Crafty copies 44KB of data or so (Cray Blitz 64KB), and while doing that it is using smp_lock. That's too costly with more than 2 cpus. This shows he completely lied about his speedups. All times from 1 to 8 cpus are complete fraud.

There is, however, also evidence that he didn't compare the same versions. The Cray Blitz node counts are also weird. The more processors you use, the more overhead you have, obviously. Please don't get mad at me for calculating it in the following simple but very convincing way. I will do it only for his first node counts at 1..16 cpus. The formula is:

    (nodes at i cpus / speedup at i cpus) * speedup at the next cpu count

1 to 2 cpus: we don't need the math. If you need exactly 2 times less time to get there, but you need more nodes at more cpus (where you need expensive splits), then that's already weird of course, though not impossible.

2 to 4 cpus: 3.4 * (89052012 / 2.0) = 151388420.4 nodes
Bob needed: 105,025,123
This in itself is possible. Simply something like 40% extra overhead for 4 processors which 2 do not have. This is very well possible.

4 to 8 cpus: 6.5 * (105025123 / 3.4) = 200,783,323 nodes
Bob needed: 109 million nodes
That means at 8 cpus the overhead is already rapidly approaching 100%. This is very well possible. The more cpus, the bigger the overhead.

8 to 16 cpus: 9.1 * (109467495 / 6.5) = 153,254,493 nodes
Bob needed: 155,514,410

My dear fellow programmers. This is impossible. Where is the overhead? The at least 100% overhead? More likely a factor 3 overhead. The only explanation I can come up with is that the node counts for 2..8 processors were created by a different version of Cray Blitz than the 16-processor version.
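The projection arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the post's own formula, using the node counts and speedups quoted in the post; the names `projected_nodes` and `steps` are made up for illustration, and reading the projected/reported ratio as "overhead" is the post's interpretation, not an established metric.

```python
def projected_nodes(nodes, speedup_from, speedup_to):
    """Scale a node count by the ratio of two speedups (the post's formula)."""
    return nodes / speedup_from * speedup_to


# Each step: (cpus_from, cpus_to, nodes at cpus_from,
#             speedup at cpus_from, speedup at cpus_to, nodes reported at cpus_to)
# All figures are the ones quoted in the post for the first position.
steps = [
    (2, 4, 89_052_012, 2.0, 3.4, 105_025_123),
    (4, 8, 105_025_123, 3.4, 6.5, 109_467_495),
    (8, 16, 109_467_495, 6.5, 9.1, 155_514_410),
]

for cpus_from, cpus_to, nodes, s_from, s_to, reported in steps:
    projected = projected_nodes(nodes, s_from, s_to)
    ratio = projected / reported  # > 1: fewer nodes reported than projected
    print(f"{cpus_from:>2} -> {cpus_to:<2} cpus: projected {projected:15,.1f}, "
          f"reported {reported:13,d}, ratio {ratio:.2f}")
```

With these inputs the ratio comes out at roughly 1.44 for 2 to 4 cpus, 1.83 for 4 to 8 cpus, and 0.99 for 8 to 16 cpus, which is the pattern the argument above relies on.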
From the single-cpu version we already know the number of nodes has to be weird, because it is using a smaller hashtable (see page 4.1 in the article, second line there after 'testing methodology'). We are talking about mass fraud here. Of course this article is from 5 years ago, and I do not know whether he created the table in 1993.

How am I going to tell my sponsor that my speedup won't be the same as that from the 1997 article? Whom do I compare to, Zugzwang? It 'only' had, on paper, a 50% speedup out of 512 processors. Of course that is also not realistic. However, Feldmann documented most of the things he did to cripple Zugzwang in order to get a better speedup. A well-known trick is to kick out nullmove and only use normal alpha-beta instead of PVS or other forms of search. Even Deep Blue did that :)

But what do you guys think of this alternative bookkeeping from Bob?

Best regards,
Vincent