Author: Don Dailey
Date: 17:16:44 12/12/97
On December 12, 1997 at 18:22:48, Bruce Moreland wrote:
>
>On December 12, 1997 at 14:29:47, Willie Wood wrote:
>
>>That's quite a turnaround for a bit of "fiddling." I saw a couple of
>>those games last night, and it didn't look to be any contest (assuming
>>Klamath is mcp7).
>>
>>I guess those results represent too small a sample to be significant.
>>How many games do you think are required to get a good sample? I'm
>>interested because, in developing my own program, I want to know how
>>much testing is usually done to test program changes. Seems like 10
>>games is not enough.
>
>The fiddling didn't do it. My program lost another one this morning for
>the same reason it lost the other two -- it let a passer get too far
>advanced. Perhaps it lost one at the WMCCC (against Junior) for the
>same reason. Obviously something I can do better. But to say that I
>did some magic thing to Ferret that made it win a few in a row would be
>wrong. I just got sick of that passer problem and tried to fix it in
>the middle of the day. I failed, apparently, since it happened again
>today.
>
>Note please, in the interests of fairness, that my computer is
>significantly faster than the one running MChess. I have a 533 mhz
>Alpha, and Klamath has a 300 Mhz P2. My Alpha is like 30% faster, I
>don't remember the exact figure. Mine is also automatic so I pick up a
>few seconds now and then, and I don't try funky RxB experiments :-)
>
>The number of games you need to prove a point depends upon the point you
>are trying to prove, and the definition of "prove" that you are using.
>
>If you are trying to prove that program A plays more interesting chess
>than program B, you have to rely on your own eyeball.
>
>If you want to prove that program A will beat program B at least 51% of
>the time on equal hardware, the number of games you have to play depends
>upon the result you get in the games.
>
>You can figure out the probability that two equal programs would
>generate a given result. If this probability falls below some
>threshold, like 5% or 2% or 1%, depending upon how picky you are, you
>can say that program A beats program B most of the time, with pretty
>good confidence.
>
>If the first few games produce an extremely lop-sided result, for
>instance 4-0, you can figure out the odds of this happening by chance.
>
>For instance, assume that 40% of games are draws and 35% are won by
>white and 25% are won by black. Assuming you get a 4-0 result, and play
>white twice and black twice, the odds of this happening are 0.35 * 0.35
>* 0.25 * 0.25, which is 0.00765625, which is a surprisingly small
>number.
>
>Assuming my estimate of winning and drawing percentages above is
>accurate (I have no idea if it is), this means that you will get a 4-0
>result for a particular program less than 1% of the time, if the
>programs are in fact equal in strength.
>
>Note that this isn't showing that A is a lot better than B, it is just
>showing that A is at least a little better than B. If you want to show
>that A is a lot better than B, you'd need to do some different math.
>
>Sometimes your conclusion will be wrong, but it should be fairly rare
>that this is the case.
>
>If you go through a longer match, and look for a string of four wins in
>a row, you haven't proven that one program is better if you find it,
>because you might just be picking the nicest cherries out of the basket.
>
>If your result isn't as lop-sided as 4-0, you can find that you need
>*tremendously* long matches to prove that one program is better than
>another one. I have done very long matches (hundreds of games) between
>programs, and even though one side wins distinctly more games, I can't
>prove that one program is better than the other. If you see an apparent
>edge for one program, but you determine that it is 25% likely that the
>program that scored worse is actually the better program, how can you
>feel great about the result? I can't.
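The 4-0 arithmetic quoted above can be checked with a few lines of Python. This is only a sketch: the 35% / 25% / 40% rates are the ones assumed in the quoted post (whose author says he has no idea whether they are accurate), and the function name `sweep_probability` is made up for illustration.

```python
# Per-game outcome rates for two equal programs, as assumed in the quoted
# post. These are illustrative guesses, not measured values.
P_WHITE_WIN = 0.35
P_BLACK_WIN = 0.25
P_DRAW = 0.40  # not used for a sweep, listed for completeness


def sweep_probability(games_as_white: int, games_as_black: int) -> float:
    """Chance that one particular program wins every game of the match,
    assuming the games are independent."""
    return (P_WHITE_WIN ** games_as_white) * (P_BLACK_WIN ** games_as_black)


# One particular program wins 4-0, playing white twice and black twice.
p_particular = sweep_probability(2, 2)

# Either program sweeping is twice as likely, since both play the same
# number of games with each color and the two sweeps cannot both happen.
p_either = 2 * p_particular

print(f"4-0 for a particular program: {p_particular:.8f}")
print(f"4-0 for either program:       {p_either:.8f}")
```

Under these assumptions the first figure matches the 0.00765625 in the quoted post. The second line is the number to look at if you did not decide in advance which program you expected to win.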
Most people are overly impressed by small matches. It turns out that you need a whole lot of games, hundreds in fact, to measure the relative strengths of 2 players. If one player is a lot stronger, it will become obvious quicker, but to quantify HOW MUCH stronger will still require a lot of games. It is really difficult and time consuming to figure out which of two nearly equal programs is stronger. Also, in some cases there may be a non-transitive relationship where A beats B, B beats C and yet A cannot beat C.

>I haven't done the math today, but intuitively I would be very
>mis-trustful of results like 55-45, for instance. It would be easy to
>say, oh, the one program must be better than the other, but this isn't
>the case.

How right you are! In testing I've done, I have seen scores more lopsided than this based on 100 games or so turn completely around after another 200 games or so. It's to be expected.

I remember a company once years ago producing a glossy document describing and annotating a 10 game match between its new chess computer and the previous model. It was Morphy vs the Spracklen 2.5 program if I remember correctly. The match result was 7 to 3 in favor of the new thing. But this kind of result is even less reliable than the actual score would suggest, because you have to realize that they would not even be printing the results had it not gone their way! This advertising probably did appeal to the general public and it was fun reading (and maybe Morphy really was stronger) but I definitely took it with a grain of salt!

>Some people think that they can tell which of two programs is stronger,
>by eyeball, but I'm mistrustful of this. If one program beats another
>7-3 (a statistically insignificant result), which one do you think they
>will pick? How often do you see computer vs computer losses in which
>the loser looks good?

I'm extremely distrustful of this too.
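To put rough numbers on the 55-45 and 7-3 scores discussed above, here is a small Python sketch that computes how often two exactly equal programs would produce a score at least that lopsided. It rests on assumptions that are not in the posts: games are independent, color is ignored (each side wins with the same probability), and the draw rate is a parameter you supply. The function names are made up for illustration.

```python
def tail_probability(games: int, min_points: float,
                     p_win: float, p_draw: float) -> float:
    """P(score >= min_points) for one player over `games` independent
    games, scoring 1 for a win and 1/2 for a draw."""
    p_loss = 1.0 - p_win - p_draw
    # dist[h] = probability of having scored exactly h half-points so far
    dist = [1.0]
    for _ in range(games):
        nxt = [0.0] * (len(dist) + 2)
        for h, p in enumerate(dist):
            nxt[h] += p * p_loss       # loss: no half-points
            nxt[h + 1] += p * p_draw   # draw: one half-point
            nxt[h + 2] += p * p_win    # win: two half-points
        dist = nxt
    need = int(round(2 * min_points))
    return sum(dist[need:])


def equal_programs_tail(games: int, min_points: float,
                        draw_rate: float) -> float:
    """Same as above for two equal programs: wins and losses are equally
    likely, and color is ignored (an assumption of this sketch)."""
    p_win = (1.0 - draw_rate) / 2.0
    return tail_probability(games, min_points, p_win, draw_rate)


# 7-3 or better for a named program, with no draws: 176/1024 = 0.171875
print(equal_programs_tail(10, 7, draw_rate=0.0))

# 55-45 or better over 100 games, with no draws and with 40% draws
print(equal_programs_tail(100, 55, draw_rate=0.0))
print(equal_programs_tail(100, 55, draw_rate=0.4))
```

These are one-sided figures for a program named in advance. If either program could have been the one that came out ahead, double them. None of this accounts for the selection effect in the glossy-brochure story, where a result only gets reported when it goes the right way.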
Another anecdote: Some friends and I were playing around with the Super Constellation and then one of the Richard Lang Mephisto things years ago. They were strong chess players. After lots of fun speed games, human vs computer, they gravitated toward the Super Constellation, believing it to be much stronger. But I remember that their results were better against the Constellation. I told them the Mephisto program was much better (based on many games I had played between them.) They did not believe me. So we played 3 or 4 games between the 2 computers and the Mephisto kept winning. They believed the Mephisto was luckier and didn't like its style of play! In their own games they discounted their losses to the Mephisto (you know, "I should have won that game") and were impressed by the sacrifices the Constellation would often play.

People are not very good judges of things like this and they often prefer style to substance. I rarely trust results people report to me.

-- Don
Last modified: Thu, 15 Apr 21 08:11:13 -0700