Author: Jeroen van Dorp
Date: 09:25:42 01/03/01
That's a very nice idea!! I like the challenge. If I understand correctly, basically:

*method*

1. you want to find out how many games you have to play with a chess program before the results become "statistically relevant"
2. you want to find this out by letting the program play itself until the black and white results even out, or at least stay "within acceptable margins"

I think, however, that the problem of statistical significance centres on something else:

*what to determine*

1. how many games does an engine have to play *against other engines* before you can say something about its *relative* strength, or...
2. how many games does an engine have to play *against humans* before you can say something about its *relative* strength

*relative strength in a pool of players*

Both definitions only get value if you can say something about *relative* strength within a fixed pool of players. You won't get that result by letting the engine play against itself. If an engine plays against itself, it will only tell you when the statistical flukes stop showing up: random opening book choices, flaws in the engine that only appear in certain situations and not every time, machine faults, etc. So I don't think you solve the right problem with this solution.

*white first move advantage*

You might solve another problem: that of white's first-move advantage. Looking at the stats of human players (collectively, not individually), you'll see a rough division of 37% wins by white, 34% draws and 29% losses by white. That doesn't necessarily translate to any single game, but it can be seen as the visible effect of white's small first-move advantage. Here we truly have a *pool*, of two: white and black.

*fixed pool*

The only test you could perform to solve that statistical problem of chess engine strength is, IMO, to make a *fixed* pool of chess engines and let them play endlessly against each other. When the changes in relative strength fade out, or become too small to be significant (say, your win/loss predictions turn out accurate for 99% of all subsequent games), you have the figure you're looking for – for *THAT* pool of players/engines. (In my opinion these results are roughly available already. But let's not discuss the SSDF again :))

*example*

A common example of the strength dynamics: you play 1000 games of the engine against itself. Now the stats are 38-33-29 and they don't change significantly anymore, so you state that 1000 games are needed for this (or any) engine to measure its strength. You then play a competition of 1000 games against a lot of engines/opponents, and after those 1000 games your engine's rating is 2500. Now there is a new engine/opponent. You play 1000 games against it. It's stronger; it wins them all. The result is a drop of 400 rating points, to 2100. Now you return to the other pool, without that new engine, and play 1000 games again, with a start rating of 2100. Over those 1000 consecutive games it wins a lot, etc. Because it started out at a weak 2100, its rating at the end will be 2650, because of its "relatively" better performance.... Now the new engine is introduced into the pool. It starts without a rating. It finds its nemesis. Its end rating is 2300, yet it won all its games against your engine, rated 2500....
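To make that pool dependence concrete, here is a rough sketch of the mechanism. All the numbers in it are made up: the "true" strengths, the K-factor, the game counts and the common start rating of 2400 are hypothetical, and draws are ignored for simplicity. The same engine, with the same underlying strength, is rated in two different pools:

# Minimal Elo round-robin sketch (hypothetical numbers, no draws).
import random

def expected_score(r_a, r_b):
    # Standard Elo expectation of player A scoring against player B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def final_rating(true_strengths, rounds=2000, k=10, start=2400):
    # Round-robin with zero-sum Elo updates; every engine starts at `start`,
    # so the pool average stays at `start` and each rating ends up measuring
    # strength *relative to this particular pool*. Returns engine 0's rating.
    n = len(true_strengths)
    ratings = [float(start)] * n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for _ in range(rounds):
        for i, j in pairs:
            # The game result is driven by the hidden "true" strengths.
            p_true = expected_score(true_strengths[i], true_strengths[j])
            result = 1.0 if random.random() < p_true else 0.0
            delta = k * (result - expected_score(ratings[i], ratings[j]))
            ratings[i] += delta
            ratings[j] -= delta
    return ratings[0]

random.seed(1)
# The same engine (true strength 2500) measured in two different pools.
weak_pool   = [2500, 2300, 2350, 2400]
strong_pool = [2500, 2600, 2650, 2700]
print("rating of the 2500 engine in the weak pool:  ", round(final_rating(weak_pool)))
print("rating of the 2500 engine in the strong pool:", round(final_rating(strong_pool)))

The printed rating comes out clearly higher in the weak pool than in the strong one, even though nothing about the engine itself changed: the rating only expresses its strength relative to the pool average, which is pinned at the common start value.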
*parameter weight/influence*

So there's really no statistical flaw at all. What we lack are the *values* of the parameters and their *effects* on the strength of a chess engine: hash tables, endgame tablebases (do they make it stronger or weaker?), choice of algorithm, etc. Why *do* results build up differently from the way human strength/ratings build up? If you can pinpoint the effect of all those parameters on all aspects of the game, and assign a value (weight) to each, you no longer need large numbers of games to determine strength. (It wouldn't surprise me if positional knowledge came out of such an analysis as the biggest factor, but that's speculative.)

I'm interested in your opinion.

Jeroen ;-}
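P.S. A rough back-of-the-envelope sketch of the "when does the statistical noise fade" part, using the 38-33-29 percentages from the example above. The only things added here are the usual 1.96 factor for a 95% confidence interval and the rule of thumb that one percentage point of score is roughly 7 Elo near an even score:

import math

# Win/draw/loss percentages taken from the 1000-game example above.
p_win, p_draw, p_loss = 0.38, 0.33, 0.29
mean = p_win + 0.5 * p_draw                        # expected score per game
var = (p_win * 1.0 + p_draw * 0.25) - mean ** 2    # variance of one game's score (0, 0.5 or 1)
sd = math.sqrt(var)

for n in (100, 1000, 10000):
    half_width = 1.96 * sd / math.sqrt(n)          # ~95% confidence half-width on the match score
    print(f"{n:6d} games: score {mean:.3f} +/- {half_width:.3f}"
          f"  (roughly +/- {700 * half_width:.0f} Elo near a 50% score)")

So even 1000 games still leave an uncertainty of roughly 18 rating points on the match score alone, and, as argued above, that number by itself says nothing about strength relative to a pool.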