Author: José Carlos
Date: 10:36:22 01/03/01
On January 03, 2001 at 12:25:42, Jeroen van Dorp wrote:

>That's a very nice idea!! I like the challenge.
>If I understand correctly, basically
>
>*method*
>1. you want to find out how many games you have to play with a chess program
>before results become "statistically relevant"
>2. you want to find this out by letting the program play itself until the black
>and white results even out, or at least are "within acceptable margins".

Correct.

>I think however the problem about statistical significance focuses around
>something else:
>
>*what to determine*
>1. how many games has an engine to play *against other engines* before you can
>say something about *relative* strength, or…
>2. how many games has an engine to play *against humans* before you can say
>something about *relative* strength

Ok, this is another problem (very important too). My idea was about: how many
games between two programs do I need to know which is stronger? Only between
programs, because if humans are involved, the statistical results become
completely different. And only in an A vs B match, not in pools, because the
experiment only involves two players.

But, regarding what you say, I think human and computer ratings are essentially
different.
- In human games, psychology, happiness/sadness, personal problems, sympathy
for the opponent... and many more things play a role. And humans make big
mistakes from time to time.
- In computer games the stronger program will beat the weaker one often, and no
external factors come into play.

So I expect comp-comp ratings to be stable, while human-human ratings are much
more variable. Also, a small change in a program can make it beat (if the
change is right) its old version most of the time, which would suggest a
strength difference bigger than it really is.

An example: suppose program A is rated 2000 in comp-comp and program B is rated
2400 in comp-comp. I wouldn't be surprised if these programs got, in a "human
world", ratings of A: 2100 and B: 2300. It's only an opinion, anyway.

>*relative strength in a pool of players*
>So both definitions only get value if you can say something about *relative*
>strength in the fixed pool of players. You won't get that result by letting
>the engine play against itself. If an engine plays against itself it will only
>tell you when statistical flaws stop occurring because of random book opening
>choice, flaws in the engine in certain situations not always happening,
>machine faults, etc.
>So I don't think you solve the right problem with this solution.

I want to solve the problem of "how randomness affects a comp-comp match".
That's the only thing I can get from the experiment.

>*white first move advantage*
>You might solve another problem: that of white's first move advantage. Looking
>at stats of human players (collectively, not individually) you'll see a rough
>division of 37% wins by white, 34% draws and 29% losses by white. That doesn't
>necessarily translate to the single situation, but it can be the visualisation
>of the small effect of the white first move advantage.
>Here we truly have a *pool* of two: white and black.

Maybe we can draw some interesting conclusions about this.
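To put a rough number on "how many games do I need", here is a minimal sketch
(in Python, with purely made-up figures: the stronger program is assumed to
score 55% from a 40/30/30 win/draw/loss split) of how the statistical noise in
a match score shrinks as the match gets longer:

import math

# Minimal sketch: when does a small strength edge stand out from the
# random noise of a match?  All probabilities below are assumptions.
p_win, p_draw, p_loss = 0.40, 0.30, 0.30     # hypothetical stronger program

mean = p_win + 0.5 * p_draw                  # expected score per game (0.55)
var = (p_win * (1.0 - mean) ** 2
       + p_draw * (0.5 - mean) ** 2
       + p_loss * (0.0 - mean) ** 2)         # per-game variance of the score
edge = mean - 0.5                            # advantage over an even match

for n in (20, 50, 100, 200, 400, 800):
    se = math.sqrt(var / n)                  # standard error of the match score
    print(f"{n:4d} games: {mean:.2f} +/- {2 * se:.3f} (two standard errors)")

# The edge is only convincing once it clearly exceeds ~2 standard errors,
# i.e. roughly n > (2 * sqrt(var) / edge) ** 2 games; with these made-up
# numbers that is about 275 games for a 55% scorer.

The same arithmetic can be turned around to ask whether a white/black split
like the 37-34-29 one above could still be mere chance after a given number of
games.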
>*fixed pool*
>The only test you could perform to solve that statistical problem of chess
>engine strength is IMO by making a *fixed* pool of chess engines and letting
>them play endlessly against each other.
>When changes in relative strength are fading out, or are becoming too low to
>be significant (say your predictions/win-lose stats will become accurate in
>99% of all following games) you have the figure you're looking for, for *THAT*
>pool of players/engines.

This is exactly an extension of my idea, involving a fixed pool instead of a
single pair. But in this case, you need the "endless" set of games to know what
result to expect. In my idea, you know from the beginning what to expect, and
you concentrate on _when_.

>(In my opinion these results are roughly available. But don't let us discuss
>about SSDF again :))

:)

>*example*
>A common example of the strength dynamics: You play 1000 games of the engine
>against itself. Now the stats are 38-33-29. They don't change significantly
>anymore. So now you state 1000 games are needed for this/any engine to
>calculate its strength.
>
>So you play a competition with a lot of engines/opponents of 1000 games. After
>these 1000 games one of the engines' rating is 2500.
>Now there's a new engine/opponent. You play 1000 games against this new engine.
>It's stronger. It wins all. The result is a drop of 400 rating points to 2100.
>Now you return to the other pool without that new engine. You play 1000 games
>again. Your start rating is 2100. After 1000 consecutive games it won a lot
>etc. Because it started out at a weak 2100, its rating at the end will be 2650
>because of "relatively" better performance…
>Now the new engine is introduced in the pool. It starts without a rating. It
>finds its nemesis. Its end rating is 2300, yet it won all games against your
>engine, rated 2500…

This happens often. (There's a small sketch of the Elo arithmetic behind it at
the end of this post.)

>*parameter weight/influence*
>So there's really no statistical flaw at all. What we lack is the *value* of
>the parameters and their *effects* on the strength of a chess engine.
>Hash tables, endgame tablebases (do they make it stronger/weaker), algorithm
>choice, etc. Why *do* results build up differently from human strength/rating
>buildup?
>If you can pinpoint the effect of all those parameters on all aspects of the
>game, and assign a value (weight), you don't need high game numbers anymore
>for getting its strength.
>(It wouldn't surprise me if positional knowledge would come up in such an
>analysis, but that's speculative)
>
>I'm interested in your opinion.

Difficult to say. For any program, those parameters can have a different
effect. But we could try to estimate a value (and a standard deviation). I
don't know right now, but I'll think about it :)

>Jeroen ;-}

José C.
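Here is that small Elo sketch: a rough look, in Python, at how a rating moves
when an engine loses every game to a stronger newcomer. The opponent's
provisional rating (2600) and the K-factor (10) are made-up assumptions; the
expected-score and update formulas are the standard Elo ones.

def expected(r_a, r_b):
    # Standard Elo expected score of a player rated r_a against one rated r_b.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

rating, opponent, k = 2500.0, 2600.0, 10.0   # assumed ratings and K-factor
for game in range(1, 1001):
    rating += k * (0.0 - expected(rating, opponent))   # score 0.0 = a loss
    if game in (100, 250, 500, 1000):
        print(f"after {game:4d} straight losses: rating {rating:6.0f}")

# Each loss costs k * expected_score points, so the rating falls quickly at
# first and ever more slowly as the gap widens.  The size of the drop depends
# on K, the match length and the opponent's rating, which is part of why the
# same engine can come out of two different pools with very different numbers.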