Computer Chess Club Archives




Subject: Re: A question about statistics...

Author: Mike Byrne

Date: 09:21:56 01/04/04


On January 04, 2004 at 11:46:00, Roger Brown wrote:

>Hello all,
>I have read numerous posts about the validity - or lack thereof actually - of
>short matches between and among chess engines.  The arguments of those who say
>that such matches are meaningless (Kurt Utzinger, Christopher Theron, Robert
>Hyatt et al) typically indicate that well over 200 games are required to make
>any sort of statistical statement that engine X is better than engine Y.
>I concede this point.
>The arguments of the short match exponents typically centre on other
>chessplaying characteristics of an engine which may also be of  interest to a
>user - tactical excitement, daring, amazing moves, positional considerations,
>human like play etc.
>I also agree that this camp has a valid perspective.
>I would like  to conduct an experiment but I need to ask a few questions first:
>(1)  Is there a minimum timecontrol that is statistically relevant to games
>played at classical timecontrols?  That was really one of the things I wanted to
>look at but clearly it requires a pool of such games, consistent hardware, etc.
>I ask this because the long timecontrol devotees have spare hardware, or at
>least hardware over which they exercise an enormous amount of discretion as to
>its use.  Not all of us are in that fortunate position.
>Playing 200 games or more at 60 minutes + (which is still fast chess!) would
>take me to a place where the light does not shine...
>I am thinking that there may be a relationship - particularly as the subject is
>an electronic construct - between long games and short ones.  It may not be
>linear but I cannot believe that it is a coincidence that the long timecontrol
>GMs are also atop the blitz ratings ladder...

There might be some correlation, but IMO it's not close enough.  Statistically,
whatever time control you tested is the one for which your claim is valid.
That is, if you test at 5 0 blitz, you can say: at 5 0 blitz on my machine, I
have a 95% confidence level that program "xyz" is better than program "zyx".
You cannot make a claim about 40/2 when all your games are at 5 minutes per game.

>(2)  What is the statistical minimum of games that I would have to play to be
>able to make some sort of definitive noise?

When you play 100 games, the 95% confidence interval on a rating is roughly
+/- 60 points.  So if you end up with a difference that is greater than 120
points, you can say with 95% confidence that the higher rated program is better
than the lower rated program.  There would still be a 1 in 20 chance that your
test results are not correct.  If they are rated less than 120 points apart
then, statistically, you cannot make a claim either way; you need more games.
The closer the ratings are, the more games you need to achieve a 95%
confidence level.
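As a rough check on the "+/- 60 points at 100 games" figure, here is a minimal Python sketch.  It assumes the usual logistic Elo model, independent games, and a normal approximation to the match score; the function names and the 30% default draw rate are my own illustrative choices, not anything from a rating tool.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score under the logistic model."""
    eps = 1e-9  # clamp so 0% or 100% scores do not blow up the logarithm
    score = min(max(score, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a head-to-head match; z=1.96 is ~95%."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of a result worth 1, 0.5 or 0 points.
    variance = max((wins + 0.25 * draws) / n - score ** 2, 0.0)
    stderr = math.sqrt(variance / n)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))

def games_needed(margin_elo, draw_rate=0.3, z=1.96):
    """Approximate games needed for a +/- margin_elo interval.

    Assumes two evenly matched engines (score near 50%) and a linearised
    Elo curve, so it is only a ballpark figure.
    """
    slope = 1600.0 / math.log(10)        # Elo points per unit of score at 50%
    variance = (1.0 - draw_rate) / 4.0   # per-game variance at a 50% score
    return math.ceil((slope * z) ** 2 * variance / margin_elo ** 2)

if __name__ == "__main__":
    print(elo_interval(35, 30, 35))   # an even 100-game match with 30 draws
    print(games_needed(60), games_needed(20))
```

For an even 100-game match with 30 draws this gives a margin a little under 60 Elo either way, and because the required number of games grows with the inverse square of the margin, cutting the margin to a third takes about nine times as many games.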

>(3)  What is the impact - or theoretical impact - of learning on such a match?
>My personal bias is that if an author implements learning he should be rewarded
>for it and it should be turned on at the beginning of the match.  This speaks to
>positional and book learning.

If learning is implemented correctly, it should help.  I have no idea how much
it would help - I would think it might be worth maybe 50 Elo points - but that
is just a guess.

>(4)  I am also biased towards using the engine's particular book(s).  The
>opening knowledge that a human chessplayer has is his/hers.  An engine should
>have its own book with it as it goes into battle.  Can someone turn off Ms.
>Polgar's opening book?  No?  Then the engine should have its book too....

I can go either way.  To evaluate a program as sold, yes, use the book.
Evaluating the chess playing ability with a generic book, to get an opinion of
the engine itself, is a different approach.

>(5)  The games would be played on my single processor CPU.  That would mean no
>pondering *if* I understand Robert Hyatt's reasoning on the matter (which I
>freely admit may not be the case at all!).

Single machine with a single CPU - pondering should be off.  Right away that
can create differences from results run on two machines or dual CPUs -
pondering can greatly influence the outcome of a game, but IMO, you have no
other choice.

>(6)  Are there any other factors?

Statistically, the 100 games do NOT have to be against the same engine.  You
can design a tournament with 11 engines, have them play a round robin of 10
cycles, and your confidence level for the ratings relative to each other is
just as valid as for a single 100 game match.  Alternatively, you can play 100
rounds with 101 engines, where they each play one game against each other -
statistically it is the same.  The key is simply 100 games.  Think about human
ratings: you do not play 100 games against the same player, you might play 100
games against 90 players - statistically it does not matter.
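To make the two tournament formats above concrete, here is a small Python sketch of the arithmetic.  The helper names are hypothetical; it only counts games in a full round robin and says nothing about the ratings themselves.

```python
def games_per_engine(engines, cycles):
    """Games each engine plays when everyone meets everyone `cycles` times."""
    return (engines - 1) * cycles

def total_games(engines, cycles):
    """Total games in the whole round robin (each pairing counted once)."""
    return engines * (engines - 1) // 2 * cycles

if __name__ == "__main__":
    for engines, cycles in [(2, 100), (11, 10), (101, 1)]:
        print(engines, cycles,
              games_per_engine(engines, cycles),
              total_games(engines, cycles))
```

All three formats give each engine 100 games, but the total number of games to be played on your hardware rises quickly with the number of engines, which matters if machine time is the limiting factor.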

>I really would like a way to prove or disprove the position that:
>(1) Games at shorter timecontrols are essentially worthless and:
>(2) That matches of 1000 games are required to make statistical statements.
>Please feel free to comment BUT what I would really like are some answers to the
>above questions and/or pointers....
