Computer Chess Club Archives




Subject: Re: A question about statistics...

Author: Robert Hyatt

Date: 13:02:56 01/04/04


On January 04, 2004 at 13:30:34, Roger Brown wrote:

>>I think you need to do something in 60 minutes at least, plus some sort of
>>secondary time control or increment.
>Hello Dr. Robert Hyatt,
>That means one hour (plus an increment) per engine, right?

Yes, something in that range.

>This is of course quite distressing.  That timecontrol would yield a game in two
>hours, twelve games a day, eighty-four games in a week!  My computer will begin
>to cry, unborn children will start fidgeting, comets will fall....
>Tell me something, with all the results of the hundreds (thousands (?)) of games
>that your engine has played over the years, is it possible that you could
>extract a rating based on the short timecontrol games (unless 60 minutes is
>short - which it is for human games - in which event the experiment is not
>feasible) against the long timecontrol games?

The problem is that Crafty is not stable.  New releases come out sometimes
twice a week, so comparing old and new games would really be difficult to
interpret.  Commercial programs are easier since they come out once or
twice per year, and there is a longer time to play a significant number of games
with no changes whatsoever to the program.

>I could then take the upper bound of the short timecontrol games as a useful
>starting point for my test.
>Two hour games are not going to work on my machine...

All I can suggest, then, is to go shorter.  I.e., 10 minutes + 10 seconds per
move increment might be a reasonable start, since that will at least avoid
the <1 second moves near the end of a sudden-death time control.
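As a back-of-the-envelope check on throughput (the 60-move game length below is an assumed figure, not from the post), the wall-clock cost of an increment control works out like this:

```python
def game_minutes(base_min, inc_sec, moves=60):
    """Rough wall-clock length of one game, counting BOTH sides, assuming
    each side uses its full base time plus the increment on every move."""
    per_side = base_min + moves * inc_sec / 60.0
    return 2 * per_side

# 10 minutes base + 10 seconds/move, ~60 moves per side:
print(game_minutes(10, 10))   # 40.0 minutes per game, i.e. ~36 games per day
```

So 10+10 cuts Brown's two-hour game to roughly forty minutes while still guaranteeing at least ten seconds of thought per move.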

>>If you look however, you will see an IM win the blitz events on ICC or
>>at other places, because blitz is simply a different game.
>Suggesting that a possible way forward is to construct a blitz ratings list and
>a separate longer timecontrol list.  Now that is going to create all sorts of...
>>This depends on the strength of the two players.  The wider the gap, the
>>fewer games you need to play.  An easy example is to pick two players on ICC
>>and search for all games between them.  Pick one player's perspective and
>>record a win as 1, a draw as .5 and a loss as 0.  After you do a few hundred
>>such games, look at the string of results.  Do you see a consecutive
>>group you could pick that shows A to be stronger?  Another group that would
>>show B to be stronger?  That is what is wrong with a small sample-size.  You
>>might just start off at the front of either of those two groups, and if you
>>stop too soon, you get a biased result.
>Sorry to be such a bother but does the summary below make sense?
>(a)  Player A is much stronger than player B - established by a historical
>examination of the engines' performances on some rating list.
>(b)  Play 100 games between A and B at 5 minutes.  Note the score.
>(c)  Play some reasonable number of games between A and B at a much higher
>timecontrol - one that will not crash my PC or cause the other users to rebel.
>(d)  Compare (b) and (c) and see if they map well, that is, are the results of
>this test a useful example of the predictive power of 5 minute (or some other
>short timecontrol) games?  Does (c) map to (b) and do they map to the results of...

No.  The problem again becomes the number of games.  To see what I mean, play a
200 game match, one minute per game.  Look at the 0-.5-1 results for a single
program.  If you look at the string carefully, you will likely see somewhere
where there is a string of 15+ losses to 5 wins, and then elsewhere you will
find the opposite.  If you play 20 games total, how do you know you didn't
get one of those odd sample sets???
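The streak effect described above is easy to reproduce with a simulation.  This sketch assumes two engines of exactly equal strength and a 30% draw rate (both assumed parameters, not from the post), generates a 200-game result string, and reports the worst and best 20-game stretches:

```python
import random

def simulate_match(n=200, p_win=0.35, p_draw=0.30, seed=1):
    """Simulate n games between two fixed, equal-strength engines and
    return the 0/.5/1 result string from one engine's perspective."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        r = rng.random()
        results.append(1.0 if r < p_win else 0.5 if r < p_win + p_draw else 0.0)
    return results

def extreme_windows(results, k=20):
    """Lowest and highest score over any k consecutive games."""
    sums = [sum(results[i:i + k]) for i in range(len(results) - k + 1)]
    return min(sums), max(sums)

res = simulate_match()
print(sum(res), extreme_windows(res))
# the overall score hovers near 100/200, yet the worst and best 20-game
# windows routinely look like decisive results on their own - which is
# exactly the trap of stopping after a short match
```

A 20-game sample that happens to start inside one of those stretches will confidently report the wrong winner.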

>Does this sound reasonable?
>Of course, between new engines it is not possible to conduct step (a), is it?
>So I guess I will start with Crafty...
>>Between two programs, it can be very significant.  But you can answer this
>>with experimentation, A vs B with learning on, then A with learning on vs
>>B with learning off.
>Back to the original questions: how many games, what timecontrol, etc.  Should I
>be able to do this experiment, then it should answer the learning vs. not
>learning issue - as well as any issue to do with what is the minimum work which
>can be done in order to say something meaningful about the result of a match A
>vs B.

If I were testing this, I would do at least 200 games with learning and
200 without, and even then the margin of error could be very high if you
pick a program that is very close to Crafty's playing strength.  If you pick
one much worse or much better, then fewer games will do fine.
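To put a rough number on "very high," the 95% margin of error on a 200-game match score can be sketched as follows (the 30% draw rate and the 1.96 normal quantile are standard assumptions, not figures from the post):

```python
import math

def score_error(n_games, p=0.5, draw_rate=0.3):
    """Approximate 95% margin of error on the match score fraction,
    treating each game as a 0/.5/1 outcome with mean score p."""
    p_win = p - draw_rate / 2          # win rate needed to average p
    p_loss = 1 - draw_rate - p_win
    # per-game variance of the score around its mean p:
    var = p_win * (1 - p) ** 2 + draw_rate * (0.5 - p) ** 2 + p_loss * p ** 2
    return 1.96 * math.sqrt(var / n_games)

print(round(score_error(200), 3))   # ~0.058: even 200 games pin the score
                                    # down only to about +/- 6 percentage points
```

Against a near-equal opponent that +/- 6% spans everything from a clear win to a clear loss, which is why a mismatched opponent (where the true score sits far from 50%) needs far fewer games.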

>Thanks for your time.
