Computer Chess Club Archives


Subject: Re: Standard deviations -- how many games?

Author: Rolf Tueschen

Date: 18:04:06 01/23/04

On January 23, 2004 at 20:38:01, Dann Corbit wrote:

>On January 23, 2004 at 20:00:30, Rolf Tueschen wrote:
>
>>On January 23, 2004 at 18:33:52, Dann Corbit wrote:
>>
>>>On January 23, 2004 at 18:20:34, Russell Reagan wrote:
>>>
>>>>On January 23, 2004 at 15:24:31, Dann Corbit wrote:
>>>>
>>>>>30 experiments is a fairly standard rule as to when you should start to trust
>>>>>the results for experimental data.
>>>>
>>>>So what does this mean for chess engine matches? You need at least 30 games? Or
>>>>30 matches? If matches, how do you determine how long each match should be?
>>>
>>>It means less than 30 games and you cannot trust the answer.
>>>With more than 30 games, confidence rises.
>>>
>>>I bring up the number 30 because it is important in this case.  If you run (for
>>>instance) a 15-game contest, it would be dangerous to try to draw conclusions
>>>from it.  With 30 games or more, even something that does not perfectly follow a
>>>normal distribution will start to conform to the right answers (e.g., the mean
>>>calculation will be about right, and the standard deviations will be about
>>>right unless the distribution is sharply skewed).
>>>
>>>30 games is the break-even point where deficiencies in the choice of a normal
>>>distribution as a model start to become smoothed over.
>>
>>
>>What measurements are you talking about here? Of course that N is right for
>>normally distributed variables, but what do you "measure" with chess games?
>
>+1, -1, 0

First, thanks for the late-night answer. So I can go to sleep with a good feeling.
In your second post you spoke of three "states". I have a question: are these
states equally probable? Answer: of course NOT! Which one is the most probable?
Answer: one of them, the draw. As a direct chess variable.
So we have: a) three states as the outcome of a single [!] "measurement", b)
different pre-event probabilities for the three outcomes, and, last but not
least, c) a shaky distribution across the three states! Now the big question:
what do you measure, Dann? Three states at a time. Very odd, that. I wouldn't
speak of measurement and statistics at all! But you still believe in 30
"measurements", although you don't have "measurements" and certainly no clearly
defined variable fixed in advance. Very bad for a testing procedure.
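
To make point b) concrete, here is a minimal sketch of the per-game variable
using Dann's own +1/-1/0 scoring. The probabilities are illustrative
assumptions on my part, not measured values:

    # A single game is a three-valued random variable: win = +1,
    # loss = -1, draw = 0.  The probabilities below are assumptions
    # chosen only to illustrate that the three states are not
    # equally likely (draws dominate between equal programs).
    p_win, p_draw, p_loss = 0.3, 0.4, 0.3

    outcomes = [+1.0, 0.0, -1.0]
    probs    = [p_win, p_draw, p_loss]

    mean = sum(x * p for x, p in zip(outcomes, probs))
    var  = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probs))

    print(f"expected score per game: {mean:+.3f}")
    print(f"standard deviation per game: {var ** 0.5:.3f}")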

NB: suppose I tell you that a pool of today's modern programs are all equally
strong, and that any differences you find are purely due to chance. Now prove
to me why, after 30 games, you could still draw valid conclusions, although you
have no exactly defined variable to "measure". A small simulation of exactly
this scenario is sketched below.
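
Here is that scenario as a sketch, again with assumed, illustrative
probabilities: two programs of exactly equal strength, playing 30-game
matches over and over, scored the usual chess way (1 / 0.5 / 0).

    # Two programs assumed EXACTLY equal: per-game win/draw/loss
    # probabilities 0.3 / 0.4 / 0.3 (illustrative).  How often does
    # pure chance alone produce a seemingly clear 30-game result?
    import random

    P_WIN, P_DRAW = 0.3, 0.4
    GAMES, MATCHES = 30, 10_000

    def match_score(games):
        """Total score for program A, counting 1 / 0.5 / 0 per game."""
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < P_WIN:
                score += 1.0
            elif r < P_WIN + P_DRAW:
                score += 0.5
        return score

    margins = [match_score(GAMES) - GAMES / 2 for _ in range(MATCHES)]
    clear = sum(1 for m in margins if abs(m) >= 3)
    print(f"30-game matches 'decided' by 3+ points between equal "
          f"programs: {clear / MATCHES:.1%}")

Under these assumptions the normal approximation puts that fraction at
roughly 15 to 20 percent: chance alone regularly hands one of two identical
programs a clear-looking match win.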

Could you help me understand how any of this is supposed to be addressed by
the SSDF practice of 40-game matches?


>
>>Second question: when you have almost equally strong chess programs, are you
>>implying that after 30 games you can reach a sound conclusion about which one
>>is stronger?
>
>No.  After 30 games you can start to believe the measurements.
>
>>- If you think you can answer with YES, then I doubt it.
>
>So do I.
>
>>Of course, if you then run, as the SSDF does, matches between two unequal
>>programs, you may well get clear results after 5 games.
>
>But only by accident.

I meant outdated programs against the newest ones! The older ones are one, two,
or even three generations behind. I call this bullying.



>
>>But of course, over 30 games the better
>>program will profit.
>
>It will be a good start.
>
>>Also, tests have been known where a directly
>>competing program usually scores fewer points against this outdated program...
>>
>>What I want to say is this. It was often explained here in CCC: for good
>>results you must play on the order of thousands of games. 30 games is just
>>laughable. It is an irrelevant quantity.
>
>30 games is the bare minimum at which the number may have a
>tiny scrap of validity.  If you use fewer than that, you may be reporting pure
>nonsense.


Dann, let's get real. Almost nobody plays matches of only 30 games or fewer.
The fact is that the SSDF tests with 40-game matches. Please tell me now what
you think of their results. Note also that they proudly present their number
one, although the error margins are often bigger than the differences between
the top programs. What should we conclude from that practice? The sketch below
shows how wide a 40-game margin really is.
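
A hedged back-of-the-envelope, with assumed per-game probabilities (draw-heavy,
equal opponents; these are not SSDF's actual figures):

    # Uncertainty of a single 40-game match, and the number of games
    # needed for a +/- 10 Elo resolution.  Scoring: win 1, draw 0.5,
    # loss 0.  Probabilities are illustrative assumptions.
    import math

    p_win, p_draw, p_loss = 0.3, 0.4, 0.3
    games = 40

    mean = 1.0 * p_win + 0.5 * p_draw          # expected score: 0.50
    var = (p_win  * (1.0 - mean) ** 2 +
           p_draw * (0.5 - mean) ** 2 +
           p_loss * (0.0 - mean) ** 2)         # per-game variance
    se = math.sqrt(var / games)                # std. error of mean score

    def elo(s):
        """Logistic score-to-Elo conversion used by rating lists."""
        return -400.0 * math.log10(1.0 / s - 1.0)

    lo, hi = elo(mean - 2 * se), elo(mean + 2 * se)
    print(f"40-game match: score {mean:.2f} +/- {2 * se:.2f} (2 sigma)")
    print(f"approximate Elo interval: {lo:+.0f} to {hi:+.0f}")

    # Games needed to shrink the 2-sigma margin to +/- 10 Elo.  Near
    # s = 0.5 the Elo curve has slope 1600 / ln(10) Elo per unit of
    # score, so solve 2 * sqrt(var / n) * slope = 10 for n.
    slope = 1600.0 / math.log(10)
    n = var * (2 * slope / 10) ** 2
    print(f"games for a +/- 10 Elo margin: about {n:.0f}")

Under these assumptions a single 40-game match pins the difference down only
to within roughly +/- 85 Elo, while a +/- 10 Elo margin would take on the
order of 3,000 games. That is exactly the "thousands of games" point from
above.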

Rolf


>
>This is especially the case because we know that it is not exactly a Gaussian
>distribution.


