Computer Chess Club Archives



Subject: Re: General Judgement about Positiontests and Testing as such

Author: Sandro Necchi

Date: 12:47:24 06/23/04



On June 23, 2004 at 08:30:50, Uri Blass wrote:

>On June 22, 2004 at 13:24:33, Sandro Necchi wrote:
>
>>On June 21, 2004 at 18:28:22, martin fierz wrote:
>>
>>>On June 21, 2004 at 13:50:11, Gian-Carlo Pascutto wrote:
>>>
>>>>On June 21, 2004 at 10:30:33, martin fierz wrote:
>>>>
>>>>>On June 20, 2004 at 02:56:08, Sandro Necchi wrote:
>>>>>
>>>>>>There is a simple way to verify if the "authors" are correct or not.
>>>>>>
>>>>>>They should state clearly how to evaluate all the solutions of the tests,
>>>>>>comparing the hardware to the SSDF's, in order to produce the Elo figure.
>>>>>>
>>>>>>Then, by choosing the next releases of 5 commercial programs which will be
>>>>>>tested by the SSDF, they have to predict the Elo for ALL 5 chess programs to
>>>>>>within +/- 10 points.
>>>>>>
>>>>>>Then an independent tester should run the tests.
>>>>>>
>>>>>>If they fail, then they lose.
>>>>>>
>>>>>>Sandro
>>>>>
>>>>>+-10 elo, you must be kidding!
>>>>>the SSDF results themselves have larger error margins than that...
>>>>
>>>>Yes, but the rating lists don't list error margins, and they rank programs with
>>>>differences smaller than 10 Elo.
>>
>>Hi Martin,
>>
>>>
>>>that has nothing to do with this discussion. if the SSDF rating list, with a
>>>very computing-time-intensive testing methodology, produces ratings with
>>>typically +-30 error bars, you cannot expect a simple test suite to be any
>>>better. so you have to allow it a +-30 margin of error too, except if you want
>>>to claim that the test suite is better than the SSDF list, which i believe not
>>>even the most hardcore promoters of test suites would do.
>>
>>This is not fully correct, because the more games you play in the SSDF list, the
>>more the error margin decreases. However, if you take a look, after the first Elo
>>figure is achieved, 95% of the programs, if not more, do not change their Elo by a
>>high margin (+/- 10 points). So, if what the authors state is true, that these test
>>sets are able to estimate a program's strength, then they should be able to give a
>>reliable figure, should they not?
>
>A test suite cannot give an estimate that is wrong by no more than 10 Elo, because
>things like different time management and learning from previous searches in the
>game can alone change the rating by more than 10 Elo.

I have my own personal view, based on more than 25 years of experience with nearly
all chess programs that became available and very many experimental versions, but in
this case I am trying to simulate a customer... a normal customer who wants to know
if the new program version is better than the previous one.

So he can:

1. Play test matches between 2 or more programs to get an idea of how much one
version is stronger than another (see the sketch after this list).

2. Play against the new program and find out personally, but in this case he must
not be a weak player, as he would lose anyway.

3. Run this test set and see the result.
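
Just to make option 1 concrete: here is a rough sketch, in Python, of how a
head-to-head match score could be turned into an approximate Elo difference with the
usual logistic formula. The 40-game match and its result are invented numbers, purely
for illustration:

import math

def elo_diff(wins, draws, losses):
    # Approximate Elo difference implied by a match result,
    # using the standard logistic Elo model.
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games   # must stay strictly between 0 and 1
    return -400.0 * math.log10(1.0 / score - 1.0)

# Made-up example: the new version scores 23.5/40 against the old one
print(round(elo_diff(18, 11, 11)))   # roughly +61 Elo

Of course 40 games are far too few to trust such a number, which is exactly where the
error margins come in.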

Now, since someone claims that you can estimate a program's strength by running this
test set, how is that possible if the +/- figure is too wide?
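
To give an idea of what "too wide" means in games, here is a back-of-the-envelope
sketch in Python. The 35% draw rate and the 95% confidence level are my own
assumptions, not numbers from this thread; it only estimates roughly how many games a
match needs before its error bar shrinks below a given Elo difference:

import math

def games_needed(target_elo, draw_rate=0.35, z=1.96):
    # Variance of one game score (1, 0.5 or 0) between evenly matched programs.
    var_per_game = 0.25 - draw_rate / 4.0
    # Slope of the Elo curve versus score, taken near a 50% score.
    elo_per_score = 400.0 / (math.log(10) * 0.25)
    sigma_elo_one_game = elo_per_score * math.sqrt(var_per_game)
    # Smallest n such that z * sigma / sqrt(n) <= target_elo.
    return math.ceil((z * sigma_elo_one_game / target_elo) ** 2)

print(games_needed(10))   # on the order of 3000 games
print(games_needed(30))   # a few hundred games

So pinning a program down to +/- 10 points takes thousands of games, while +/- 30
takes only a few hundred, which is in line with the margins quoted above for the
SSDF list.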

SEVERAL PEOPLE HERE TALK AND TALK AND TALK, but do not make any proposal to
check this.

Come on, people, and show how you can prove your statements!

>>Now, since you think differently than me, what would be your proposal to find out
>>if they are correct or not?
>>
>>If you enlarge the Elo margin, the whole test would not be meaningful, as how can
>>one know if the new program version is better?
>>
>>Look at Fritz, just to give an example, and at how much it has increased in the
>>SSDF list from one version to the next. Can you verify that the new program is
>>better with a wider Elo margin?
>>
>>I do not think so.
>>
>>Mine is a proposal to find out, but if people prefer only to talk, and to be able
>>to say everything and its opposite, then there is no point in going on discussing
>>this matter.
>>
>>You see I like to solve problems and give solutions; I do not like to give only
>>words...
>>
>>>
>>>so now you have two numbers with error margins of +-30, which means that by
>>>error propagation their difference has a standard error of about 40 rating
>>>points (i.e. if you ran your own version of the SSDF list you would find rating
>>>differences up to 40 points between the two lists routinely).
>>>
>>>this shows that sandro's claim that the test suite should coincide with the SSDF
>>>by +-10 is ridiculous.
>>
>>If it is so, then make a better proposal... it is too easy to criticize...
>>
>>>i know i won't convince him, but i hope i can convince
>>>you ;-)
>>
>>You can convince me if you make a good proposal...
>>
>>What we are trying to find out is:
>>
>>1. Can a test set allow a user to estimate a program's strength?
>>2. If yes, how can we find out this is true?
>>3. It must be without too wide a margin, as otherwise it would not be meaningful. I
>>mean good enough to see the improvement between two program versions.
>
>
>A test may be good enough to see if A+1 is better than A and not good enough to
>see if A is better than B.

How do you know that, if the +/- figures are too wide?
Do you mean better at solving the test set, or better = stronger?

>
>The important question for me as a programmer is if A+1 is better than A and not
>the exact difference in rating points or how much better.

OK, I agree on this, but if the figure is too wide, are you sure of the result?
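
One way to answer "are you sure?" is to put an error bar on the match score itself. A
minimal sketch, again in Python and with invented match numbers, using a plain normal
approximation:

import math

def score_with_error(wins, draws, losses, z=1.96):
    # Match score and its ~95% confidence half-width (normal approximation).
    n = wins + draws + losses
    results = [1.0] * wins + [0.5] * draws + [0.0] * losses
    mean = sum(results) / n
    var = sum((r - mean) ** 2 for r in results) / (n - 1)
    return mean, z * math.sqrt(var / n)

# Made-up example: A+1 scores 55% over 100 games against A
mean, half = score_with_error(40, 30, 30)
print("score %.3f +/- %.3f" % (mean, half))
# If the interval still straddles 0.50, the match alone has not shown A+1 to be better.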

>
>Uri

Sandro



