Computer Chess Club Archives



Subject: Re: General Judgement about Positiontests and Testing as such

Author: Rolf Tueschen

Date: 12:58:56 06/23/04



On June 23, 2004 at 15:47:24, Sandro Necchi wrote:

>On June 23, 2004 at 08:30:50, Uri Blass wrote:
>
>>On June 22, 2004 at 13:24:33, Sandro Necchi wrote:
>>
>>>On June 21, 2004 at 18:28:22, martin fierz wrote:
>>>
>>>>On June 21, 2004 at 13:50:11, Gian-Carlo Pascutto wrote:
>>>>
>>>>>On June 21, 2004 at 10:30:33, martin fierz wrote:
>>>>>
>>>>>>On June 20, 2004 at 02:56:08, Sandro Necchi wrote:
>>>>>>
>>>>>>>There is a simple way to verify if the "authors" are correct or not.
>>>>>>>
>>>>>>>They should state clearly how to evaluate all the solutions of the tests
>>>>>>>comparing the hardware to the SSDF one, in order to create the Elo figure.
>>>>>>>
>>>>>>>Then by choosing the next release of 5 commercial programs which will be tested
>>>>>>>by SSDF they have to predict the Elo for ALL 5 chess programs with a + - of 10
>>>>>>>points.
>>>>>>>
>>>>>>>Then an independent tester should run the tests.
>>>>>>>
>>>>>>>If they fail, then they lose.
>>>>>>>
>>>>>>>Sandro
>>>>>>
>>>>>>+-10 elo, you must be kidding!
>>>>>>the SSDF results themselves have larger error margins than that...
>>>>>
>>>>>Yes, but the rating lists don't show their error margins, and they rank
>>>>>programs whose differences are smaller than 10 Elo.
>>>
>>>Hi Martin,
>>>
>>>>
>>>>that has nothing to do with this discussion. if the SSDF rating list, with a
>>>>very computing-time-intensive testing methodology, produces ratings with
>>>>typically +-30 error bars, you cannot expect a simple test suite to be any
>>>>better. so you have to allow it a +-30 margin of error too, except if you want
>>>>to claim that the test suite is better than the SSDF list, which i believe not
>>>>even the most hardcore promoters of test suites would do.
>>>
>>>This is not fully correct: the more games a program plays on the SSDF list,
>>>the smaller its error margin becomes. Also, once a program's first Elo figure
>>>is established, 95% of the programs, if not more, do not change their Elo by
>>>a large margin (+- 10 points). So if the authors' claim is correct that these
>>>test sets can estimate a program's strength, then they should be able to give
>>>a reliable figure, should they not?
>>
>>A test suite cannot give an estimate that is accurate to within 10 Elo,
>>because things like different time management and learning from previous
>>searches during the game can by themselves change the rating by more than 10
>>Elo.
>
>I have my own personal view, based on more than 25 years of experience with
>nearly all chess programs that became available and very many experimental
>versions, but in this case I am trying to simulate a customer... a normal
>customer who wants to know if the new program version is better than the
>previous one.
>
>so he can:
>
>1. Make test matches between 2 or more programs to get an idea of how much one
>version is stronger than another.


Sandro,

You are in a double bind. You ask as a customer, and the customer asks which
version is better: the new one he just bought????

I mean, are you serious about this? Why should someone ask such questions? Why
should someone do something to answer this question?

I will tell you this. If I buy the new version of my favorite program I
_e-x-p-e-c-t_ it to be stronger than the last version - - - for sure! If NOT I
wouldn't have bought it.

So, I know that it is stronger; how much stronger doesn't interest me. I want
to play against it, I want to train, I want to analyse my own games with it.
You are not a customer. You are an expert with a biased expert perception. Real
users don't ask such questions about which version is stronger. It's clear that
the newest one is stronger. Period.

:)






>
>2. Play against the new program and find out personally, but in this case he
>must not be a weak player, as he would lose anyway.
>
>3. Run this test set and see the result.
>
>Now, since some people claim that you can estimate a program's strength by
>running this test set, how is that possible if the +- figure is too wide?
>
>SEVERAL PEOPLE HERE TALK AND TALK AND TALK,  but do not make any proposal to
>check this.
>
>Come on people and show how you can prove your statements!
>
>>>Now, since you think differently than me, what would be your proposal to
>>>find out if they are correct or not?
>>>
>>>If you enlarge the Elo margin, the whole test would not be meaningful, as
>>>how could one then know if the new program version is better?
>>>
>>>Look at Fritz, just to give an example, and at how much it has increased in
>>>the SSDF list from one version to the next. Can you verify that the new
>>>program is better with a wider Elo margin?
>>>
>>>I do not think so.
>>>
>>>Mine is a proposal to find out, but if people prefer only to talk and to be
>>>able to say everything and its opposite, then there is no point in
>>>discussing this matter further.
>>>
>>>You see, I like to solve problems and give solutions; I do not like to give
>>>only words...
>>>
>>>>
>>>>so now you have two numbers with error margins of +-30, which means that by
>>>>error propagation their difference has a standard error of about 40 rating
>>>>points (i.e. if you ran your own version of the SSDF list you would find rating
>>>>differences up to 40 points between the two lists routinely).
>>>>
>>>>this shows that sandro's claim that the test suite should coincide with the SSDF
>>>>by +-10 is ridiculous.
>>>
>>>If it is so, then make a better proposal... it is too easy to criticize...
>>>
>>>>i know i won't convince him, but i hope i can convince
>>>>you ;-)
>>>
>>>You can convince me if you make a good proposal...
>>>
>>>What we are trying to find out is:
>>>
>>>1. Can a test set allow a user to estimate a program's strength?
>>>2. If yes, how can we find out this is true?
>>>3. It must not have too high a margin, as otherwise it would not be
>>>meaningful. I mean good enough to see the improvement between two program
>>>versions.
>>
>>
>>A test may be good enough to see if A+1 is better than A and not good enough to
>>see if A is better than B.
>
>How do you know that if the +- figures are too wide?
>Do you mean better at solving the test set, or better = stronger?
>
>>
>>The important question for me as a programmer is if A+1 is better than A and not
>>the exact difference in rating points or how much better.
>
>OK, I agree on this, but if the figure is too wide, are you sure of the
>result?
>
>>
>>Uri
>
>Sandro
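
The two bits of arithmetic running through this thread can be sketched in a few
lines. This is only an illustration of the standard Elo formulas, not anything
the posters themselves wrote; the +-30 margins come from the thread, while the
60% example score is made up here:

```python
import math

# Martin's error-propagation point: two ratings, each with the +-30
# Elo margin quoted above. Independent errors add in quadrature, so
# the margin of the *difference* between the ratings is larger than
# either margin alone.
margin_diff = math.sqrt(30.0**2 + 30.0**2)
print(round(margin_diff, 1))  # 42.4 Elo, i.e. "about 40 rating points"

# Sandro's option 1, a test match between two programs. The standard
# Elo model predicts a score of 1 / (1 + 10 ** (-diff / 400)), which
# inverts to the rating difference implied by an observed match score:
def elo_diff_from_score(score: float) -> float:
    """Rating difference implied by a match score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

print(round(elo_diff_from_score(0.60), 1))  # a 60% score implies ~70.4 Elo
```

This also illustrates why short matches settle little: the implied Elo
difference moves with the score, so a few games' swing changes the figure by
far more than 10 points.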




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.