Computer Chess Club Archives


Subject: Re: General Judgement about Positiontests and Testing as such

Author: Sandro Necchi

Date: 22:47:33 06/23/04

On June 23, 2004 at 15:58:56, Rolf Tueschen wrote:

>On June 23, 2004 at 15:47:24, Sandro Necchi wrote:
>
>>On June 23, 2004 at 08:30:50, Uri Blass wrote:
>>
>>>On June 22, 2004 at 13:24:33, Sandro Necchi wrote:
>>>
>>>>On June 21, 2004 at 18:28:22, martin fierz wrote:
>>>>
>>>>>On June 21, 2004 at 13:50:11, Gian-Carlo Pascutto wrote:
>>>>>
>>>>>>On June 21, 2004 at 10:30:33, martin fierz wrote:
>>>>>>
>>>>>>>On June 20, 2004 at 02:56:08, Sandro Necchi wrote:
>>>>>>>
>>>>>>>>There is a simple way to verify if the "authors" are correct or not.
>>>>>>>>
>>>>>>>>They should state clearly how to evaluate all the solutions of the tests,
>>>>>>>>taking into account the hardware relative to the one used by the SSDF, in
>>>>>>>>order to produce the Elo figure.
>>>>>>>>
>>>>>>>>Then, taking the next releases of 5 commercial programs which will be tested
>>>>>>>>by the SSDF, they have to predict the Elo for ALL 5 chess programs to within
>>>>>>>>+/- 10 points.
>>>>>>>>
>>>>>>>>Then an independent tester should run the tests.
>>>>>>>>
>>>>>>>>If they fail, then they lose.
>>>>>>>>
>>>>>>>>Sandro
>>>>>>>
>>>>>>>+-10 elo, you must be kidding!
>>>>>>>the SSDF results themselves have larger error margins than that...
>>>>>>
>>>>>>Yes, but the rating lists don't show the error margins, and they rank
>>>>>>programs separated by less than 10 Elo.
>>>>
>>>>Hi Martin,
>>>>
>>>>>
>>>>>that has nothing to do with this discussion. if the SSDF rating list, with a
>>>>>very computing-time-intensive testing methodology, produces ratings with
>>>>>typically +-30 error bars, you cannot expect a simple test suite to be any
>>>>>better. so you have to allow it a +-30 margin of error too, except if you want
>>>>>to claim that the test suite is better than the SSDF list, which i believe not
>>>>>even the most hardcore promoters of test suites would do.
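
As a rough back-of-the-envelope illustration of why +-30 is already a typical
error bar (assuming independent games against roughly equal opposition and, for
simplicity, no draws, which in reality make the error somewhat smaller), the 95%
error margin of a measured rating shrinks only with the square root of the
number of games. A short Python sketch, with a purely illustrative helper name:

import math

def elo_error_95(n_games, score=0.5):
    """Approximate 95% error margin (in Elo) of a rating measured from n_games."""
    # Expected score s and Elo difference D are related by
    #   s = 1 / (1 + 10 ** (-D / 400)),
    # so a small change in score maps to Elo via the derivative dD/ds:
    dD_ds = 400.0 / (math.log(10) * score * (1.0 - score))
    se_score = math.sqrt(score * (1.0 - score) / n_games)  # std. error of mean score
    return 1.96 * dD_ds * se_score

for n in (100, 300, 1000, 5000):
    print(f"{n:5d} games -> about +/- {elo_error_95(n):3.0f} Elo")
# roughly +/-68 Elo after 100 games, +/-39 after 300, +/-21 after 1000,
# and only after about 5000 games does the margin come down to +/-10.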
>>>>
>>>>This is not fully correct, because the more games you play in the SSDF list,
>>>>the smaller the error margin becomes. Moreover, if you take a look, once the
>>>>first Elo figure has been established, 95% of the programs, if not more, do
>>>>not change their Elo by a large margin (+/- 10 points). So if what the authors
>>>>state is true, that these test sets are able to estimate a program's strength,
>>>>then they should be able to give a reliable figure, should they not?
>>>
>>>A test suite cannot give an estimate that is off by no more than 10 Elo,
>>>because things like different time management and learning from previous
>>>searches in the game can alone change the rating by more than 10 Elo.
>>
>>I have my own personal view, based on more than 25 years of experience with
>>nearly all chess programs that became available and very many experimental
>>versions, but in this case I am trying to simulate a customer... a normal
>>customer who wants to know if the new program version is better than the
>>previous one.
>>
>>so he can:
>>
>>1. Make test matches between 2 or more programs to get an idea of how much one
>>version is stronger than another.
>
>
>Sandro,
>
>you are in a double bind. You ask as a customer and the customer asks which
>version is better, the new one he just bought????
>
>I mean, are you serious about this? Why should someone ask such questions? Why
>should someone do something to answer this question?

Well, if the WM test set can predict a program's strength, which I do not
believe, then it could be used to find this out quickly.

>
>I will tell you this. If I buy the new version of my favorite program I
>_e-x-p-e-c-t_ it to be stronger than the last version - - - for sure! If NOT I
>wouldn't have bought it.

OK, but some people do not want to discover this AFTER they have bought it!

>
>So, I know that it is stronger, how much doesn't interest me. I want to play
>against it, I want to train, I want to analyse my own games with it.

OK, but if you use it as a sparring partner and a sort of tutor, then the
stronger it is, the better a sparring partner/tutor it should be!


>You are not a customer.

I said I was trying to simulate a customer...


>You are an expert with a biased expert perception. Real users don't
>have such questions, which version is stronger.

Well, the reason the majority of people do not buy new versions often is that
they think the versions are about the same, or that the one they have is
considered good enough.

The majority of the people I know want a stronger version and do not like to
wait too long to get it, so they "need" to know it is better/stronger.


>It's clear that the newest are stronger. Period.

Unfortunately it is not always so, as many users/chess fans have found out.
>
>:)

Sandro
>
>
>
>
>
>
>>
>>2. Play against the new program and find out personally, but in this case he
>>must not be a weak player, as he would lose anyway.
>>
>>3. Run this test set and see the result.
>>
>>Now, since some people claim that you can estimate a program's strength by
>>running this test set, how is that possible if the +/- figure is too wide?
>>
>>SEVERAL PEOPLE HERE TALK AND TALK AND TALK,  but do not make any proposal to
>>check this.
>>
>>Come on people and show how you can prove your statements!
>>
>>>>Now, since you think differently than I do, what would be your proposal to
>>>>find out whether they are correct or not?
>>>>
>>>>If you enlarge the Elo margin, the whole test would not be meaningful, as how
>>>>could one then know if the new program version is better?
>>>>
>>>>Look at Fritz, just to give an example, and at how much it has increased in
>>>>the SSDF list from one version to the next. Can you verify whether the new
>>>>program is better with a wider Elo margin?
>>>>
>>>>I do not think so.
>>>>
>>>>Mine is a proposal to find out, but if people prefer only to talk, and to be
>>>>able to say everything and its opposite, then there is no point in continuing
>>>>to discuss this matter.
>>>>
>>>>You see I like to solve problems and give solutions; I do not like to give only
>>>>words...
>>>>
>>>>>
>>>>>so now you have two numbers with error margins of +-30, which means that by
>>>>>error propagation their difference has a standard error of about 40 rating
>>>>>points (i.e. if you ran your own version of the SSDF list you would find rating
>>>>>differences up to 40 points between the two lists routinely).
>>>>>
>>>>>this shows that sandro's claim that the test suite should coincide with the SSDF
>>>>>by +-10 is ridiculous.
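
To spell the error-propagation step out: independent errors add in quadrature,
so the difference of two figures that each carry a +-30 error has a standard
error of sqrt(30^2 + 30^2), roughly 42 Elo. A one-line check in Python:

import math
# independent errors add in quadrature: error of (rating_1 - rating_2)
print(math.sqrt(30**2 + 30**2))   # ~42.4, i.e. the "about 40 rating points" above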
>>>>
>>>>If that is so, then make a better proposal... it is too easy just to criticize...
>>>>
>>>>>i know i won't convince him, but i hope i can convince
>>>>>you ;-)
>>>>
>>>>You can convince me if you make a good proposal...
>>>>
>>>>What we are trying to find out is:
>>>>
>>>>1. Can a test set allow a user to estimate a program's strength?
>>>>2. If yes, how can we find out whether this is true?
>>>>3. It must be without too wide a margin, as otherwise it would not be
>>>>meaningful. I mean good enough to see the improvement between two program
>>>>versions.
>>>
>>>
>>>A test may be good enough to see if A+1 is better than A and not good enough to
>>>see if A is better than B.
>>
>>How do you know that if the +/- figures are too wide?
>>Do you mean better at solving the test set, or better = stronger?
>>
>>>
>>>The important question for me as a programmer is whether A+1 is better than A,
>>>not the exact difference in rating points or how much better it is.
>>
>>OK, I agree on this, but if the figure is too wide are you sure of the result?
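
One way to make this concrete (a rough sketch of a possible approach, assuming
independent games and a simple normal approximation; the helper below is purely
illustrative, not something proposed in the thread): estimate from a direct
match the probability that the new version really is the stronger one, and see
how much uncertainty remains even after a couple of hundred games.

import math

def likelihood_of_superiority(wins, losses, draws):
    """Approximate probability that the first program is the stronger one."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n          # draws counted as half a point
    var = (wins * (1.0 - score) ** 2          # per-game score variance
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)                   # standard error of the mean score
    z = (score - 0.5) / se
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# 55 wins, 45 losses, 100 draws: a 52.5% score over 200 games, yet only
# about an 84% chance that the new version is actually stronger.
print(likelihood_of_superiority(55, 45, 100))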
>>
>>>
>>>Uri
>>
>>Sandro


