Computer Chess Club Archives



Subject: Re: General Objection Against CEGT Stats

Author: Rolf Tueschen

Date: 05:44:37 12/07/05



On December 07, 2005 at 08:24:33, George Tsavdaris wrote:

>On December 07, 2005 at 08:04:54, Rolf Tueschen wrote:
>
>>In other words, you never know exactly what you are really testing. Here in CEGT
>>it would be far better if you tested among the 500 amateurs. Then you would get a
>>ranking over time. But to test how a new engine like Rybka would do against
>>SHREDDER or FRITZ or CHESSMASTER, you must set up a different test. For that
>>question, watching all the results of these 500 engines is only disturbing
>>noise.
>>
>
>Statement-A:
>"Look at this: say these three top acts are incredibly stronger in chess
>strength than all the other 500 (which is apparently NOT the case in CEGT!). Then
>what are you testing in such short matches of 20 or so games? Are you really
>testing chess strength? I don't think so."
>
>
>First: You said it yourself: "which is apparently NOT the case in CEGT!"
>So it is logical to do the testing this way.
>
>Second: Since all top engines play against each other and also against
>weaker ones (we care only about the former in this case), the testing of engine
>strength is correct, and we do not have the situation described in your statement-A.
>
>Third: Since all non-top (weaker) engines play against each other and also against
>the top ones (we care only about the former in this case), the testing of engine
>strength is correct, and we do not have the situation described in your statement-A.



Take Kasparov. Let him play 500 opponents rated 1500-2200 Elo. Judged by his
strength, he would normally score 100%. But due to chance and other factors that
are more or less irrelevant, he gets "only" 95%.
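
To put rough numbers on the Kasparov example, here is a minimal sketch using the standard Elo expected-score formula. The 2800 rating for the top player is an assumption made for illustration, not a figure from this thread, and the opponent ratings are simply points within the 1500-2200 range mentioned above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Assumed rating for the top player (hypothetical, for illustration only).
TOP_RATING = 2800

if __name__ == "__main__":
    # Sample opponents from the 1500-2200 range used in the example.
    for opponent in (1500, 1800, 2200):
        score = expected_score(TOP_RATING, opponent)
        print(f"vs {opponent}: expected score {score:.2%}")
```

Under these assumptions the model predicts a score very close to 100% across the whole range, so a few lost points tell you little about chess strength.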

Also: my qualification was NOT meant to imply that a reasonable number of amateur
programs would come close to SHREDDER or FRITZ. What I meant was that a few
could make a reasonable match. But my argument is being totally ignored: the
other 495 programs, which are absolutely out of reach, undermine the
significance of the test results through their irrelevance.

I thought it was clear that we were discussing chess strength and NOT the
stability of the engines over longer testing from a purely technical point of view.
I don't know how to make it clearer. If you normally expect a 100% result, 22-0,
then it is no aberration if you get 19-3 due to hardware failure or similar mere
artifacts, as we call them in statistics.
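
As a rough illustration of how much such artifacts can move a rating, the sketch below inverts the standard Elo expected-score formula to get the rating gap implied by a match result. The function name is hypothetical, draws are ignored, and a perfect score is treated as an unbounded gap. These are simplifications for illustration, not a description of how CEGT computes its lists.

```python
import math


def implied_rating_gap(wins: int, losses: int) -> float:
    """Rating gap implied by a win/loss record under the standard Elo model.

    Draws are ignored in this sketch. A perfect score has no finite
    implied gap, so it is reported as infinity.
    """
    score = wins / (wins + losses)
    if score >= 1.0:
        return math.inf
    if score <= 0.0:
        return -math.inf
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    print("22-0:", implied_rating_gap(22, 0))
    print("19-3:", round(implied_rating_gap(19, 3)), "Elo")
```

Three games lost to a hardware failure are enough to turn an unbounded gap into a finite one of a few hundred points, although the difference in chess strength has not changed.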




Last modified: Thu, 15 Apr 21 08:11:13 -0700
