Computer Chess Club Archives


Subject: Re: Rating

Author: Eelco de Groot

Date: 04:55:25 01/20/06


On January 19, 2006 at 14:31:33, M Hurd wrote:

>On January 19, 2006 at 09:39:00, Eelco de Groot wrote:
>
>>On January 19, 2006 at 09:05:03, M Hurd wrote:
>>
>>>On January 19, 2006 at 08:52:00, Ricardo Gibert wrote:
>>>
>>>>On January 19, 2006 at 08:36:03, M Hurd wrote:
>>>>
>>>>>On January 19, 2006 at 08:30:55, Ricardo Gibert wrote:
>>>>>
>>>>>>On January 19, 2006 at 08:11:54, M Hurd wrote:
>>>>>>
>>>>>>>If you play an engine match of 1000 games against 1 engine and play another
>>>>>>>match of 1 game each against 1000 engines, would you get the same rating ?
>>>>>>>
>>>>>>>Is it more important to play as many different engines as possible, or just
>>>>>>>to play a large number of games?
>>>>>>
>>>>>>Depends on what you are trying to measure: relative strength against one
>>>>>>particular engine, or general strength against engines in general.
>>>>>>
>>>>>>>
>>>>>>>Presumably there will be an optimum number for games and number of engines
>>>>>>>played.
>>>>>>
>>>>>>Theoretically, the optimal number approaches infinity in both cases. Naturally,
>>>>>>this has virtually no practical value. You will need to be more specific to get
>>>>>>a more useable response.
>>>>>>
>>>>>>>
>>>>>>>Regards
>>>>>>>
>>>>>>>Mike
>>>>>
>>>>>
>>>>>Hi Ricardo
>>>>>
>>>>>I was simply wondering what would likely be the ELO difference between the 2
>>>>>matches I outlined and which match would be the more accurate.
>>>>
>>>>Accurate in what sense? The 2 matches answer 2 different questions. What
>>>>precisely are you trying to measure? My guess is you want to measure general
>>>>playing strength rather than the relative strength between 2 particular engines.
>>>>If that is the case, given those choices, this isn't a close call. One game
>>>>against each of 1000 different engines is the way to go.
>>>>
>>>>Frankly, this ought to be obvious.
>>>>
>>>>>
>>>>>Regards
>>>>>
>>>>>Mike
>>>
>>>
>>>Frankly this is not obvious to me.
>>>
>>>If you play 1 game with 1 engine versus another you will get a result; however,
>>>this could be a win, loss or draw and tells you nothing. 1000 x nothing = nothing,
>>>whereas 1000 games against 1 engine should give a more confident rating.
>>>
>>>Regards
>>>
>>>Mike
>>
>>Hello Mike,
>>
>>That makes no difference; any game tells you just as much, no matter which
>>opponent it is. For the rating (the TPR rating in this case) you simply compute
>>the average result against the average rating of all the opponents.
>>
>>You get a better idea of the strength against all the different opponents if you
>>play some (or just one) game against many of them, not just against one.
>>That is because a rating is not a perfect predictor; some players will just have
>>bad results against some of the possible opponents, their Angstgegners if you
>>like. Also, the average opponent rating is a more dependable number than the
>>rating of just one member of the group (there is less uncertainty involved
>>because more games were played to compute the average).
>>
>>The situation is a bit more complex if the ratings of your opponents (programs)
>>are not very well known, or even unknown. Playing one or more games does not tell
>>you anything about rating then, only about the difference in rating between the
>>two. Therefore it becomes necessary to add to your tournament at least one, but
>>preferably more, opponents with a known rating, and let each of the unrated
>>players play against each other but also against the known ratings. Then you can
>>calculate all of the ratings with a successive approximation process.
>>
>>hope it makes some sense..
>>
>> Eelco
>
>
>Thanks for the explanation.
>
>Hypothetically speaking, Fritz plays Rybka 1000 times and a rating for Fritz is
>calculated based on the results of the games, assuming Rybka's rating is known.
>
>Fritz then plays 1 game against each of 1000 engines with known ratings and a
>rating is calculated. Which rating would be nearer to Fritz's likely rating, or
>would they be the same, hypothetically speaking?
>
>Regards
>
>Mike

Hello Mike,

The answer is still the same: if you want a rating that helps predict the
outcome against many different opponents, your method number two is much better.
Playing 1000 games against Rybka will tell you very precisely the strength of
Fritz versus Rybka, but that is not what you want to know. If Rybka plays exactly
as well against Fritz as against all other engines, then your two answers will be
almost the same, apart from chance deviations. The "Rock beats Scissors
beats Paper beats Rock" effect is usually small, but especially in the
computer chess world it happens that one program has good results against one
program but worse results against another, which you would not expect from the
ratings alone.
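
To make the "apart from chance deviations" point concrete, here is a minimal sketch of a performance rating computed from the average result against the average opponent rating. It assumes the common logistic Elo formula (official lists may use a lookup table instead), and the function name and all ratings and scores below are hypothetical illustration values, not real results.

```python
import math

def performance_rating(opponent_ratings, total_score):
    """Performance rating from the average opponent rating and the score fraction.

    Assumption: logistic Elo curve, i.e. rating difference = 400 * log10(s / (1 - s)).
    """
    games = len(opponent_ratings)
    avg_opp = sum(opponent_ratings) / games
    frac = total_score / games
    if not 0.0 < frac < 1.0:
        raise ValueError("undefined for a 0% or 100% score")
    return avg_opp + 400.0 * math.log10(frac / (1.0 - frac))

# Method 1: 1000 games against one opponent rated 2800 (hypothetical rating).
match_one = performance_rating([2800.0] * 1000, 500.0)

# Method 2: one game each against 1000 opponents whose ratings average 2800.
pool = [2800.0 - 499.5 + i for i in range(1000)]
match_two = performance_rating(pool, 500.0)

print(match_one, match_two)
```

With the same score and the same average opponent rating, both methods give the same number. They only differ in what the score itself reflects: in method 1 it includes any special like or dislike Fritz has for Rybka's style, in method 2 such effects tend to average out over the pool.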

A secondary effect in practice concerns the uncertainty in the opponents'
ratings. Suppose you know Rybka's rating within a small +/- range (in practice
maybe not the best example, since Rybka is still in the beta stage), and say all
1000 engines you could use, including Rybka, have each played a thousand games
in the SSDF Elo list. Their +/- ranges will then hardly change anymore, but the
average rating of the thousand engines is still more "steady" than the rating of
any single engine, because there is almost no statistical deviation in that
average. If the SSDF list consists of just these thousand engines, there is zero
statistical deviation in the average. But remember that this number, the average
SSDF rating, is only relative: if you want it to reflect strength against
humans, for instance, you have no other option than to play against humans too,
to calibrate the SSDF list against their human ratings.
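
As a rough illustration of why the average is "steadier", here is a small sketch of the standard error of an average rating. It assumes the individual rating errors are independent, which is only an approximation for engines rated within the same list, and the 20 Elo figure is a made-up example, not an SSDF value.

```python
import math

def std_error_of_average(individual_errors):
    """Standard error of the mean of ratings whose errors are assumed independent."""
    n = len(individual_errors)
    return math.sqrt(sum(e * e for e in individual_errors)) / n

single_engine_error = 20.0  # hypothetical +/- for one engine, in Elo
average_error = std_error_of_average([single_engine_error] * 1000)

print(single_engine_error, round(average_error, 2))
```

Under that assumption the uncertainty of the average shrinks with the square root of the number of engines, so a pool of 1000 opponents is a much firmer reference point than any single one of them.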

I still have the feeling I could express and understand this better, but maybe
someone else can do that! Thanks for the question!
Eelco






Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.