Computer Chess Club Archives

Subject: Re: an idea to evaluate rating of programs based on pgn file of their games

Author: Robert Hyatt
Date: 13:32:13 08/09/01
On August 09, 2001 at 11:09:48, Uri Blass wrote:

>On August 09, 2001 at 09:58:01, Robert Hyatt wrote:
>
>>On August 09, 2001 at 09:13:18, Graham Laight wrote:
>>
>>>On August 09, 2001 at 08:54:51, Uri Blass wrote:
>>>
>>>>My idea is the following:
>>>>
>>>>1) Download a PGN of 6 games of a program at 2 hours/40 moves (for example,
>>>>some of the SSDF games of Deep Fritz).
>>>>
>>>>2) Choose a program that you want to use to evaluate the rating of chess
>>>>programs (I am going to call it program X).
>>>>Here is how to use it to evaluate the rating of Deep Fritz.
>>>>
>>>>3) Let X search for 1 hour on every position in which Deep Fritz had to move.
>>>>4) Build a table with 2 columns, where the first column is the time in seconds
>>>>and the second column is the number of solutions (the number of positions in
>>>>which X suggests the same move as Deep Fritz).
>>>>
>>>>It should be something like the following:
>>>>time           number of solutions
>>>>0-1 second           347 solutions
>>>>1-2 seconds          372 solutions
>>>>2-3 seconds          374 solutions
>>>>...
>>>>60-61 seconds        431 solutions
>>>>...
>>>>500-501 seconds      440 solutions
>>>>...
>>>>3599-3600 seconds    411 solutions
>>>>
>>>>If 500-501 seconds gives the biggest number of solutions, then it seems that
>>>>500-501 seconds of X is equivalent to the tournament time control of Deep Fritz.
>>>>
>>>>It is possible to translate 500-501 seconds to a rating number and find a
>>>>rating for Deep Fritz (Athlon 1200).
>>>>Bigger numbers are better, and it is possible to assume a difference of 70 Elo
>>>>each time the number doubles.
>>>>
>>>>It is also possible to use X's searches to evaluate the rating of other
>>>>programs, including X, in the same way.
>>>>
>>>>I have some interesting questions:
>>>>
>>>>1) Do you expect the rating list based on this test, and not on results, to
>>>>be biased for X or against X?
>>>>
>>>>2) What is the estimated rating of programs including Deeper Blue, Deep Blue,
>>>>Cray Blitz, and Deep Thought based on this experiment?
>>>>
>>>>3) What is the estimated error that you expect to get in evaluating the rating
>>>>of programs in this way?
>>>
>>>At the risk of being negative, I think that, unfortunately, this experiment is
>>>likely to fail.
>>>
>>>Unless you can see all the way to the end of the game, you cannot say whether
>>>the move program X chose is better than the one DF chose.
>>>
>>>It might be just a matter of taste.
>>>
>>>It might be that both choices of move would win.
>>>
>>>It might be that Deep Fritz chose a poor move.
>>>
>>>DF might be better than X in some situations, but worse in others.
>>>
>>>I fear that, at the end of this experiment, the only result that you will obtain
>>>is the name of the program which is most similar in playing style to DF.
>>>
>>>-g
>>
>>
>>Very likely correct.  This is not an easy thing to do...  and trying to use
>>program X to predict the rating of program Y, based only on how many moves they
>>"match" looks statistically dangerous.
>
>Similar styles do not always mean stronger under my idea.
>I will give an example.
>Case A:
>X and Y agreed on less than 20% of the moves after 1, 2, 3, ..., 3599 seconds of
>search.
>X and Y agreed on 20% of the moves after 3600 seconds of search.
>
>X is going to evaluate Y as a very strong program because the maximal number of
>matches was achieved after 3600 seconds.

And this conclusion could be badly wrong.  I.e., what about the many cases we
have posted here where program X finds a move instantly due to positional
scores, while program Y (which is clearly much stronger) takes much longer to
find it for tactical reasons?

I find it mind-numbing to think about trying to draw conclusions from such
stuff.  Because there are _lots_ of other reasons besides pure strength that
will cause the programs to match or disagree.



>
>Case B:
>X and Z agreed on 100% of the moves after 1 second
>X and Z agreed on 80% of the moves after 3600 seconds
>
>X is going to evaluate Z as a weak program because the maximal number of matches
>was achieved after 1 second.
>
>Uri



And again, it could be wrong.  Z might have lots of positional knowledge that
makes it match X even though X relied on search to find the moves.  But Z might
also have a much better search and at longer time controls it finds different
(and maybe better) moves for either positional or tactical reasons.  Again it
would be hard to decide why it differed...
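
For readers who want to see the arithmetic of the proposal being discussed, here is a minimal Python sketch of the scoring rule as Uri describes it: pick the search time at which X matches the most moves, then convert that time to a rating offset at 70 Elo per doubling. The function names, the 1-second reference time, and the tie-breaking rule (shortest time wins) are assumptions made for illustration and are not part of the original posts. The table values are the example figures quoted above, keyed by the upper bound of each time bucket.

```python
import math


def best_time_bucket(matches_by_time):
    """Return the search time (seconds) at which X matched the most moves.

    matches_by_time maps a bucket's upper bound in seconds to the number
    of positions where X chose the same move as the program under test.
    Ties go to the shortest time (an assumption, not from the thread).
    """
    return min(matches_by_time, key=lambda t: (-matches_by_time[t], t))


def elo_offset(seconds, reference_seconds=1.0, elo_per_doubling=70.0):
    """Rating offset relative to X searching for reference_seconds.

    Uses the thread's assumption of 70 Elo for each doubling of time.
    The 1-second reference point is hypothetical.
    """
    return elo_per_doubling * math.log2(seconds / reference_seconds)


# Example figures from the table in the quoted post.
table = {1: 347, 2: 372, 3: 374, 61: 431, 501: 440, 3600: 411}
peak = best_time_bucket(table)
print(peak, round(elo_offset(peak)))

# Case A and Case B from the thread, as match percentages.
case_a = {1: 10, 3600: 20}   # "less than 20%" is shown as 10 here
case_b = {1: 100, 3600: 80}
print(best_time_bucket(case_a), best_time_bucket(case_b))
```

Note that the rule looks only at where the peak falls, not at how high it is, so Case A (a 20% peak at 3600 seconds) is rated far above Case B (a 100% peak at 1 second). That is the property the objections above are aimed at.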