Computer Chess Club Archives



Subject: Re: Calculating Computer Ratings????

Author: Shaun Graham

Date: 21:57:23 08/02/98

On August 02, 1998 at 09:49:34, Robert Hyatt wrote:

>On August 01, 1998 at 01:14:57, Shaun Graham wrote:
>
>>
>>>Well if you would assign a rating of 2400 to all of your opponents, why wouldn't
>>>you do this to your program?
>>
>>In fact you do first assign your program a rating of 2400, but then you see how
>>your program, as a 2400, has performed against all the other 2400s to get its
>>new rating.
>>
>>
>> I admit that when the rating system was first
>>>started, the ratings had to be assigned for at least 1 program to start it off,
>>>but now that we have established ratings (SSDF for example) why would we need to
>>>assign ratings?
>>
>>You need to assign ratings only because the attempt here is to make the rating
>>calculation more accurate.  For instance, you will have a hard time convincing
>>almost any of the computer aficionados here that Fritz is 2580+ Elo.  In fact
>>the SSDF makes it quite clear that the SSDF rating doesn't necessarily
>>correspond to human Elo.  So reassigning the rating is simply a first step in
>>normalizing the ratings with human Elos so that computer ratings and human
>>ratings are comparable (currently they are not).
>
>
>your point is valid, but the 2400 seems wrong.  IE what convinces you that you
>should start there?  For example, take an 1800 program and start it there and
>notice what it does to the other program ratings? They go up, but they should
>not.  So you end up with what is commonly called "rating inflation."  That's
>why most rating systems have a provisional period to provide a better estimate
>on the rating based on results against those with "known" ratings.


Indeed, if you played an 1800-strength program there would be some rating
inflation.  However, for what concerns us here, namely the so-called "top
programs", we should test them only against other top programs, as is pretty
much the practice of the SSDF, or at least only include their performances
against the top programs in our calculations.  Yes, this would seem to be a
discriminatory practice, but if our GOAL is to come to more accurate
conclusions about the strength of the CURRENT top programs, then it is a
perfectly acceptable method.  The purpose is to evaluate a computer program's
strength against proven opponents (a relatively known and trustworthy
variable).  Currently it is commonly accepted by most and, amazingly, apparently
even by you, that the "top programs" are at least 2400 strength.  This accepted
minimum strength (i.e. 2400) of the top programs provides a strong foundation
on which to test for and produce a relatively accurate rating (all ratings
being, of course, only relatively accurate, even for humans).
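
To make the idea concrete, here is a minimal sketch in Python (the 40 games and
the 60% score are invented placeholders) of how a program seeded into a pool of
roughly 2400-rated opponents could have its rating re-derived from its actual
results rather than carried forward incrementally:

import math

def performance_rating(opponent_ratings, total_score):
    # Estimate a rating from the game results alone: the average opponent
    # rating plus the rating gap implied by the scoring percentage under
    # the standard logistic Elo curve.
    n = len(opponent_ratings)
    avg_opp = sum(opponent_ratings) / n
    p = min(max(total_score / n, 0.01), 0.99)   # keep the log finite at 0% or 100%
    return avg_opp + 400 * math.log10(p / (1 - p))

# A program scoring 24/40 (60%) against a pool of 2400-rated programs comes
# out around 2470, regardless of the rating it was seeded with.
print(round(performance_rating([2400] * 40, 24)))

The seed value only matters for choosing which opponents it faces; the number
it ends up with comes from the results.
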
>
>The only way to get reasonable Elo ratings for programs is to play them against
>humans, and not against each other.

I know that this type of statement seems intuitively correct.  The fault I
find with it is this: in testing for a rating it is best to test against
multiple opponents, and not all opponents will spot a weakness, or necessarily
take advantage of it in the same capable way.  The reason you can get a
relatively good rating is that you are testing against multiple styles of play
from different opponents.  This is slightly difficult to grasp conceptually,
so I will provide an example.  I have an opponent who is rated lower than me,
but this particular opponent has a style that causes me particular difficulty,
and he beats me almost all the time.  Yet I perform in tournaments to a degree
he has never come close to.  So he is like your computer that always spots the
weakness, but my play against multiple opponents counteracts this effect on my
rating.  If, however, you have a program that loses to all (or almost all)
opponents, then regardless of the reason you give, ultimately that program
lost because it is weaker, and thus its rating will drop.  Human players of
comparable strength to the winning computer would quite likely take advantage
of the weakness as well.

> Computer vs Computer is a vastly different
>game than computer vs human.  You can take any program, make a fairly serious
>change to the eval, or to the search extensions, and not see much difference
>against a pool of humans.

If this is a weakness which has been induced such that it is always taken
advantage of by computers, it can be taken advantage of by quite a few humans
as well; and if it isn't, then the program is still stronger overall than the
human it played.

>But a strong computer opponent will quite quickly
>"home in" on such a problem and make you lose game after game.
>
>Ie in the famous first paper on singular extensions, Hsu and company reported a
>really significant rating change, when comparing DT with SE to DT without.  They
>later noticed that the difference was way over-exaggerated, because the only
>difference between the two programs was SE.  Their last paper suggested that SE
>was a much more modest improvement.

I'm not certain who or what they were testing against to get the rating, but
if the testing was done only against one or two opponents (which I strongly
suspect), then this is where the error lies.  It's just like the weaker player
I mentioned who always beats me: if you based his rating strictly on the games
between us, his rating would be over 2250 (hundreds of points stronger than
his real strength).
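
To put rough numbers on that 2250 figure (my own rating and the percentages
below are only hypothetical placeholders), the same logistic Elo relation shows
how a lopsided head-to-head score turns into a rating estimate hundreds of
points above an opponent's real strength:

import math

def implied_gap(score_pct):
    # Rating difference implied by a scoring percentage under the Elo curve.
    return 400 * math.log10(score_pct / (1 - score_pct))

# Hypothetical: I'm rated about 1900 and he scores 90% against me.
print(round(1900 + implied_gap(0.90)))   # about 2282 from our games alone
# His 45% score against a varied pool of 1900-rated opponents tells the
# real story.
print(round(1900 + implied_gap(0.45)))   # about 1865
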
>
>If I simply took crafty as 2,000 Elo, for version 1, and then played each
>successive version against the previous one, and used the traditional Elo rating
>calculation, I would now be somewhere around 5500+.  Because minor changes make
>major differences in the results between A and B, yet do very little in A vs H,
>where H is a human.

This would be far from the case, however, if you placed it in a pool with all
the top programs.
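
Bob's point about chaining successive versions is easy to reproduce in a toy
simulation (Python; the 65% match score, K=16, 40-game matches, and 30 versions
are all invented for illustration), and it is exactly the behavior that
re-deriving ratings against a fixed pool of top programs avoids:

def elo_update(r_a, r_b, score, k=16):
    # Standard incremental Elo update for player A after one game.
    expected = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score - expected)

rating = 2000.0                  # seed rating for "version 1"
for version in range(2, 31):     # 29 successive releases
    prev, new = rating, rating   # each new version starts at the old rating
    for _ in range(40):          # a 40-game match; 0.65 stands in for an
        new = elo_update(new, prev, 0.65)   # average 65% result per game
    rating = new                 # carry the inflated number forward

print(round(rating))   # climbs by dozens of points per release, with no ceiling
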
>
>
>
>
>>
>> Because programs only learn to avoid certain lines, they really
>>>don't learn like humans anyway so no rating system will make their ratings like
>>>human ratings. Besides the SSDF list is only good for comparative purposes.
>>
>>That's the problem: it's not good for comparative purposes.  I wish it was;
>>I'm sure you have seen my discussions on here demonstrating how Fritz is GM
>>strength (which it is).  However, apparently it's difficult to show that using
>>the current SSDF system, because OBVIOUSLY many people don't accept it.  If
>>they did, when I said Fritz is GM strength because its Elo is 2589, there
>>would be no disagreement.
>>
>
>
>the problem is that SSDF has too much inbreeding in the ratings.  And no one
>has ever taken the time, nor gone to the expense, to enter a computer into FIDE
>tournaments (FIDE membership is possible, but finding a tournament that would
>allow computers might be much more difficult).  So it is quite possible that
>fritz, at 2580, is 400 points better than the fidelity mach III at 2180.  But
>would that hold up in human events?  I doubt it.  I suspect Fritz would lose
>more than expected, and the Mach III would win more than expected.  For the
>reasons I gave above.
>

As for Fritz, it might do better or worse than a 2580 Elo; it depends on lots
of factors, such as tournament type.  Fritz would do very well in a
Swiss-system tournament.  It would still be 2500 Elo at the least, I'm
certain.  Playing in human tournaments would be quite beneficial, though I
would always much prefer data from Swiss events to invitationals.
>
>
>
>>
>>You
>>>are attaching too much importance to the isolated rating number.
>>
>>No, I'm not.  Ratings are all-important; they are the only way to show the
>>strength of computers relative to human strength.  Thus it is very important
>>to isolate a VALID rating for a program, firstly so that you can know how
>>computers really compare to humans, and secondly so that we can gauge exactly
>>how far along the evolutionary track programs are.
>>
>
>
>
>correct, but not easily doable.  IE computer vs computer has *nothing* to do
>with computers in the human-chess-tournament world.

Well, I would agree that one computer vs. a single other computer wouldn't
mean much, but against multiple styles of programs I believe you can garner a
rating of relatively strong reliability.

> Because it is all about
>statistics, and given two different "pools" of players, the absolute ratings
>would vary significantly, and the spread would vary as well, because the
>expected outcome of computer vs computer is different than computer vs human.

This statistical outcome is only different because of the point this thread is
making: that the current system of calculating computer ratings incrementally,
as is done for humans, isn't accurate for computers.  I believe the procedure
I have outlined is a fair degree more accurate and would be a step in the
right direction toward making computer ratings have the same statistical
meaning against human populations.  There is a problem of human bias, though,
that I won't go into in much detail, such as the fact that I can beat CM 4000
Turbo a higher percentage of the time than average because, having beaten it
once, I can often repeat the same game; there is some possibility of this in
tournament play for computers.  There is also anti-computer chess as opposed
to regular chess play, though I'm starting to suspect that, for the
non-grandmaster, attempts at anti-computer chess will garner more losses than
wins.
>
>Fritz is ideally suited to play other computers.  Very fast, very deep.  But
>I'd expect it to do worse than some other programs against a group of GM
>players.  Anand was an example. Shredded Fritz trivially, had to work to beat
>Rebel in the two slow games.  Yet I'm sure that fritz will beat Rebel in a
>match, as has been seen on SSDF.

Well, Fritz didn't get to play any 40/2 games, so I don't know how it would
have done.  I would, though, point out that I have read a quote from Anand
saying he plays Fritz all the time.  When I play weaker opponents at the club
who for some reason don't want to change the way they are playing, my win
percentage increases; Anand had this advantage with Fritz.
>

I'm personally beginning to think that Chessmaster, tested on the new faster
hardware, is the strongest program.  It has no optimized books and isn't tuned
against other programs, and yet it just beat Rebel 7 to 2 in a 40/2 match on
Leubke's page.  That's besides my own testing, which bears out pretty much the
same result, though not quite, I'd think, 7 to 2.

>I'm more interested in the computer vs human games, but I do pay attention to
>computer vs computer when possible...
>
>
>> Ratings abhor a
>>>vacuum. You need lots of competitors to have a good system and the SSDF is a
>>>closed shop.
>>
>>No, they are not a closed shop, as the data is readily available to be
>>examined and recalculated by anyone with the inclination.  They have no
>>stranglehold on the knowledge of how to calculate ratings, and if you look at
>>another of the follow-ups to this post, you will find that the SSDF is in
>>fact instituting a plan similar to the one I have suggested (recalculating
>>from scratch, not incrementally).
>>
>
>but it has the same problem..  because someone will still assume that Elo 2500
>is the same as SSDF 2500.  And it still won't be so, until the games come from
>players in a *common population*...
>
No, it's not a problem, because on a corrected system an SSDF 2500 would be
relatively equivalent to an Elo 2500.  Games within a common population would,
more than likely, make it even more accurate, but even without that you can
still come to a relatively accurate rating.
>
>
>
>>
>>Shaun


