Computer Chess Club Archives



Subject: Re: Calculating Computer Ratings????

Author: Robert Hyatt

Date: 07:16:00 08/03/98



On August 03, 1998 at 00:57:23, Shaun Graham wrote:

>On August 02, 1998 at 09:49:34, Robert Hyatt wrote:
>
>>On August 01, 1998 at 01:14:57, Shaun Graham wrote:
>>
>>>
>>>>Well if you would assign a rating of 2400 to all of your opponents, why wouldn't
>>>>you do this to your program?
>>>
>>>In fact you do first assign your program a rating of 2400, but then you see how
>>>your program, as a 2400, has performed against all other 2400s to get the new
>>>rating.
>>>
>>>
>>>> I admit that when the rating system was first
>>>>started, the ratings had to be assigned for at least 1 program to start it off,
>>>>but now that we have established ratings (SSDF for example) why would we need to
>>>>assign ratings?
>>>
>>>You need to assign ratings only because the attempt here is to make the rating
>>>calculation more accurate.  For instance, you will have a hard time convincing
>>>almost any of the computer aficionados here that Fritz is 2580+ Elo.
>>> In fact the SSDF makes it quite clear that the SSDF rating doesn't necessarily
>>>correspond to human Elo.  So reassigning the rating is simply a first step in
>>>normalizing the ratings against human Elo so that computer ratings and human
>>>ratings are comparable (because currently they are not).
>>
>>
>>your point is valid, but the 2400 seems wrong.  IE what convinces you that you
>>should start there?  For example, take an 1800 program and start it there and
>>notice what it does to the other programs' ratings: they go up, but they should
>>not.  So you end up with what is commonly called "rating inflation."  That's
>>why most rating systems have a provisional period to provide a better estimate
>>of the rating based on results against those with "known" ratings.
>
>
>Indeed, if you played an 1800-strength program, there would be some rating
>inflation.  However, for what concerns us here, that being the so-called "top
>programs", we should test them only against other top programs, as is pretty
>much the practice of the SSDF, or at least only include their performances
>against the top programs in our calculations.  Yes, this would seem to be a
>discriminatory practice; however, if it is our GOAL to come to more accurate
>conclusions about the strength of the CURRENT top programs, then this is a
>perfectly acceptable method to do so.  This is for the purpose of evaluating a
>computer program's strength vs a proven opponent (a relatively known and
>trustworthy variable).  Currently it is commonly accepted by most, and amazingly
>and apparently by even you, that the "top programs" are at least 2400 strength.
>This acceptance of the minimum strength (i.e. 2400) of top programs provides a
>strong foundation on which to be able to test for and provide a relatively
>accurate rating (all ratings being, of course, only relatively accurate, even
>for humans).

However, the "standard" still applies... If you *assume* all programs are 2400
and one is a "killer"...  then *that* program is going to be 2800+ very quickly,
solely because it can toast a program that seems to be 2400.  Starting at 2400
is wrong...  it should use the normal "performance rating" approach for the
first 24 games to get a good starting point.  And even then, the ratings won't
be comparable to FIDE because the games aren't played in that rating pool.
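
For what it's worth, the usual provisional calculation is simple enough to
sketch.  This is just the common linear performance-rating approximation
(average opponent rating plus 400 times the win-loss margin per game); the
24-game run against nominal 2400 opponents is a made-up example, not real data:

    # A minimal sketch of the common linear performance-rating approximation:
    # average opponent rating plus 400 * (wins - losses) / games.
    def performance_rating(opponent_ratings, wins, losses):
        games = len(opponent_ratings)
        average_opponent = sum(opponent_ratings) / games
        return average_opponent + 400.0 * (wins - losses) / games

    # Hypothetical 24-game provisional period against nominal 2400 opponents,
    # scoring +16 -4 =4.
    opponents = [2400] * 24
    print(performance_rating(opponents, wins=16, losses=4))   # 2600.0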


The main point is that computer vs computer is far different from computer vs
human.  Small differences in a single program can produce lopsided match results
when the two programs are equal everywhere else.



>>
>>The only way to get reasonable Elo ratings for programs is to play them against
>>humans, and not against each other.
>
>I know that this type of statement seems intuitively correct.  The fault that I
>find with it is this: firstly, in testing for a rating it is best to test
>against multiple opponents, and all opponents will not spot the weakness, or
>necessarily take advantage of it in the same capable way.  The reason that you
>can get a relatively good rating is because you are testing against multiple
>styles of play from different opponents.  This is slightly conceptually
>difficult to understand, so I will provide an example.  "I have an opponent who
>is rated lower than me, but this particular opponent has a style that causes me
>particular difficulty, and he beats me almost all the time.  Yet I perform in
>tournaments to a degree he has never come close to.  So he is like your
>computer that always spots the weakness, but my play against multiple opponents
>counteracts this effect on my rating."  If, however, you have a program that
>loses to all (or almost all) opponents, regardless of the reason you give,
>ultimately that program lost because it is weaker, and thus its rating will
>drop.  Human players of comparable strength to the winning computer would quite
>likely take advantage of the weakness as well.


your "weakness" isn't a weakness.  It is based on statistics and if you play
a pool over and over, and your friend plays the same pool of players, your
ratings are going to stabilize at points that reflect your skills against
that pool of players.  It does *not* suggest that the two of you are going to
have a specific game-outcome with 100% reliability... it just says that after
both have played that large pool of players, your probability of winning against
them is X, while his is Y.
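
Those probabilities come straight from the standard Elo expected-score curve.
Here is a minimal sketch with made-up ratings, just to show how X and Y fall
out of the rating differences against the pool:

    # Minimal sketch of the standard Elo expected-score formula: the predicted
    # score of a player rated r_a against an opponent rated r_b.
    def expected_score(r_a, r_b):
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    # Made-up ratings: you at 2450 and your friend at 2250, both facing a
    # pool of roughly 2400-rated players.
    print(expected_score(2450, 2400))   # ~0.57  -> "probability X"
    print(expected_score(2250, 2400))   # ~0.30  -> "probability Y"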

When you take your example, you are highlighting the very issue I bring up in
computer chess:  a program built to beat other programs is quite a bit different
from a program built to beat humans.  And when you play comp vs comp, you find
the program that is best suited to beat other programs (Fritz, for example),
while when playing comp vs human, you will probably find that a *different*
program will maintain the highest rating...




>
>> Computer vs Computer is a vastly different
>>game than computer vs human.  You can take any program, make a fairly serious
>>change to the eval, or to the search extensions, and not see much difference
>>against a pool of humans.
>
>If this is a weakness which has been induced such that it is always taken
>advantage of by computers, it can be taken advantage of by quite a few humans as
>well; and if it isn't, then the program is still stronger overall than the human
>it played.


The problem is it can be a "tiny" weakness.  But if two programs know *everything*
that the other one knows, the one with one extra piece of knowledge (assuming it
is useful, of course) has an advantage.  IE two trains on the same track heading
in opposite directions: one has 12,400 horsepower, the other has 12,401.  The
extra horsepower is going to eventually win the pulling contest.  While a
human probably couldn't tell the difference...

This makes version-a vs version-b testing *very* difficult.  Because it is
actually possible to write code that is worse, but which produces a better
match result against that old version.



>
>>But a strong computer opponent will quite quickly
>>"home in" on such a problem and make you lose game after game.
>>
>>Ie in the famous first paper on singular extensions, Hsu and company reported a
>>really significant rating change, when comparing DT with SE to DT without.  They
>>later noticed that the difference was way over-exaggerated, because the only
>>difference between the two programs was SE.  Their last paper suggested that SE
>>was a much more modest improvement.
>
>I'm not certain who or what they were testing against to get the rating, but if
>the testing was done only against one or two opponents (which I strongly
>suspect), then this is where the error lies.  It's just like the weaker player I
>mentioned who always beats me: if you based his rating strictly on the games
>between us, his rating would be over 2250 (hundreds of points stronger than his
>real strength).
>>
>>If I simply took crafty as 2,000 Elo, for version 1, and then played each
>>successive version against the previous one, and used the traditional Elo rating
>>calculation, I would now be somewhere around 5500+.  Because minor changes make
>>major differences in the results between A and B, yet do very little in A vs H,
>>where H is a human.
>
>This would be far from the case however if you placed it in a pool with all top
>programs.


Maybe, or maybe not.  Because it might do well in that pool, but get swamped
by a group of strong humans.  Or it might do badly in *that* pool, but swamp
a group of humans that would wipe that "electronic" group clean.  The issue of
a "rating" is foreign to what is being done so far.  The only way to get a
rating (estimate of outcome vs players in a known rating pool) is to play in
that pool.  Not in a comp-vs-comp pool that will almost guarantee a vastly
different rating order...

It will certainly predict the outcome for the comp-vs-comp games... but no one
uses the number like that... they try to extrapolate the results of comp vs
human games, based on the rating obtained in comp vs comp games.  And it won't
work, ever...
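
To put a number on the crafty 2000-to-5500 example above, here is a sketch of
what happens if each new version only plays its predecessor and the rating is
updated incrementally.  The K-factor, match length, and the 75% score are
assumptions chosen only to illustrate the drift, not measured data:

    # Sketch of incremental Elo updating applied only to version-vs-version
    # matches.  K, the match length, and the 75% score are illustrative
    # assumptions.
    def expected_score(r_a, r_b):
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    K = 32            # per-game K-factor (assumed)
    games = 40        # games per version-vs-version match (assumed)
    rating = 2000.0   # starting rating for "version 1"

    for version in range(2, 21):
        old = rating
        score = 0.75 * games                          # new version scores 75%
        expected = expected_score(old, old) * games   # starts at old's rating, so 50%
        rating = old + K * (score - expected)         # each match adds 32 * 40 * 0.25 = 320
    print(rating)   # 8080.0 after 19 version bumps -- pure pool-internal drift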





>>
>>
>>
>>
>>>
>>>> Because programs only learn to avoid certain lines, they really
>>>>don't learn like humans anyway so no rating system will make their ratings like
>>>>human ratings. Besides the SSDF list is only good for comparative purposes.
>>>
>>>That's the problem: it's not good for comparative purposes.  I wish it were; I'm
>>>sure you have seen my discussions on here demonstrating how Fritz is GM strength
>>>(which it is).  However, apparently it's difficult to show that using the
>>>current SSDF system, because OBVIOUSLY many people don't accept it.  If they
>>>did, when I said Fritz is GM strength because its Elo is 2589, there would be no
>>>disagreement.
>>>
>>
>>
>>the problem is that SSDF has too much inbreeding in the ratings.  And no one
>>has ever taken the time, nor gone to the expense, to enter a computer into FIDE
>>tournaments (FIDE membership is possible, but finding a tournament that would
>>allow computers might be much more difficult).  So it is quite possible that
>>Fritz, at 2580, is 400 points better than the Fidelity Mach III at 2180.  But
>>would that hold up in human events?  I doubt it.  I suspect Fritz would lose
>>more than expected, and the Mach III would win more than expected.  For the
>>reasons I gave above.
>>
>
>As for Fritz, it might do better or worse than a 2580 Elo; it depends on lots of
>factors, such as tournament type.  Fritz would do very well in a Swiss-system
>tournament.  It would still be 2500 Elo at the least, I'm certain.  Playing in
>human tournaments would be quite beneficial, though I would always much prefer
>data from Swiss events as compared to invitationals.
>>
>>
>>
>>>
>>>>You
>>>>are attaching too much importance to the isolated rating number.
>>>
>>>No, I'm not.  Ratings are all-important; they are the only way to show the
>>>relative strength of computers against human strength.  Thus it is very
>>>important to isolate a VALID rating for a program: firstly, so that you can know
>>>how computers really compare to humans, and secondly, so that we can gauge
>>>exactly how far along the evolutionary track programs are.
>>>
>>
>>
>>
>>correct, but not easily doable.  IE computer vs computer has *nothing* to do
>>with computers in the human-chess-tournament world.
>
>Well, I would agree that one computer vs a single other computer wouldn't mean
>much, but against multiple styles of programs I believe you can garner a
>rating of relatively strong reliability.
>


Stop and ask a good GM about computers and humans for opponents.  He'll give
you more info, quicker, than I can.  But the ones I know can (a) tell almost
immediately (within a game or two) if they are playing computers and (b) will
alter their style knowing this.



>> Because it is all about
>>statistics, and given two different "pools" of players, the absolute ratings
>>would vary significantly, and the spread would vary as well, because the
>>expected outcome of computer vs computer is different than computer vs human.
>
>This statistical outcome is only different because of the point that this thread
>is making, that point being that the current system of calculating computer
>ratings incrementally, like humans', isn't accurate for computers.  I believe
>the procedure I have outlined is a fair degree more accurate and would be a step
>in the right direction toward making the ratings of computers have the same
>statistical effect against human populations.  There is a problem of human bias,
>though, that I won't go into in much detail, such as the fact that I can beat CM
>4000 Turbo a higher percentage of the time than average, because I beat it once
>and I can often repeat the same game; there is some possibility of this in
>tournament play for computers.  And there is anti-computer chess as opposed to
>regular chess play, though I'm starting to suspect that for non-grandmasters,
>attempts at anti-computer chess will garner more losses than wins.


At sub-IM levels, I agree.  But IM's are quite capable of using "anti-computer"
strategies and most are positionally and tactically strong enough to not get
in over their heads.  The danger is taking "anti-computer" to the level where
you end up in a position you don't quite understand, or worse...




>>
>>Fritz is ideally suited to play other computers.  Very fast, very deep.  But
>>I'd expect it to do worse than some other programs against a group of GM
>>players.  Anand was an example: he shredded Fritz trivially, but had to work to
>>beat Rebel in the two slow games.  Yet I'm sure that Fritz will beat Rebel in a
>>match, as has been seen on SSDF.
>
>Well, Fritz didn't get to play any 40/2 games, so I don't know how it would
>have done.  I would, though, point out that I have read a quote of Anand saying
>he plays Fritz all the time.  When I play weaker opponents at the club who for
>some reason don't want to change the way they are playing, my win percentage
>increases; Anand had this advantage with Fritz.
>>
>
>I'm personally beginning to think that Chessmaster, tested on the new faster
>hardware, is the strongest program.  It has no optimized books, it isn't tuned
>against other programs, and yet it just beat Rebel 7 to 2 in a 40/2 match on
>Leubke's page.  That's besides my own testing, which bears out pretty much the
>same, though I'd think not quite 7 to 2.
>



If you go back to r.g.c.c a couple of years ago, I pointed out that of all the
programs I was playing on the chess servers, ChessMaster consistently gave me
the most problems.  It is good, and will continue to be good, IMHO...



>>I'm more interested in the computer vs human games, but I do pay attention to
>>computer vs computer when possible...
>>
>>
>>>> Ratings abhor a
>>>>vacuum. You need lots of competitors to have a good system and the SSDF is a
>>>>closed shop.
>>>
>>>No, they are not a closed shop, as the data is readily available to be examined
>>>and calculated by anyone with the inclination.  They have no stranglehold on the
>>>knowledge of how to calculate ratings, and if you look at another of the
>>>follow-ups to this post, you will find that the SSDF is in fact instituting a
>>>plan similar to the one I have suggested (recalculating from scratch, not
>>>incrementally).
>>>
>>
>>but it has the same problem..  because someone will still assume that Elo 2500
>>is the same as SSDF 2500.  And it still won't be so, until the games come from
>>players in a *common population*...
>>
>No, it's not a problem, because on a corrected system an SSDF 2500 would be
>relatively equivalent to an Elo 2500.  Games within a common population would
>make it even more accurate (more than likely), but despite this you can still
>come to a relatively accurate rating.


Here we just have to agree to disagree.  Elo is all about sampling theory and
probability analysis.  There is *no* way to normalize ratings between two
different sampling groups.  Other than to combine them and let them play in
the same pool.
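
One way to see why: the Elo formula only looks at rating *differences* within a
pool, so you can shift every rating in a pool by any constant and predict
exactly the same results.  A tiny sketch, with made-up ratings:

    # Sketch: Elo expectations depend only on rating differences, so shifting
    # an entire pool by a constant changes no predicted result.  There is
    # nothing inside the pool to pin it to the FIDE scale.  Ratings are made up.
    def expected_score(r_a, r_b):
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    pool = {"prog_a": 2580, "prog_b": 2450, "prog_c": 2300}
    shift = -200   # any constant at all

    for x in pool:
        for y in pool:
            if x != y:
                assert abs(expected_score(pool[x], pool[y])
                           - expected_score(pool[x] + shift, pool[y] + shift)) < 1e-12
    print("identical predictions either way")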


>>
>>
>>
>>>
>>>Shaun


