Computer Chess Club Archives



Subject: Re: Calculating Computer Ratings????

Author: Shaun Graham

Date: 12:39:47 08/03/98




>
>However, the "standard" still applies... If you *assume* all programs are 2400
>and one is a "killer"...  then *that* program is going to be 2800+ very quickly,


Robert, I think you are just trying to argue again :), because if there were a
program so "killer" that it could defeat all the top programs all the time,
then it would be 2800!  Anand has just demonstrated that even he can't beat all
the top programs all the time!  Say we put Deep Blue in the mix: I doubt that
even it would score 100%, probably not even 90%, but it would probably end up
being 2690 or even 2700, and that's because it is that strong (to say it isn't
is literally an insult to Kasparov)!!  In test trials of doing just what I have
described, the other programs all land in the 2450-2550 rating range.  None of
them is, as you put it, a "killer", for the reason that a 2800 program doesn't
exist (unless it's Deep Blue :)).
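
For reference, the standard incremental Elo update behind that "2800+ very
quickly" point looks roughly like the sketch below.  This is a minimal
illustration, not anything from this thread: the K-factor of 32 and the
all-2400 pool are assumptions made for the example.

    # Standard incremental Elo update (logistic expected score).
    # Assumptions for illustration only: K = 32, every opponent rated 2400.

    def expected(r_player, r_opp):
        # Probability that the player scores against this opponent.
        return 1.0 / (1.0 + 10.0 ** ((r_opp - r_player) / 400.0))

    def update(r_player, r_opp, score, k=32):
        # score: 1.0 = win, 0.5 = draw, 0.0 = loss.
        return r_player + k * (score - expected(r_player, r_opp))

    rating = 2400.0
    for game in range(30):               # a "killer" that wins every game
        rating = update(rating, 2400.0, 1.0)
    print(round(rating))                 # roughly 2680 after 30 straight wins

So a program that really did beat everything would climb toward 2700 and
beyond within a few dozen games; the dispute is only whether such a program
exists.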

>solely because it can toast a program that seems to be 2400.  Starting at 2400
>is wrong...  it should use the normal "performance rating" approach for the
>first 24 games to get a good starting point.

The formula I have proposed is based on the provisional formula, but as I said
earlier, and as someone else pointed out, with computers it makes no sense to
stop using this formula after 24 games, because then you would again be in the
situation of adjusting ratings incrementally, which does not work well with
computers.  And there is no good objection to starting the programs at an Elo
of 2400, for the reason that we know they are at LEAST this strength.  In fact,
under USCF and FIDE rules, if a tournament director knows the strength of an
unrated player, he can at his discretion assign a rating reflecting that
strength (an Elo rating).  If that rating is less than the actual strength of
the player, his performance from that point on will push the rating to where it
should be.  The most you might find, on your account, is that the programs are
even stronger than 2400, which would be borne out anyway once they started
winning games from a 2400 start.  But by starting them at 2400 you avoid any
question of the ratings being too high.
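
To make the proposal concrete, here is a minimal sketch of what "never stop
using the provisional formula" could mean in practice, using the familiar
linear performance-rating approximation: avg_opp + 400*(W - L)/N.  The sample
results below are invented for illustration.

    # Performance (provisional-style) rating, recomputed from scratch over
    # ALL games, rather than switching to incremental updates after game 24.

    def performance_rating(results):
        # results: list of (opponent_rating, score), score in {1, 0.5, 0}
        n = len(results)
        avg_opp = sum(r for r, _ in results) / n
        wins = sum(1 for _, s in results if s == 1)
        losses = sum(1 for _, s in results if s == 0)
        return avg_opp + 400 * (wins - losses) / n

    # Invented sample: 3 wins, 1 draw, 1 loss against ~2500 opposition.
    games = [(2450, 1), (2500, 0.5), (2550, 0), (2480, 1), (2520, 1)]
    print(round(performance_rating(games)))      # -> 2660

Whether this exact linear form is the one to adopt is not settled here; the
point is only that the whole game history is re-evaluated each time, so no
early result ever gets locked in incrementally.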

>And even then, the ratings won't
>be comparable to FIDE because the games aren't played in that rating pool.
>
>
>The main point is that computer vs computer is far different from computer vs
>human.  Small differences in a single program can produce lopsided match results
>when the two programs are equal everywhere else.

Yes, between two programs, but not among 7 or 8 programs; there the
lopsidedness is canceled out.
>
>
>
>>>
>>>The only way to get reasonable Elo ratings for programs is to play them against
>>>humans, and not against each other.
>>
>>I know that this type of statement seems intuitively correct.  The fault I
>>find with it is this: in testing for a rating, it is best to test against
>>multiple opponents, and all opponents will not spot the weakness, or
>>necessarily take advantage of it in the same capable way.  The reason you can
>>get a relatively good rating is that you are testing against multiple styles
>>of play from different opponents.  This is slightly conceptually difficult,
>>so I will provide an example.  I have an opponent who is rated lower than me,
>>but this particular opponent has a style that causes me particular
>>difficulty, and he beats me almost all the time.  Yet I perform in
>>tournaments to a degree he has never come close to.  So he is like your
>>computer that always spots the weakness, but my play against multiple
>>opponents counteracts this effect on my rating.  If, however, you have a
>>program that loses to all (or almost all) opponents, then regardless of the
>>reason you give, ultimately that program lost because it is weaker, and thus
>>its rating will drop.  Human players of comparable strength to the winning
>>computer would quite likely take advantage of the weakness as well.
>
>
>your "weakness" isn't a weakness.  It is based on statistics and if you play
>a pool over and over, and your friend plays the same pool of players, your
>ratings are going to stabilize at points that reflect your skills against
>that pool of players.

What you are saying misses the point: we have played in the same pool over and
over, and I'm 200 points higher rated, yet he beats me.  Why does he beat me?
The answer, as best I can figure it, is simply that he has the unique
combination of happening to be strongest against the very lines that I play;
his strengths simply happen to lie where my weaknesses are.  That Conan Doyle
axiom: when you have eliminated all other possibilities, whatever remains must
be the truth.  We have both played a significant number of games against each
other, and a significant number of tournament games against the larger pool.
He beats me, but can't perform against the pool as well as I can.  Among
chessplayers this sort of thing is not a rarity.  Fischer once said of Tal,
"He's beaten me 4 times in a row, but I still say he plays unsound chess"
(Geller's name could just as well be substituted in that quote).  So the result
is that overall I am the better player (against the pool), but when the pool is
reduced to just the two of us, it appears he is the better player.
Counterintuitive, certainly, but that's the way it is.  It's as if he has
"anti-Shaun" programming: it works well against Shaun, but not against the rest
of the pool.  Further, considering that I'm the current Reserve state champion
and also the former blitz champion, it doesn't have anything to do with me
being inconsistent or unlucky.

>It does *not* suggest that the two of you are going to
>have a specific game-outcome with 100% reliability... it just says that after
>both have played that large pool of players, your probability of winning against
>them is X, while his is Y.

Yes, I may have a probability of scoring X against the pool while he has a
probability of Y, but that does not necessarily determine, and in this case
obviously has little to do with, his probability against me.
>
>When you take your example, you are highlighting the very issue I bring up in
>computer chess:  a program to beat other programs is quite a bit different from
>a program to beat humans.  And when you play comp vs comp, you find the program
>that is best suited to beat other programs (fritz, for example) while when
>playing comp vs human, you will probably find that a *different* program will
>maintain the highest rating...
>
This is a logical possibility only if the programs in the pool you are testing
are considerably alike in their play.  If, however, a single program performs
best against a pool of programs with considerably different styles, then the
likelihood is that it is also the program that plays best against humans, for
the reason that it can deal with multiple styles of play the best.  From what I
have seen, there is considerable difference in the styles of programs: Genius
(an overly defensive player), Chessmaster (master of the initiative), Hiarcs (a
bum program :)), Rebel (a proven, solid, mostly positional style), Fritz (a
magician).


>
>
>
>>
>>>Computer vs Computer is a vastly different
>>>game than computer vs human.  You can take any program, make a fairly serious
>>>change to the eval, or to the search extensions, and not see much difference
>>>against a pool of humans.
>>
>>If this is a weakness which has been induced such that it is always taken
>>advantage of by computers, it can be taken advantage of by quite a few humans
>>as well; and if it isn't, then the program is still stronger overall than the
>>human it played.
>
>
>problem is  it can be a "tiny" weakness.  But if two programs know *everything*
>that the other one knows, the one with one extra piece of knowledge (assuming it
>is useful of course) has an advantage.  IE two trains on the same track heading
>in opposite directions, one has 12,400 horsepower, the other has 12,401.  The
>extra horsepower is going to eventually win the pulling contest.  While a
>human probably couldn't tell the difference...

This sort of analogy doesn't really work with chess, for the reason that the
one extra thing isn't going to be applicable in the positions of every game.
Program "A" may have a slight tactical weakness, play the Sicilian, and get
munched; in its next game, a closed Ruy, the effects of the tactical weakness
won't necessarily be exploitable to the same degree, or at all, and it may be
stronger positionally and win the game.  Sometimes having that one extra thing
is a weakness and sometimes it is a strength; it depends on the game.
>
>This makes version-a vs version-b testing *very* difficult.  Because it is
>actually possible to write code that is worse, but which produces a better
>match result against that old version.
>
>
As I said, it's pointless to simply test "A" against "B"; you must test against
many opponents.
>>
>>>But a strong computer opponent will quite quickly
>>>"home in" on such a problem and make you lose game after game.

As I said, different games require different strengths, so a program shouldn't
lose game after game unless it is ultimately much weaker against the larger
pool as well.
>>>
>>>Ie in the famous first paper on singular extensions, Hsu and company reported a
>>>really significant rating change, when comparing DT with SE to DT without.  They
>>>later noticed that the difference was way over-exaggerated, because the only
>>>difference between the two programs was SE.  Their last paper suggested that SE
>>>was a much more modest improvement.
>>
>>I'm not certain who or what they were testing against to get the rating, but
>>if the testing was done against only one or two opponents (which I strongly
>>suspect), then this is where the error lies.  It's just like the weaker
>>player I mentioned who always beats me: if you based his rating strictly on
>>the games between us, his rating would be over 2250 (hundreds of points
>>stronger than his real strength).
>>>
>>>If I simply took crafty as 2,000 Elo, for version 1, and then played each
>>>successive version against the previous one, and used the traditional Elo rating
>>>calculation, I would now be somewhere around 5500+.  Because minor changes make
>>>major differences in the results between A and B, yet do very little in A vs H,
>>>where H is a human.
>>
>>This would be far from the case, however, if you placed it in a pool with all
>>the top programs.
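
For concreteness, the runaway number described above is easy to reproduce with
invented figures: if each new version is rated purely from its head-to-head
match with the previous version, the chain drifts upward without bound.  A toy
sketch, assuming every version scores 65% against its predecessor over 32
versions (both numbers invented):

    # Toy illustration of rating inflation from chained A-vs-B testing.
    import math

    def elo_diff(score):
        # Elo gap implied by an expected score (inverse of the logistic).
        return 400 * math.log10(score / (1 - score))

    rating = 2000.0                  # version 1 taken as 2000
    for version in range(2, 34):     # 32 successive versions
        rating += elo_diff(0.65)     # each scores 65% vs its predecessor
    print(round(rating))             # ~5441 -- absurd against any real pool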
>
>
>Maybe, or maybe not.

Come on, Robert, you know there is no way Crafty would be 5500 against a pool
of multiple opponents.

>Because it might do well in that pool, but get swamped
>by a group of strong humans.  Or it might do badly in *that* pool, but swamp
>a group of humans that would wipe that "electronic" group clean.  The issue of
>a "rating" is foreign to what is being done so far.  The only way to get a
>rating (estimate of outcome vs players in a known rating pool) is to play in
>that pool.  Not in a comp-vs-comp pool that will almost guarantee a vastly
>different rating order...

There might be some difference, but I doubt there would be a significant
difference once you have established that a set of multiple programs (of
different styles) is at least human 2400 Elo.  This is for the reason that when
you separate two pools, the results usually don't become immediately skewed;
rather, they begin to skew as the two populations change.  But because
computers do not change the way other sorts of pools (of comparable things) do,
you simply will not get an increasing divergence.  Since this divergence is
largely eliminated, the pools themselves aren't really that much different from
each other.
>
>It will certainly predict the outcome for the comp-vs-comp games... but no one
>uses the number like that... they try to extrapolate the results of comp vs
>human games, based on the rating obtained in comp vs comp games.  And it won't
>work, ever...
>
>

It would of course be best if you could have many test games against humans,
but once it has been established that the programs are at a human Elo
(preferably 2400), the pools will be significantly similar, because the
computers will not diverge (in other words, a computer 2400 will not change).
It is foreseeable that in the human pool the meaning of a human 2400 could
possibly change, though I suspect this is unlikely.  Because this is a unique
situation where the possibility of significant divergence is limited, it makes
testing between computers comparable with games between humans.  Currently the
only problem (at least with the SSDF) is that they didn't start with a known
comparative value, i.e. a 2400 Elo rating for a computer.
>
>>>
>>>
>>>
>>>
>>>>
>>>>> Because programs only learn to avoid certain lines, they really
>>>>>don't learn like humans anyway so no rating system will make their ratings like
>>>>>human ratings. Besides the SSDF list is only good for comparative purposes.
>>>>
>>>>That's the problem: it's not good for comparative purposes.  I wish it
>>>>were; I'm sure you have seen my discussions on here demonstrating how Fritz
>>>>is GM strength (which it is).  However, apparently it's difficult to show
>>>>that using the current SSDF system, because OBVIOUSLY many people don't
>>>>accept it.  If they did, when I said Fritz is GM strength because its Elo
>>>>is 2589, there would be no disagreement.
>>>>
>>>
>>>
>>>the problem is that SSDF has too much inbreeding in the ratings.  And no one
>>>has ever taken the time, nor gone to the expense, to enter a computer into FIDE
>>>tournaments (FIDE membership is possible, but finding a tournament that would
>>>allow computers might be much more difficult).  So it is quite possible that
>>>fritz, at 2580, is 400 points better than the fidelity mach III at 2180.  But
>>>would that hold up in human events?  I doubt it.  I suspect Fritz would lose
>>>more than expected, and the Mach III would win more than expected.  For the
>>>reasons I gave above.
>>>
>>
>>As for Fritz, it might do better or worse than a 2580 Elo; it depends on lots
>>of factors, such as tournament type.  Fritz would do very well in a
>>Swiss-system tournament.  It would still be 2500 Elo at the least, I'm
>>certain.  Playing in human tournaments would be quite beneficial, though I
>>would always much prefer data from Swiss events over invitationals.
>>>
>>>
>>>
>>>>
>>>>>You
>>>>>are attaching too much importance to the isolated rating number.
>>>>
>>>>No, I'm not.  Ratings are all-important; they are the only way to show the
>>>>strength of computers relative to human strength.  Thus it is very
>>>>important to isolate a VALID rating for a program: firstly, so that you
>>>>know how computers really compare to humans, and secondly, so that we can
>>>>gauge exactly how far along the evolutionary track programs are.
>>>>
>>>
>>>
>>>
>>>correct, but not easily doable.  IE computer vs computer has *nothing* to do
>>>with computers in the human-chess-tournament world.
>>

I disagree for reasons given above.


>>Well, I would agree that one computer vs a single other computer wouldn't
>>mean much, but against multiple styles of programs I believe you can indeed
>>garner a rating of relatively strong reliability.
>>
>
>
>Stop and ask a good GM about computers and humans for opponents.  He'll give
>you more info, quicker, than I can.  But the ones I know can (a) tell almost
>immediately (within a game or two) if they are playing computers and (b) will
>alter their style knowing this.
>

???
>
>
>>>Because it is all about
>>>statistics, and given two different "pools" of players, the absolute ratings
>>>would vary significantly, and the spread would vary as well, because the
>>>expected outcome of computer vs computer is different than computer vs human.
>>
>>This statistical outcome is different only because of the point this thread
>>is making: that the current system of calculating computer ratings
>>incrementally, as for humans, isn't accurate for computers.  I believe the
>>procedure I have outlined is a fair degree more accurate and would be a right
>>step toward making the ratings of computers have the same statistical effect
>>against human populations.  There is a problem of human bias, though, that I
>>won't go into in much detail, such as the fact that I can beat CM 4000 Turbo
>>a higher percentage of the time than average, because I beat it once and can
>>often repeat the same game; there is some possibility of this in tournament
>>play for computers.  There is also anti-computer chess as opposed to regular
>>chess play, though I'm starting to suspect that for the non-grandmaster,
>>attempts at anti-computer chess will garner more losses than wins.
>
>
>At sub-IM levels, I agree.  But IM's are quite capable of using "anti-computer"
>strategies and most are positionally and tactically strong enough to not get
>in over their heads.  The danger is taking "anti-computer" to the level where
>you end up in a position you don't quite understand, or worse...
>
Ask Dean Hergott.  But this is neither here nor there, as I hold that the top
programs are GM strength, so it really doesn't matter what the IM does.  GM
DeFirmian says in "How to Play Better Chess": "If you are a GM you should be
able to overpower the IM tactically.  The GM will often blow out the IM in this
area" (pg 6).  Considering that tactics are almost universally argued to be the
strong point of computers, this considerably bolsters their position against
IMs.  No IM I can think of would have held that 40/2 match that occurred
between Rebel 10 and Anand.
>
>
>>>
>>>Fritz is ideally suited to play other computers.  Very fast, very deep.  But
>>>I'd expect it to do worse than some other programs against a group of GM
>>>players.  Anand was an example. Shredded Fritz trivially, had to work to beat
>>>Rebel in the two slow games.  Yet I'm sure that fritz will beat Rebel in a
>>>match, as has been seen on SSDF.
>>
>>Well, Fritz didn't get to play any 40/2 games, so I don't know how it would
>>have done.  I would point out, though, that I have read a quote from Anand
>>saying he plays Fritz all the time.  When I play weaker opponents at the club
>>who for some reason don't want to change the way they are playing, my win
>>percentage increases; Anand had this advantage with Fritz.
>>>
>>
>>I'm personally beginning to think that Chessmaster, tested on the new faster
>>hardware, is the strongest program.  It has no optimized books, it isn't
>>tuned against other programs, and yet it just beat Rebel 7 to 2 in a 40/2
>>match on Leubke's page; that's besides my own testing, which bears out pretty
>>much the same, though I'd think not quite 7 to 2.
>>
>
>
>
>If you go back to r.g.c.c a couple of years ago, I pointed out that of all the
>programs I was playing on the chess servers, ChessMaster consistently gave me
>the most problems.  It is good, and will continue to be good, IMHO...
>
This we can agree on.  Chessmaster is benefiting hugely from the faster
machines now (the 5000 engine was only tested on a P90).  I wouldn't be
surprised by a #1 ranking for 6000, though I think the strength really starts
to kick in when it's tested on at least a P233.  Unfortunately, the opening
book sucks :(.



>
>>>I'm more interested in the computer vs human games, but I do pay attention to
>>>computer vs computer when possible...
>>>
>>>
>>>> Ratings abhor a
>>>>>vacuum. You need lots of competitors to have a good system and the SSDF is a
>>>>>closed shop.
>>>>
>>>>No, they are not a closed shop, as the data is readily available to be
>>>>examined and calculated by anyone with the inclination.  They have no
>>>>stranglehold on the knowledge of how to calculate ratings, and if you look
>>>>at another of the follow-ups to this post, you will find that the SSDF is
>>>>in fact instituting a plan similar to the one I have suggested
>>>>(recalculating from scratch, not incrementally).
>>>>
>>>
>>>but it has the same problem..  because someone will still assume that Elo 2500
>>>is the same as SSDF 2500.  And it still won't be so, until the games come from
>>>players in a *common population*...
>>>
>>No, it's not a problem, because on a corrected system an SSDF 2500 would be
>>relatively equivalent to an Elo 2500.  Games within a common population would
>>(more than likely) make it even more accurate, but despite this you can still
>>come to a relatively accurate rating.
>
>
>Here we just have to agree to disagree.  Elo is all about sampling theory and
>probability analysis.  There is *no* way to normalize ratings between two
>different sampling groups.  Other than to combine them and let them play in
>the same pool.
>
>

Yes you can, if you can't find a significant difference between the groups,
especially when they have been pooled at times and there is no reason for the
two groups to diverge in nature once separated into two pools.
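
One way to make the disagreement precise: Elo expected scores depend only on
rating differences, so within a closed pool the entire scale can slide by a
constant with nothing observable changing.  That is why some external anchor
(a known human-strength value such as 2400), or a merged pool, is needed
before two scales can be compared at all.  A quick check, with invented
ratings:

    # Elo predictions depend only on rating DIFFERENCES: shifting every
    # rating in a closed pool by a constant changes no expected score.

    def expected(r_a, r_b):
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    pool = [2450, 2500, 2550]        # invented pool ratings
    offset = 300
    for a in pool:
        for b in pool:
            assert abs(expected(a, b) - expected(a + offset, b + offset)) < 1e-12
    print("expected scores unchanged under a constant shift")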
>>>
>>>
>>>
>>>>
>>>>Shaun


