Author: Shaun Graham
Date: 21:57:23 08/02/98
On August 02, 1998 at 09:49:34, Robert Hyatt wrote:

>On August 01, 1998 at 01:14:57, Shaun Graham wrote:
>
>>>Well if you would assign a rating of 2400 to all of your opponents, why wouldn't you do this to your program?
>>
>>In fact you do first assign your program a rating of 2400, but then you see how your program, as a 2400, has performed against all the other 2400s to get the new rating.
>>
>>>I admit that when the rating system was first started, the ratings had to be assigned for at least one program to start it off, but now that we have established ratings (SSDF for example), why would we need to assign ratings?
>>
>>You need to assign ratings only because the attempt here is to make the rating calculation more accurate. For instance, you will have a hard time convincing almost any of the computer aficionados here that Fritz is 2580+ Elo. In fact the SSDF makes it quite clear that the SSDF rating doesn't necessarily correspond to human Elo. So reassigning the rating is simply a first step in normalizing the ratings against human Elo, so that computer ratings and human ratings are comparable (currently they are not).
>
>Your point is valid, but the 2400 seems wrong, i.e. what convinces you that you should start there? For example, take an 1800 program and start it there and notice what it does to the other programs' ratings: they go up, but they should not. So you end up with what is commonly called "rating inflation." That's why most rating systems have a provisional period, to provide a better estimate of the rating based on results against those with "known" ratings.

Indeed, if you played an 1800-strength program, there would be some rating inflation.
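For reference, the incremental Elo arithmetic both posts assume can be sketched as follows. The K-factor of 32 and all the ratings are illustrative assumptions of mine, not figures from the thread:

```python
def expected_score(r_player, r_opponent):
    """Standard Elo expectancy: the probability-weighted score vs the opponent."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400.0))

def update(r_player, r_opponent, score, k=32):
    """One incremental Elo update (score: 1 = win, 0.5 = draw, 0 = loss).
    k=32 is an assumed K-factor, for illustration only."""
    return r_player + k * (score - expected_score(r_player, r_opponent))

# Hyatt's inflation point: an 1800-strength program wrongly entered at 2400.
# A real 2400 that merely draws against it loses nothing under the wrong label...
print(update(2400.0, 2400.0, 0.5))            # 2400.0
# ...but a draw against its true 1800 strength should cost about 15 points.
print(round(update(2400.0, 1800.0, 0.5), 1))  # 2385.0
```

Every opponent of the mislabeled program collects points it should not, which is the inflation being described.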
However, for what concerns us here, that being the so-called "top programs", we should test them only against other top programs, as is pretty much the practice of the SSDF, or at least include only their performances against the top programs in our calculations. Yes, this would seem to be a discriminatory practice; however, if it is our GOAL to come to more accurate conclusions about the strength of the CURRENT top programs, then it is a perfectly acceptable method of doing so. It serves the purpose of evaluating a program's strength against proven opponents (relatively known and trustworthy variables). Currently it is commonly accepted by most, and apparently even by you, that the "top programs" are at least 2400 strength. This accepted minimum strength (i.e. 2400) of the top programs provides a strong foundation on which to test for and produce a relatively accurate rating (all ratings being, of course, only relatively accurate, even for humans).

>The only way to get reasonable Elo ratings for programs is to play them against humans, and not against each other.

I know that this type of statement seems intuitively correct. The fault I find with it is this: in testing for a rating it is best to test against multiple opponents, and not all opponents will spot a given weakness, or take advantage of it in the same capable way. The reason you can get a relatively good rating is that you are testing against multiple styles of play from different opponents. This is slightly difficult to grasp conceptually, so I will provide an example. I have an opponent who is rated lower than me, but whose particular style causes me great difficulty, and he beats me almost all the time. Yet I perform in tournaments to a degree he has never come close to. He is like your computer that always spots the weakness, but my play against multiple opponents counteracts this effect on my rating.
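The nemesis example above can be made concrete with the standard performance-rating formula, which inverts the Elo expectancy to find the rating that would make an observed score the expected one. All ratings and percentages below are hypothetical numbers of my own choosing:

```python
import math

def perf_rating(avg_opp_rating, score_fraction):
    """Performance rating: the rating at which the observed score
    would be exactly the Elo-expected score."""
    p = min(max(score_fraction, 0.01), 0.99)  # clamp away from 0% and 100%
    return avg_opp_rating - 400.0 * math.log10(1.0 / p - 1.0)

# Rated only on his games against me (say I'm 2150), a nemesis who
# scores 80% on me looks like a ~2390 player...
print(round(perf_rating(2150, 0.80)))
# ...but across a varied pool averaging 1900, where he scores 50%,
# his performance is simply 1900.
print(round(perf_rating(1900, 0.50)))
```

This is why one stylistic mismatch distorts a head-to-head rating far more than a rating taken over a pool of differing styles.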
If, however, you have a program that loses to all (or almost all) opponents, then regardless of the reason you give, that program ultimately lost because it is weaker, and thus its rating will drop. Human players of comparable strength to the winning computer would quite likely take advantage of the weakness as well.

>Computer vs computer is a vastly different game than computer vs human. You can take any program, make a fairly serious change to the eval, or to the search extensions, and not see much difference against a pool of humans.

If this is a weakness which has been induced such that it is always taken advantage of by computers, it can be taken advantage of by quite a few humans as well; and if it isn't, then the program is still stronger overall than the human it played.

>But a strong computer opponent will quite quickly "home in" on such a problem and make you lose game after game.
>
>I.e. in the famous first paper on singular extensions, Hsu and company reported a really significant rating change when comparing DT with SE to DT without. They later noticed that the difference was way over-exaggerated, because the only difference between the two programs was SE. Their last paper suggested that SE was a much more modest improvement.

I'm not certain who or what they were testing against to get the rating, but if the testing was done only against one or two opponents (which I strongly suspect), then this is where the error lies. It's just like the weaker player I mentioned who always beats me: if you based his rating strictly on the games between us, his rating would be over 2250 (hundreds of points stronger than his real strength).

>If I simply took Crafty as 2,000 Elo for version 1, and then played each successive version against the previous one, and used the traditional Elo rating calculation, I would now be somewhere around 5500+.
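Hyatt's Crafty thought experiment is easy to reproduce numerically. A rough sketch, where the 50 versions and the 60% score per match are my assumptions, picked only to land near the 5500 figure he mentions:

```python
import math

def perf_gain(score):
    """Elo gap implied by a match score, via the inverse logistic."""
    return -400.0 * math.log10(1.0 / score - 1.0)

rating = 2000.0                # version 1, per the example in the thread
for _ in range(50):            # each new version is rated only against its
    rating += perf_gain(0.60)  # predecessor: about +70 Elo per step

print(round(rating))  # compounds well past 5500
```

Against humans each new version might gain only a handful of real points, which is exactly the self-play inflation being described: the head-to-head gaps add up, but the real-world gaps do not.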
>Because minor changes make major differences in the results between A and B, yet do very little in A vs H, where H is a human.

This would be far from the case, however, if you placed it in a pool with all the top programs.

>>>Because programs only learn to avoid certain lines, they really don't learn like humans anyway, so no rating system will make their ratings like human ratings. Besides, the SSDF list is only good for comparative purposes.
>>
>>That's the problem: it's not good for comparative purposes. I wish it were; I'm sure you have seen my discussions on here demonstrating how Fritz is GM strength (which it is). However, apparently it's difficult to show that using the current SSDF system, because OBVIOUSLY many people don't accept it. If they did, when I said Fritz is GM strength because its Elo is 2589, there would be no disagreement.
>
>The problem is that SSDF has too much inbreeding in the ratings. And no one has ever taken the time, nor gone to the expense, to enter a computer into FIDE tournaments (FIDE membership is possible, but finding a tournament that would allow computers might be much more difficult). So it is quite possible that Fritz, at 2580, is 400 points better than the Fidelity Mach III at 2180. But would that hold up in human events? I doubt it. I suspect Fritz would lose more than expected, and the Mach III would win more than expected, for the reasons I gave above.

As for Fritz, it might do better or worse than 2580 Elo; it depends on many factors, such as tournament type. Fritz would do very well in a Swiss-system tournament. It would still be at least 2500 Elo, I'm certain. Playing in human tournaments would be quite beneficial, though I would always much prefer data from Swiss events over invitationals.

>>>You are attaching too much importance to the isolated rating number.
>>
>>No I'm not.
>>Ratings are all-important; they are the only way to show the relative strength of computers against human strength. Thus it is very important to isolate a VALID rating for a program, firstly so that you can know how computers really compare to humans, and secondly so that we can gauge exactly how far along the evolutionary track programs are.
>
>Correct, but not easily doable. I.e. computer vs computer has *nothing* to do with computers in the human-chess-tournament world.

Well, I would agree that one computer vs a single other computer wouldn't mean much, but against multiple styles of programs I believe you can indeed garner a rating of relatively strong reliability.

>Because it is all about statistics, and given two different "pools" of players, the absolute ratings would vary significantly, and the spread would vary as well, because the expected outcome of computer vs computer is different than computer vs human.

This statistical outcome is only different because of the point this thread is making: the current system of calculating computer ratings incrementally, like human ratings, isn't accurate for computer use. I believe the procedure I have outlined is a fair degree more accurate, and would be a step toward making computer ratings have the same statistical effect against human populations. There is a problem of human bias that I won't go into in much detail, such as the fact that I can beat CM 4000 Turbo a higher percentage of the time than average, because I beat it once and can often repeat the same game; there is some possibility of this happening in tournament play against computers. There is also anti-computer chess as opposed to regular chess play, though I'm starting to suspect that for non-grandmasters, attempts at anti-computer chess will garner more losses than wins.

>Fritz is ideally suited to play other computers. Very fast, very deep.
>But I'd expect it to do worse than some other programs against a group of GM players. Anand was an example: he shredded Fritz trivially, but had to work to beat Rebel in the two slow games. Yet I'm sure that Fritz will beat Rebel in a match, as has been seen on the SSDF.

Well, Fritz didn't get to play any 40/2 games, so I don't know how it would have done. I would point out, though, that I have read a quote from Anand saying he plays Fritz all the time. When I play weaker opponents at the club who, for some reason, don't want to change the way they are playing, my win percentage increases; Anand had this advantage with Fritz.

I'm personally beginning to think that Chessmaster, tested on the new faster hardware, is the strongest program. It has no optimized books and isn't tuned against other programs, and yet it just beat Rebel 7 to 2 in a 40/2 match on Leubke's page; that's besides my own testing, which bears out pretty much the same result, though not quite 7 to 2, I'd think.

>I'm more interested in the computer vs human games, but I do pay attention to computer vs computer when possible...

>>>Ratings abhor a vacuum. You need lots of competitors to have a good system, and the SSDF is a closed shop.
>>
>>No, they are not a closed shop, as the data is readily available to be examined and calculated by anyone with the inclination. They have no stranglehold on the knowledge of how to calculate ratings, and if you look at another of the follow-ups to this post, you will find that the SSDF is in fact instituting a plan similar to the one I have suggested (recalculating from scratch, not incrementally).
>
>But it has the same problem, because someone will still assume that Elo 2500 is the same as SSDF 2500. And it still won't be so until the games come from players in a *common population*...

No, it's not a problem, because on a corrected system an SSDF 2500 would be relatively equivalent to an Elo 2500.
Games within a common population would, more than likely, make it even more accurate; but even without that, you can still come to a relatively accurate rating.

>>Shaun
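As an addendum, the "recalculating from scratch, not incrementally" plan discussed above could be sketched as a fixed-point fit over a complete cross-table, holding one program at the agreed 2400 baseline. The three-program results matrix, the step size, and the iteration count below are all hypothetical choices of mine, not anything the SSDF has published:

```python
def expected(ra, rb):
    """Standard Elo expectancy of player a scoring against player b."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def recalc(names, scores, games, anchor, anchor_rating=2400.0, iters=200):
    """Refit every rating from the whole cross-table at once, nudging each
    program toward the rating that reproduces its observed pool score."""
    r = {n: anchor_rating for n in names}
    for _ in range(iters):
        for n in names:
            if n == anchor:
                continue  # the anchor pins the scale to the 2400 baseline
            exp = sum(expected(r[n], r[m]) * games[(n, m)]
                      for m in names if m != n)
            act = sum(scores[(n, m)] for m in names if m != n)
            r[n] += 2.0 * (act - exp)  # small proportional correction
    return r

# Hypothetical cross-table: 40 games per pairing, points scored by each side.
names = ["A", "B", "C"]
games = {(a, b): 40 for a in names for b in names if a != b}
scores = {("A", "B"): 24, ("B", "A"): 16,   # A scores 60% vs B
          ("A", "C"): 28, ("C", "A"): 12,   # A scores 70% vs C
          ("B", "C"): 24, ("C", "B"): 16}   # B scores 60% vs C
r = recalc(names, scores, games, anchor="A")
print({n: round(v) for n, v in r.items()})  # A stays at 2400; B and C settle below
```

Because every rating is refit from all results simultaneously, a single badly seeded entry cannot drag the whole pool upward the way it can under incremental updates.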
Current Computer Chess Club Forums at Talkchess.