Author: Stephen A. Boak
Date: 19:21:58 11/29/99
On November 29, 1999 at 20:07:51, Robert Hyatt wrote:

>On November 29, 1999 at 18:30:26, Charles Unruh wrote:
>
>>On November 29, 1999 at 18:08:38, Tim Mirabile wrote:
>>
>>>On November 29, 1999 at 14:37:31, Charles Unruh wrote:
>>>
>>>>On November 29, 1999 at 14:08:46, Enrique Irazoqui wrote:
>>>>
>>>>>Why don't we forget about absolute ratings and make them program-relative
>>>>>instead?
>>>>
>>>>Because most of us, as well as most consumers, are not as concerned with the
>>>>relative strength of comps vs comps as we are with comps vs humans.
>>>
>>>Then we should be playing hundreds of games with the programs against humans of
>>>various strengths above and below what we think is the approximate strength of
>>>the program.
>>
>>Well of course! That's what everyone wants, but it's not going to happen. We
>>don't have that, so we have to use what's available.
>>
>>>Anything less will not give you what you want to an accuracy
>>>of better than a few hundred points.
>>
>>The SSDF was originally based on games vs humans. We can't calibrate the current
>>ratings back exactly, but the drift should be able to be calculated to within at
>>least +/- 50 or so.
>
>Based on what statistical theory? The rating pools have changed dramatically.
>The ratings have nothing to do with the original SSDF ratings now, as none of
>the original 'pool' remains active...

I have the same question as Hyatt. Here's the scenario I envision:

Assume programs are a combination of abilities in strategical, positional and tactical move selection. I use 'strategical' to mean a capability of long-term assessment that can help steer the program into positions that are likely to be more favorable (or less unfavorable) than the current position, and avoid steering it into positions that are likely to be less favorable (or more unfavorable). Assume programs are very weak in strategical capability, only fair to middling in positional capability, and strong in tactical capability.

Rate these programs initially by competing them in many, many games against a pool of chess players rated in some rating system, FIDE for example.

Now, after obtaining initial ratings based on games against FIDE-rated human players, begin to improve the capabilities of the programs. Since programming strategical and positional ability in software is apparently more difficult than programming tactical calculation, assume all the programs are much improved tactically, step by step, over a long time, while the strategical capabilities are hardly improved at all and the positional capabilities are improved only very slowly.

As various programs are improved, assume the new versions compete only against the older program versions with established FIDE ratings (with essentially no more games against FIDE-rated humans). Since the newer versions will be about the same strength strategically (still very weak compared to better human players), only slightly better positionally (perhaps still only moderate, compared to better human players), but much better tactically than the older rated versions (and much improved versus better human players), the newer versions will outplay the older versions (largely due to tactical improvement) and gain rating points. Over time, assume the newer programs are competed against the relatively older but more recent programs only (never against the very oldest programs).
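To make that last point concrete, here is a rough sketch in Python of how a new version gains rating points purely from games against its rated predecessor. It uses only the standard Elo arithmetic, with an assumed K-factor and invented ratings; the SSDF's exact calculation may differ in detail:

def expected_score(r_a, r_b):
    # Expected score of A vs B under the Elo model
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update(r_a, r_b, score_a, k=16.0):
    # New rating for A after one game (1 = win, 0.5 = draw, 0 = loss)
    return r_a + k * (score_a - expected_score(r_a, r_b))

old_version = 2400.0     # predecessor, rating assumed once anchored to humans
new_version = 2400.0     # new version enters the pool at the same rating
for game in range(100):  # new version scores 65% against the old one
    new_version = update(new_version, old_version, 0.65)
print(round(new_version))
# Climbs toward the ~108-point gap at which a 65% score becomes the
# 'expected' result -- gained entirely from intra-pool games, no humans.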
As the program generations are released and tested against other recent program generations, the newer programs continue to climb in rating, being always better, largely for tactical reasons, than their predecessors. However, since the programs are not improving in the strategical area, and only very slowly in the positional area, they may *not* be gaining as much relative strength in real life, versus humans, as they are gaining in Elo from continued play versus computer programs only (no humans).

When the later generations are finally tested again in many games versus stronger humans, their tactical skills hold up, and may even surpass those of some very strong human players (especially under faster time controls, or in the complicated positions that sometimes arise). Their positional skills may not be quite as good as those of the strongest human opponents, since those abilities improved only very slowly in the programs, even over many years. And the strategic skills of the computers may still be woefully behind those of most of the stronger human players, since that aspect is very difficult to program, and very difficult to weight against more short-term positional and tactical factors. The stronger human players will pick the programs apart at the seams, finding their strategic and positional weaknesses. They may even find occasional 'tactical situations' to exploit, where the programs' null-move approaches occasionally discard, without adequate analysis, quiet moves that lead to winning or losing positions (moves not overlooked by strong humans).

This scenario points out how programs can continually raise their ratings in a largely self-contained program pool (one that was once rated in a reasonable fashion), gaining rating points at the expense of older program versions (not at the expense of rated humans!) by improving their 'program-type' skills (largely tactical) versus their prior peer programs, while at the same time gaining much less relative strength against strong humans, who continue to use (and excel in) strategical and positional understanding to combat the increased tactical skill of software. The ratings of the computer programs will be found to be inflated.

Determining the amount of the inflation a priori (in advance), before the latest computer programs again play many games against rated players, is highly problematic and extremely speculative at best. The difficulty, as pointed out by Hyatt, is that there is no exact mathematics or statistics to show how much inflation has occurred in program ratings (versus human ratings) when the two competing pools (computer and human) have diverged and remained non-overlapping for lengthy periods.

There is some hope of carrying out this calculation of inflation, in my opinion, in lieu of playing huge numbers of games with many programs versus many strong, rated human players. It rests on having some current programs play enough rated games against strong humans (at suitable time controls, with suitable incentives and controlled playing conditions, as in the Rebel GM challenge matches, for example).
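For illustration only, the same Elo arithmetic shows how the pool rating and the 'true' strength versus humans could drift apart over several generations. Every number here is invented, and I simplify by letting each generation's pool rating settle at its performance level against its predecessor (60% score) while its assumed real strength against humans creeps up only 20 points per generation:

import math

def elo_gap(score):
    # Elo difference at which 'score' is the expected result
    return 400.0 * math.log10(score / (1.0 - score))

pool_rating = 2300.0   # generation 0, assumed once calibrated against humans
vs_human = 2300.0      # assumed true strength versus humans (invented)
for gen in range(1, 9):
    pool_rating += elo_gap(0.60)  # ~70 points per generation from intra-pool play
    vs_human += 20.0              # assumed slow real gain versus humans
    print(gen, round(pool_rating), round(vs_human),
          "inflation:", round(pool_rating - vs_human))

After eight such generations the pool rating has risen by roughly 560 points while the assumed strength against humans has risen by only 160, an inflation of about 400 points in this made-up example.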
As an example of this, if Rebel 10 is assumed to be approximately 2500 in strength based on enough GM challenge matches, and a few other programs are tested in somewhat similar fashion (perhaps even in strong human open or closed tournaments, Swiss or round robin), then the relative program ratings within the computer pool could be adjusted up or down, closer together or farther apart, to tie them as a group to the 'pegged values' of the programs tested against humans, until the relative ratings of humans and programs are perceived as better estimated, i.e. inflation seems reduced with respect to computer and human ratings (this is difficult to 'prove' without lots of actual program-human testing, with many programs and many strong humans) and in accord with the human-rating-indexed programs such as Rebel 10. A rough sketch of this adjustment is in the P.S. below.

Even this is very, very speculative, since the best tactical program, with the highest computer rating, may possibly have worse strategic and positional skills than a somewhat lower-rated computer program, which might leave its rating against humans relatively lower than that of the lesser computer program (which may have better strategical and positional skills). There is no guarantee that real testing against humans will produce the same relative rankings among computer programs as the previous all-computer competition has produced. This is what Ed Schroder and others have indicated many times in many ways. It is because we have no defined, measured and tested scale against which to determine the relative effects of the strategic, positional and tactical skills of a program versus a strong human, which are interrelated factors in the strength of both programs and humans.

The only true test is to pit many top programs many times against many top humans, under reasonably controlled conditions.

--Steve Boak
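P.S. A very rough Python sketch of the 'pegging' adjustment I have in mind. Every number is invented for illustration except Rebel 10's approximate 2500 versus GMs, which is the assumption stated above; the program names and pool figures are hypothetical:

pool_ratings  = {'Rebel 10': 2620, 'Program B': 2580, 'Program C': 2540}  # hypothetical all-computer pool ratings
human_anchors = {'Rebel 10': 2500, 'Program B': 2430}                     # hypothetical performance ratings vs humans

# Average gap between the computer-pool ratings and the human-anchored ratings
offset = sum(pool_ratings[p] - human_anchors[p] for p in human_anchors) / len(human_anchors)
adjusted = {p: r - offset for p, r in pool_ratings.items()}
print('estimated pool inflation:', round(offset), 'Elo')
print(adjusted)

Note that a single uniform shift like this cannot capture the caveat above: the program with the best intra-pool (largely tactical) rating may still do relatively worse against humans than a lower-rated but strategically sounder program. Only real games against strong humans can settle that.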