Computer Chess Club Archives


Subject: Re: ELO inflation effect ... and SSDF

Author: Robert Hyatt

Date: 07:01:50 10/10/02



On October 09, 2002 at 21:44:41, Rolf Tueschen wrote:

>On October 09, 2002 at 12:59:24, Robert Hyatt wrote:
>
>>On October 09, 2002 at 05:54:18, Rolf Tueschen wrote:
>>
>>>On October 09, 2002 at 04:52:40, GuyHaworth wrote:
>>>
>>>>
>>>>Totally agreed:  only the differences between the ELO numbers are relevant.
>>>>
>>>>I believe there is an inflation effect in the ELO system.  Sadly, investigating
>>>>this - by theory or simulation - hasn't got to the top of my 'to do' list yet.
>>>>
>>>>Anyway, the more games played, the narrower the confidence bands on ELO figures,
>>>>but the greater the inflation.
>>>>
>>>>I believe it was for this reason, or for the sake of credibility, that SSDF
>>>>knocked back the absolute numbers a couple of years ago.  Maybe they knocked 100
>>>>points off or something?
>>>>
>>>>Other rating systems, like Thompson's for the PCA, maybe do the rating better
>>>>with less inflation, but they haven't been widely adopted.  Perhaps that's a
>>>>pity.
>>>>
>>>>g
>>>
>>>
>>>In Germany I read an interesting idea from Detlev Pordzik, aka Elvis, that SSDF
>>>should lower their values by 250 Elo points. So that would reduce the maximum
>>>numbers to 2500 and something.
>>>
>>>Again, what I've written hundreds of times, SSDF could do that but the inherited
>>>worst error in SSDF is the testing of machines from DIFFERENT pools! Exitus. The
>>>End.
>>>
>>>Rolf Tueschen
>>
>>
>>You lost me.
>
>Don't say such things without any emergency in sight!
>
>
>>
>>The "pool" the SSDF tests is the pool of computer chess programs, and in that
>>regard, I don't see where they make any mistakes.  Yes, they play games between
>>current programs and old programs.
>
>So you can't see any mistakes. Ok. And what about the control or the constancy
>of the variables? Did you forget that old progs have no learning at all? The
>differences in books? pppppp?!

But that is _part_ of the system.  If program A learns and program B does not,
then the expected win/loss ratio should favor program A.



>
>Bob, next week you'll tell me that the handicapped from the Paralympics could
>well "run" against the US 100 meter athletes! They are from the same pool, no?
>All human species. <cough>

Yes they could compete.   They would lose, and they would be ranked _below_ the
non-handicapped folks of course.  As they should be if you want to compare them
to each other...





>
>I thought that we (at least) would know that it's making no sense if we test
>several variables free floating at the same time. I mean, what would the results
>tell us? Or is it of great interest for you to receive statistical values for
>the obvious? That old is weaker than new? I mean, isn't it nonsense to prove
>that slow machines are weaker than fast ones?


The only testing flaw in the SSDF that I see has two prongs:  (a) too many
games between two programs;  (b) not enough games against the _entire_ pool of
players.





>
>My God, and you start a debate about inflation? I can't get it into my head
>what's going on here. Can't you see the ugly consequences if you give your
>blessing for such apparent nonsense?

The statistics _demand_ that old programs play new to establish ratings.
Otherwise you have two separate pools of players and the ratings don't mean
anything across the pools.
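
The "separate pools" point can be checked mechanically. The following is a minimal sketch, not anything the SSDF is known to run: it treats every game as a link between two programs and groups programs that are connected through some chain of games. The function name `rating_pools` and the program names are hypothetical. Ratings are only comparable inside one group; between groups the rating difference is not determined by any result.

```python
from collections import defaultdict


def rating_pools(games):
    """Group players into pools linked by at least one chain of games.

    `games` is an iterable of (player_a, player_b) pairs (hypothetical input
    format).  Returns a list of sets, one set per connected pool.
    """
    neighbours = defaultdict(set)
    for a, b in games:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    pools = []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk over everyone reachable from `start`.
        pool = set()
        stack = [start]
        while stack:
            player = stack.pop()
            if player in pool:
                continue
            pool.add(player)
            stack.extend(neighbours[player] - pool)
        seen |= pool
        pools.append(pool)
    return pools


# Old programs only play old, new only play new: two unrelated pools.
separate = rating_pools([("Old1", "Old2"), ("New1", "New2")])

# One old-vs-new pairing is enough to join them into a single pool.
joined = rating_pools([("Old1", "Old2"), ("New1", "New2"), ("Old2", "New1")])
```

With the made-up pairings above, `separate` contains two pools and `joined` contains one, which is the reason old programs have to keep meeting new ones.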




>
>I know - you want to play games on me, right?



Not at all...



>
>>Yes, this tends to inflate their absolute ratings at the top of the list.  But
>>the "pool" is valid, and the ratings do tend to reflect results between any two
>>players in the SSDF pool.
>
>
>And they are valid for what variables please?

To predict the match outcome between any two players in the pool, nothing more.
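
As an illustration of what "predict the match outcome" means, here is a small sketch using the standard Elo logistic expectation formula. It assumes the conventional 400-point scale; whether the SSDF list uses exactly this formula is an assumption, and the function names and ratings are made up for the example.

```python
def expected_score(rating_a, rating_b):
    """Expected score per game (win=1, draw=0.5, loss=0) for player A vs B,
    using the standard Elo logistic curve with a 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def expected_match_points(rating_a, rating_b, games):
    """Expected total points for player A over a match of `games` games."""
    return games * expected_score(rating_a, rating_b)


# Equal ratings predict an even match.
even = expected_score(2500, 2500)

# Only the difference matters: knocking 100 points off both ratings
# leaves the prediction unchanged.
before = expected_score(2700, 2500)
after = expected_score(2600, 2400)
```

Because the formula depends only on the rating difference, lowering every number on the list by the same amount changes nothing about the predictions within the pool.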





>
>
>
>>
>>
>>IE if you simply pick _any_ two players on the SSDF list, and compare their
>>ratings, and then play a match between them, their ratings will pretty closely
>>predict the match outcome.  And that is as it should be.
>
>Are you sure? So you can't sleep before you get the new results? That CST version
>1 on PI is weaker than say Fritz 7 on PIII 2.500? Wow!

That is obvious.  But the question is, "how much weaker"?  And "How much weaker
are the programs we didn't test it against?"  That is what the "rating" is all
about.



>
>
>>
>>The ratings will _not_ predict how the programs will do against programs not in
>>the SSDF list, nor against humans with FIDE ratings that come from a completely
>>separate pool of players...
>
>Of course not, but we are still debating the sense or nonsense of SSDF results.
>Please could you answer my questions?
>
>Rolf Tueschen




Last modified: Thu, 15 Apr 21 08:11:13 -0700
