Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: ELO inflation effect ... and SSDF

Author: Rolf Tueschen

Date: 16:05:41 10/10/02

Go up one level in this thread


On October 10, 2002 at 10:01:50, Robert Hyatt wrote:

>On October 09, 2002 at 21:44:41, Rolf Tueschen wrote:
>
>>On October 09, 2002 at 12:59:24, Robert Hyatt wrote:
>>
>>>On October 09, 2002 at 05:54:18, Rolf Tueschen wrote:
>>>
>>>>On October 09, 2002 at 04:52:40, GuyHaworth wrote:
>>>>
>>>>>
>>>>>Totally agreed:  only the differences between the ELO numbers are relevant.
>>>>>
>>>>>I believe there is an inflation effect in the ELO system.  Sadly, investigating
>>>>>this - by theory or simulation - hasn't got to the top of my 'to do' list yet.
>>>>>
>>>>>Anyway, the more games played, the narrower the confidence bands on ELO figures,
>>>>>but the greater the inflation.
>>>>>
>>>>>I believe it was for this reason, or for the sake of credibility, that SSDF
>>>>>knocked back the absolute numbers a couple of years ago.  Maybe they knocked 100
>>>>>points off or something?
>>>>>
>>>>>Other rating systems, like Thompson's for the PCA, maybe do the rating better
>>>>>with less inflation, but they haven't been widely adopted.  Perhaps that's a
>>>>>pity.
>>>>>
>>>>>g
>>>>
>>>>
>>>>In Germany I read an interesting ideas from Detlev Pordzik, aka Elvis, that SSDF
>>>>should lower their values to 250 Elo numbers. So that would reduce the maximum
>>>>numbers to 2500 and something.
>>>>
>>>>Again, what I've written hundreds of times, SSDF could do that but the inherited
>>>>worst error in SSDF is the testing of machines from DIFFERENT pools! Exitus. The
>>>>End.
>>>>
>>>>Rolf Tueschen
>>>
>>>
>>>You lost me.
>>
>>Don't say such things without any emergency in sight!
>>
>>
>>>
>>>The "pool" the SSDF tests is the pool of computer chess programs, and in that
>>>regard, I don't
>>>see where they make any mistakes.  Yes, they play games between current programs
>>>and old
>>>programs.
>>
>>So you can't see any mistakes. Ok. And what is with the control or the constance
>>of the variables? Did you forget that old progs have no learning at all? The
>>differences in books? pppppp?!
>]
>
>But that is _part_ of the system.  If program A learns and program B does not,
>then the
>expected win/loss ratio should favor program A.

Bob, I can't believe it that you are arguing this way. Again, what do they
measure? Elo measured 'strength'! The rest must be held constant. Yes or no?
Elo works with the development of strength in time. One important aspect of the
listing. Did you ever see that e.g. data is taken from Blind chess Exhibitions?
Can't you understand what I'm talking about? Elo is based on tournament chess of
the same species. Red hairs, spectacles, wooden legs are regarded as unimportant
variables. But in computer chess, you can take learning or not learning all in
the same data pool? -- I remind you of the Blind chess data! You get it now?
Excuse me but I find it so strange that you defend the SSDF nonsense.



>
>
>
>>
>>Bob, next week you'll tel me that the handicapped from the Paralympics could
>>well "run" against the US 100 meter athletes! They are from the same pool, no?
>>All human species. <cough>
>
>Yes they could compete.   They would lose, and they would be ranked _below_ the
>non-handicapped folks of course.  As they should be if you want to compare them
>to each other...

You make a huge mistake. Because you can only compare what is "comparable"! You
data must be clean. SSDF is not clean.


>
>
>
>
>
>>
>>I thought that we (at least) would know that it's making no sense if we test
>>several variables free floating at the same time. I mean, what would the results
>>tell us? Or is it of great interest for you to receive statistical values for
>>the obvious? That old is weaker than new? I mean, isn't it nonsense to prove
>>that slow machines are weaker than fast ones?
>
>
>The only testing flaw in the SSDF that I see has two prongs:  (a) too manyh
>games
>between two programs;  (b) not enough games against the _entire_ pool of
>players.

See above...


>
>
>
>
>
>>
>>My God,and you start a debate about inflation? I can't get it into my head
>>what's going on here. Can't you see the ugly consequences if you give your
>>blessing for such apparent nonsense?
>
>The statistics _demand_ that old programs play new to establish ratings.
>Otherwiseyou have two separate pools of players and the ratings don't mean anything across the pools.

Bob, this is getting worse in minutes. Because you want to compare thing you
can't compare you argue that you must play programs of totally different eras.
Which can't be compared - in the variable strength. Isn't that easy to
understand?



>
>
>
>
>>
>>I know - you want to play games on me, right?
>
>
>
>not at all...

I can't believe it. You leave me inconsolable. That can't be true!



>
>
>
>>
>>
>>
>>
>>
>>
>>
>>
>>> Yes this tends to inflate their absolute ratings at the top of the
>>>list.  But the "pool"
>>>is valid, and the ratings do tend to reflect results between any two players in
>>>the SSDF pool.
>>
>>
>>And they are valid for what variables please?
>
>To predict the match outcome between any two players in the pool, nothing more.

So SSDF, the list you once called indispensable, is stating things like at
midday it's 12 o'clock and we eat dinner... Afterwards our weight increased...
Well done!


>
>
>
>
>
>>
>>
>>
>>>
>>>
>>>IE if you simply pick _any_ two players on the SSDF list, and compare their
>>>ratings, and then
>>>play a match between them, their ratings will pretty closely predict the match
>>>outcome.  And
>>>that is as it should be.
>>
>>Ae you sure? So you can't sleep before you get the new results? That CST version
>>1 on PI is weaker than say Fritz 7 on PIII 2.500? Wow!
>
>That is obvious.  But the question is, "how much weaker"?  And "How much weaker
>are the programs we didn't test it against?"  That is what the "rating" is all
>about.


How do you define weakness? Isn't weakness already implied if you don't have
learning tools? And how do you measure the weakness if the machines without
learning get the weakest computers? Could you explain what 'weakness' means in
SSDF?

Rolf Tueschen


>
>
>
>>
>>
>>>
>>>The ratings will _not_ predict how the programs will do against programs not in
>>>the SSDF list,
>>>nor against humans with FIDE ratings that come from a completely separate pool
>>>of players...
>>
>>Of course not, but we are still debating the sense or nonsense of SSDF results.
>>Please could you answer my questions?
>>
>>Rolf Tueschen



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.