Computer Chess Club Archives



Subject: Re: Comments of latest SSDF list - Nine basic questions

Author: Dann Corbit

Date: 18:07:17 05/31/02



On May 31, 2002 at 21:00:44, Rolf Tueschen wrote:

>On May 31, 2002 at 20:35:38, Dann Corbit wrote:
>
>>On May 31, 2002 at 20:24:35, Rolf Tueschen wrote:
>>
>>>On May 31, 2002 at 20:02:37, Dann Corbit wrote:
>>>
>>>>On May 31, 2002 at 19:22:27, Rolf Tueschen wrote:
>>>>
>>>>>On May 31, 2002 at 19:01:53, Dann Corbit wrote:
>>>>>
>>>>>>Since people are so often confused about it, it seems a good idea to write a
>>>>>>FAQ.
>>>>>>Rolf's questions could be added, and a search through the CCC archives could
>>>>>>find some more.
>>>>>>
>>>>>>Certainly the games against the old opponents are always a puzzle to newcomers
>>>>>>who do not understand why calibration against an opponent of precisely known
>>>>>>strength is of great value.
>>>>>
>>>>>
>>>>>No pun intended, but excuse me, you can't mean it this way! Are we caught in a
>>>>>new circle? How can the older program be precisely known in its strength?
>>>>>Of course it isn't! Because it had the same status the new ones have today...
>>>>>
>>>>>And all the answers from Bertil follow that same fallacious line. It's a
>>>>>pity!
>>>>>
>>>>>Also, what is calibration in SSDF? Comparing the new unknown with the old
>>>>>unknown? No pun intended.
>>>>>
>>>>>Before making such a FAQ let's please find some practical solutions for SSDF.
>>>>
>>>>The older programs have been carefully calibrated by playing many hundreds of
>>>>games.  Hence, their strength in relation to each other and to the other members
>>>>of the pool is very precisely known.
>>>>
>>>>The best possible test you can make is to play an unknown program against the
>>>>best known programs.  This will arrive at an accurate ELO score faster than any
>>>>other way.  Programs that are evenly matched are not as good as programs that
>>>>are somewhat mismatched.  Programs that are terribly mismatched are not as good
>>>>as programs that are somewhat mismatched.
>>>>
>>>>If I have two programs of exactly equal ability, it will take a huge number of
>>>>games to get a good reading on their strength in relation to one another.  On
>>>>the other hand, if one program is 1000 ELO better than another, then one or two
>>>>fluke wins will drastically skew the score.  An ELO difference of 100 to 150 is
>>>>probably just about ideal.
>>>
>>>I don't follow that at all. Perhaps it's too difficult, but I fear that you are
>>>mixing things up. You're arguing as if you _knew_ already that the one program
>>>is 1000 points better. Therefore 2 games are ok for you. But how could you know
>>>this in SSDF? And also, why do you test at all, if it's that simple?
>>
>>No.  You have a group of programs of very well known strength.  The ones that
>>have played the most games are the ones where the strength is precisely known.
>
>I can't accept that.

Mathematics cares nothing about your feelings.

>>Here is a little table:
>>
>>Win expectancy for a difference of 0 points is 0.5
>>Win expectancy for a difference of 100 points is 0.359935
>>Win expectancy for a difference of 200 points is 0.240253
>>Win expectancy for a difference of 300 points is 0.15098
>>Win expectancy for a difference of 400 points is 0.0909091
>>Win expectancy for a difference of 500 points is 0.0532402
>>Win expectancy for a difference of 600 points is 0.0306534
>>Win expectancy for a difference of 700 points is 0.0174721
>>Win expectancy for a difference of 800 points is 0.00990099
>>Win expectancy for a difference of 900 points is 0.00559197
>>Win expectancy for a difference of 1000 points is 0.00315231
>>
>>Notice that for a 1000 ELO difference the win expectancy is only 0.3%.
>
>I see. So, that is the Elo calculation for human chess, right? What is
>giving you the confidence that it works for computers the same way?

The math does not care at all about the players.  Human, machine, hybrid,
monkey.
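Those numbers are nothing more than the standard logistic expectancy formula on the usual 400-point scale. A minimal sketch in Python (my own illustration, with a function name of my choosing, not anything from SSDF):

```python
# Expected score for the side trailing by `diff` Elo points,
# using the standard logistic formula E = 1 / (1 + 10^(diff/400)).
def win_expectancy(diff):
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

# Reproduce the table above, from 0 to 1000 points:
for diff in range(0, 1100, 100):
    print(f"{diff:5d}  {win_expectancy(diff):.6f}")
```

Running it reproduces the figures listed above, e.g. 0.090909 at a 400-point gap and 0.003152 at 1000 points.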

>>Therefore, if one thousand games are played between two engines with a 1000 ELO
>>difference, any tiny discrepancy is magnified.  So if, in 1000 games, the weaker
>>program scored 5 points or none instead of the expected 3 (against 997 for the
>>better program), that would be roughly a 100 ELO error!
>>
>>Hence, if the program we are testing against is exactly 1000 ELO worse than the
>>one of known strength, we will have problems with accuracy.  The upshot is that
>>it is a tremendous waste of time to play them against each other because very
>>little information is gleaned.
>
>This is all ok.
>
>
>>
>>On the other hand, when the programs are exactly matched, then the win
>>expectancy is that they will be exactly even.  However, because of randomness
>>this is another area of great trouble.  Imagine a coin toss.  It is unlikely
>>that you will get ten heads in a row, but sometimes it happens.  So with exactly
>>matched programs, random walks can cause big inconsistencies.  Therefore, with
>>evenly matched engines it is hard to get an excellent figure for strengths.
>
>
>I see. Here it goes again. You want to get validity through matching tricks. But
>excuse me again, that won't work! I never heard of such tricky magic.
>You don't have any numbers for strength out of SSDF until now.
>Just a little thought experiment from my side. I would suppose that the better
>program in computer chess will always have a 100% winning chance or expectancy.
>The only thing that will disturb this is chess itself. The better program could
>have bad luck. But it is the better program nevertheless.

Kasparov can lose to a much weaker player.  But against dozens of weaker players
with hundreds of games it is not going to happen.  Similarly for computers.
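That intuition is easy to check with a binomial model (a sketch of mine: the 400-point gap and the 300-game aggregate are illustrative numbers I picked, and draws are ignored for simplicity):

```python
import math

# Per-game expected score for the stronger side at a 400-Elo advantage
# (standard logistic formula; draws ignored for simplicity).
p = 1.0 / (1.0 + 10.0 ** (-400 / 400))   # about 0.909

def prob_score_at_most(k, n, p):
    """Probability that a player with per-game win probability p
    wins at most k of n independent games (binomial lower tail)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

# Chance the stronger side merely breaks even (or worse) over 300 games:
upset = prob_score_at_most(150, 300, p)
print(upset)   # vanishingly small -- far below 1e-50
```

Any single game can be lost, but the aggregate upset probability is astronomically small.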

>Now please take my experiment for a
>moment into your reflections. You see what I mean with the nonsense in SSDF? In
>SSDF you can't differentiate what is strength and what is chess because you have
>no validity. Know what I mean? If yes, could you please explain it to Bertil and
>SSDF?

I am afraid that you aren't going to get it.  I would suggest an elementary
statistics text.

>>On the other, other hand, if the strength differs by 100 ELO or so, a pattern
>>will quickly form.  This is an excellent way to rapidly gain information.
>>
>>>That has nothing to do with correct testing. At first we must  secure that
>>>everyone is treated equally, with equal chances. Each program must have the same
>>>chances to play _exactly_ the same other programs, under the same conditions,
>>>etc.
>>
>>I agree that this is the ideal experimental design.  But we must be very careful
>>not to move the knobs on our experiment or the data becomes worthless.
>
>Apart from the game scores, I'm afraid, the whole SSDF rankings have no meaning
>at all, Dann!

They have meaning for the data set involved under the precise conditions of the
tests.  If they lack meaning for you, that simply means that you do not
understand them.

>>  For
>>instance, you mentioned that the books have been updated, so why don't we use the
>>new books with the old programs?  The reason is that it completely
>>invalidates our old data!  We have changed a significant variable and now we can
>>draw no conclusions whatsoever about the strength of our new combination.  We
>>will have to recalibrate from scratch.  In addition, it would be necessary
>>for an end-user to have both the new and old version of the program in order to
>>replicate the tests.  Furthermore, there are a huge number of possible hybrid
>>combinations.  Who is going to spend the centuries to test them all?
>
>Hopefully nobody! It doesn't make sense. The only calibration is the one with
>human chess. Otherwise your Elo has no meaning in SSDF. The actual calibrating
>is more like homeopathy.

Are you going to spend the millions of dollars needed to pay the GMs to play tens
of thousands of games at 40/2 against computers?

>>>But I must apologize if this sounded as if I wanted to teach you stats. You know
>>>that yourself. No?
>>
>>I'm afraid I don't understand this last sentence at all.
>
>I thought that you studied statistics. So I don't want to teach you.

My degree is in Numerical Analysis.  If you can teach me something about
statistics I will be only too happy to learn it.




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.