Computer Chess Club Archives


Subject: Re: Comments of latest SSDF list - Nine basic questions

Author: Rolf Tueschen

Date: 18:00:44 05/31/02

On May 31, 2002 at 20:35:38, Dann Corbit wrote:

>On May 31, 2002 at 20:24:35, Rolf Tueschen wrote:
>
>>On May 31, 2002 at 20:02:37, Dann Corbit wrote:
>>
>>>On May 31, 2002 at 19:22:27, Rolf Tueschen wrote:
>>>
>>>>On May 31, 2002 at 19:01:53, Dann Corbit wrote:
>>>>
>>>>>Since people are so often confused about it, it seems a good idea to write a
>>>>>FAQ.
>>>>>Rolf's questions could be added, and a search through the CCC archives could
>>>>>find some more.
>>>>>
>>>>>Certainly the games against the old opponents are always a puzzle to newcomers
>>>>>who do not understand why calibration against an opponent of precisely known
>>>>>strength is of great value.
>>>>
>>>>
>>>>No pun intended, but excuse me, you can't mean it this way! Are we caught in a
>>>>new circle? How can the older program's strength be precisely known?
>>>>Of course it isn't! Because it had the same status the new ones have today...
>>>>
>>>>And all the answers from Bertil follow that same fallacious line. It's a
>>>>pity!
>>>>
>>>>Also, what is calibration in SSDF? Comparing the new unknown with the old
>>>>unknown? No pun intended.
>>>>
>>>>Before making such a FAQ let's please find some practical solutions for SSDF.
>>>
>>>The older programs have been carefully calibrated by playing many hundreds of
>>>games.  Hence, their strength in relation to each other and to the other members
>>>of the pool is very precisely known.
>>>
>>>The best possible test you can make is to play an unknown program against the
>>>best known programs.  This will arrive at an accurate ELO score faster than any
>>>other way.  Programs that are evenly matched are not as good as programs that
>>>are somewhat mismatched.  Programs that are terribly mismatched are not as good
>>>as programs that are somewhat mismatched.
>>>
>>>If I have two programs of exactly equal ability, it will take a huge number of
>>>games to get a good reading on their strength in relation to one another.  On
>>>the other hand, if one program is 1000 ELO better than another, then one or two
>>>fluke wins will drastically skew the score.  An ELO difference of 100 to 150 is
>>>probably just about ideal.
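
[Editor's note: the following Python sketch is not part of the original post. It is added only to make the "fluke win" point concrete, under the assumption that the quoted figures use the standard logistic Elo expectancy E = 1 / (1 + 10^(D/400)); it inverts that formula to show how far one extra point out of 100 games moves the implied rating difference.]

import math

def implied_gap(points, games):
    # Elo deficit implied for the weaker side scoring `points` out of `games`,
    # obtained by solving E = 1 / (1 + 10**(D / 400)) for D.
    score = points / games
    return 400.0 * math.log10((1.0 - score) / score)

games = 100
for points in (1, 2, 36, 37):
    print(f"{points:2d}/{games} points -> implied gap of about "
          f"{implied_gap(points, games):.0f} Elo")

# Approximate output:
#  1/100 points -> implied gap of about 798 Elo
#  2/100 points -> implied gap of about 676 Elo
# 36/100 points -> implied gap of about 100 Elo
# 37/100 points -> implied gap of about 92 Elo
# Near a 100-point gap one extra win moves the estimate by roughly 7 Elo;
# near a 1000-point gap (expected score well under one point in 100 games)
# a single fluke win moves it by more than 100 Elo, and a score of zero
# gives no finite estimate at all.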
>>
>>I don't follow that at all. Perhaps it's too difficult, but I fear that you are
>>mixing things up. You're arguing as if you already _knew_ that one program
>>is 1000 points better. Therefore two games are OK for you. But how could you
>>know this in SSDF? And also, why test at all, if it's that simple?
>
>No.  You have a group of programs of very well known strength.  The ones that
>have played the most games are the ones where the strength is precisely known.

I can't accept that.

>
>Here is a little table:
>
>Win expectancy for a difference of 0 points is 0.5
>Win expectancy for a difference of 100 points is 0.359935
>Win expectancy for a difference of 200 points is 0.240253
>Win expectancy for a difference of 300 points is 0.15098
>Win expectancy for a difference of 400 points is 0.0909091
>Win expectancy for a difference of 500 points is 0.0532402
>Win expectancy for a difference of 600 points is 0.0306534
>Win expectancy for a difference of 700 points is 0.0174721
>Win expectancy for a difference of 800 points is 0.00990099
>Win expectancy for a difference of 900 points is 0.00559197
>Win expectancy for a difference of 1000 points is 0.00315231
>
>Notice that for a 1000 ELO difference the win expectancy is only about 0.3%.
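
[Editor's note: the quoted table is the standard logistic Elo win-expectancy formula, E = 1 / (1 + 10^(D/400)), evaluated for the weaker side at 100-point steps. A short illustrative Python sketch, not part of the original post, reproduces it:]

def win_expectancy(diff):
    # Expected score of a player rated `diff` Elo points below the opponent.
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

for diff in range(0, 1001, 100):
    print(f"Win expectancy for a difference of {diff} points "
          f"is {win_expectancy(diff):g}")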

I see. So that is the Elo calculation for human chess, right? What gives you the
confidence that it works the same way for computers?


>Therefore, if one thousand games are played between two engines with a 1000 ELO
>difference, any tiny discrepancy will be multiplied.  So if in 1000 games,
>instead of winning 3 points (as would be expected, against 997 for the better
>program), 5 points or no points were won, it would be about a 100 ELO error!
>
>Hence, if the program we are testing against is exactly 1000 ELO worse than the
>one of known strength, we will have problems with accuracy.  The upshot is that
>it is a tremendous waste of time to play them against each other because very
>little information is gleaned.
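
[Editor's note: an illustrative check of the arithmetic just quoted, not part of the original post, again assuming the logistic Elo formula. 3 points out of 1000 corresponds to a gap of about 1009 Elo and 5 points to about 920 Elo, so a two-point swing is on the order of a 90-100 Elo error, while 0 points gives no finite estimate.]

import math

def implied_gap(points, games):
    # Elo deficit implied by scoring `points` out of `games`,
    # from E = 1 / (1 + 10**(D / 400)) solved for D.
    score = points / games
    return 400.0 * math.log10((1.0 - score) / score)

print(f"3/1000 -> {implied_gap(3, 1000):.1f} Elo")   # about 1008.6
print(f"5/1000 -> {implied_gap(5, 1000):.1f} Elo")   # about 919.5
# 0/1000 would imply an infinite gap, i.e. no usable estimate.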

This is all ok.


>
>On the other hand, when the programs are exactly matched, the expectation
>is that they will score exactly even.  However, because of randomness
>this is another area of great trouble.  Imagine a coin toss.  It is unlikely
>that you will get ten heads in a row, but sometimes it happens.  So with exactly
>matched programs, random walks can cause big inconsistencies.  Therefore, with
>evenly matched engines it is hard to get an excellent figure for strengths.
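
[Editor's note: the coin-toss remark can be checked with a small Monte Carlo sketch; this is an illustration added here, not part of the original post. It estimates how often a run of ten heads shows up somewhere in 1000 fair flips.]

import random

def has_run_of_heads(flips, length=10):
    # True if the sequence contains `length` consecutive heads.
    streak = 0
    for heads in flips:
        streak = streak + 1 if heads else 0
        if streak >= length:
            return True
    return False

random.seed(1)
trials, n_flips = 10000, 1000
hits = sum(
    has_run_of_heads([random.random() < 0.5 for _ in range(n_flips)])
    for _ in range(trials)
)
# With these parameters the estimate usually lands somewhere around 0.4:
# individually unlikely streaks still turn up often over long sequences.
print(f"Estimated P(run of 10 heads in {n_flips} flips) = {hits / trials:.2f}")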


I see. Here it goes again. You want to get validity through matching tricks. But
excuse me another time, that won't work! I have never heard of such tricky magic.
Until now you don't have any numbers for strength out of SSDF.
Just a little thought experiment from my side. I would suppose that the better
program in computer chess will always have a 100% winning chance or expectancy.
The only thing that will disturb this is chess itself. The better program could
have bad luck, but it is the better program nevertheless. Now please take my
experiment into your reflections for a moment. Do you see what I mean with the
nonsense in SSDF? In SSDF you can't differentiate what is strength and what is
chess, because you have no validity. Do you know what I mean? If yes, could you
please explain it to Bertil and SSDF?


>
>On the other, other hand, if the strength differs by 100 ELO or so, a pattern
>will quickly form.  This is an excellent way to rapidly gain information.
>
>>That has nothing to do with correct testing. First we must ensure that
>>everyone is treated equally, with equal chances. Each program must have the same
>>chance to play _exactly_ the same other programs, under the same conditions,
>>etc.
>
>I agree that this is the ideal experimental design.  But we must be very careful
>not to move the knobs on our experiment or the data becomes worthless.

Apart from the game scores, I'm afraid the whole SSDF ranking has no meaning
at all, Dann!



>  For
>instance, you mentioned that the books have been updated, so why don't we use the
>new books with the old programs?  The reason is that it completely
>invalidates our old data!  We have changed a significant variable and now we can
>draw no conclusions whatsoever about the strength of our new combination.  We
>would have to calibrate again from scratch.  In addition, it would be necessary
>for an end-user to have both the new and old version of the program in order to
>replicate the tests.  Furthermore, there are a huge number of possible hybrid
>combinations.  Who is going to spend the centuries to test them all?

Hopefully nobody! It doesn't make sense. The only meaningful calibration is the
one against human chess. Otherwise your Elo has no meaning in SSDF. The current
calibrating is more like homeopathy.


>
>>But I must apologize if this sounded as if I wanted to teach you stats. You know
>>that yourself. No?
>
>I'm afraid I don't understand this last sentence at all.

I thought that you studied statistics, so I don't want to lecture you.

Rolf Tueschen


