Author: Dann Corbit
Date: 18:07:17 05/31/02
On May 31, 2002 at 21:00:44, Rolf Tueschen wrote:

>On May 31, 2002 at 20:35:38, Dann Corbit wrote:
>
>>On May 31, 2002 at 20:24:35, Rolf Tueschen wrote:
>>
>>>On May 31, 2002 at 20:02:37, Dann Corbit wrote:
>>>
>>>>On May 31, 2002 at 19:22:27, Rolf Tueschen wrote:
>>>>
>>>>>On May 31, 2002 at 19:01:53, Dann Corbit wrote:
>>>>>
>>>>>>Since people are so often confused about it, it seems a good idea to
>>>>>>write a FAQ. Rolf's questions could be added, and a search through the
>>>>>>CCC archives could find some more.
>>>>>>
>>>>>>Certainly the games against the old opponents are always a puzzle to
>>>>>>newcomers who do not understand why calibration against an opponent of
>>>>>>precisely known strength is of great value.
>>>>>
>>>>>No pun intended, but excuse me, you can't mean it this way! Are we
>>>>>caught in a new circle? How can the older program's strength be
>>>>>precisely known? Of course it isn't! Because it had the same status the
>>>>>new ones have today...
>>>>>
>>>>>And all the answers from Bertil follow that same fallacious line. It's
>>>>>a pity!
>>>>>
>>>>>Also, what is calibration in SSDF? Comparing the new unknown with the
>>>>>old unknown? No pun intended.
>>>>>
>>>>>Before making such a FAQ, let's please find some practical solutions
>>>>>for SSDF.
>>>>
>>>>The older programs have been carefully calibrated by playing many
>>>>hundreds of games. Hence, their strength in relation to each other and
>>>>to the other members of the pool is very precisely known.
>>>>
>>>>The best possible test you can make is to play an unknown program
>>>>against the best-known programs. This will arrive at an accurate Elo
>>>>score faster than any other way. A program that is evenly matched with
>>>>its opponent is not as useful for testing as one that is somewhat
>>>>mismatched; a terribly mismatched pairing is not as useful either.
>>>>
>>>>If I have two programs of exactly equal ability, it will take a huge
>>>>number of games to get a good reading on their strength in relation to
>>>>one another. On the other hand, if one program is 1000 Elo better than
>>>>another, then one or two fluke wins will drastically skew the score. An
>>>>Elo difference of 100 to 150 is probably just about ideal.
>>>
>>>I don't follow that at all. Perhaps it's too difficult, but I fear that
>>>you are mixing things up. You're arguing as if you _knew_ already that
>>>the one program is 1000 points better. Therefore two games are OK for
>>>you. But how could you know this in SSDF? And also, why do you test at
>>>all, if it's that simple?
>>
>>No. You have a group of programs of very well known strength. The ones
>>that have played the most games are the ones whose strength is most
>>precisely known.
>
>I can't accept that.

Mathematics cares nothing about your feelings.
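To make the numbers concrete: the table quoted below follows from the
standard Elo logistic formula, E = 1 / (1 + 10^(D/400)), where D is how many
points the opponent is rated above you. Here is a minimal Python sketch that
reproduces it (the function name is just illustrative):

    def win_expectancy(diff):
        """Expected score against an opponent rated 'diff' points higher."""
        return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

    # Reproduces the table below: 0 -> 0.5, 100 -> 0.359935, ...,
    # 1000 -> 0.003152.
    for diff in range(0, 1100, 100):
        print(f"Win expectancy for a difference of {diff} points "
              f"is {win_expectancy(diff):.6f}")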
>>Here is a little table:
>>
>>Win expectancy for a difference of    0 points is 0.5
>>Win expectancy for a difference of  100 points is 0.359935
>>Win expectancy for a difference of  200 points is 0.240253
>>Win expectancy for a difference of  300 points is 0.15098
>>Win expectancy for a difference of  400 points is 0.0909091
>>Win expectancy for a difference of  500 points is 0.0532402
>>Win expectancy for a difference of  600 points is 0.0306534
>>Win expectancy for a difference of  700 points is 0.0174721
>>Win expectancy for a difference of  800 points is 0.00990099
>>Win expectancy for a difference of  900 points is 0.00559197
>>Win expectancy for a difference of 1000 points is 0.00315231
>>
>>Notice that for a 1000 Elo difference the win expectancy is only 0.3%.
>
>I see. So that is the Elo calculation for human chess, right? What is
>giving you the confidence that it works for computers the same way?

The math does not care at all about the players. Human, machine, hybrid,
monkey.

>>Therefore, if one thousand games are played between two engines with a
>>1000 Elo difference, any tiny discrepancy will be multiplied. So if in
>>1000 games, instead of winning 3 points (as would be expected, to the 997
>>for the better program), 5 points or no points were won, it would be a
>>100 Elo error!
>>
>>Hence, if the program we are testing against is exactly 1000 Elo worse
>>than the one of known strength, we will have problems with accuracy. The
>>upshot is that it is a tremendous waste of time to play them against each
>>other, because very little information is gleaned.
>
>This is all OK.
>
>>On the other hand, when the programs are exactly matched, the win
>>expectancy is that they will score exactly even. However, because of
>>randomness, this is another area of great trouble. Imagine a coin toss.
>>It is unlikely that you will get ten heads in a row, but sometimes it
>>happens. So with exactly matched programs, random walks can cause big
>>inconsistencies. Therefore, with evenly matched engines it is hard to get
>>an excellent figure for their strengths.
>
>I see. Here it goes again. You want to get validity through matching
>tricks. But excuse me another time, that won't work! I never heard of such
>tricky magic. You don't have any numbers for strength out of SSDF until
>now. Just a little thought experiment from my side. I would suppose that
>the better program in computer chess will always have a 100% winning
>chance or expectancy. The only thing that will disturb this is chess
>itself. The better program could have bad luck. But it is the better
>program nevertheless.

Kasparov can lose to a much weaker player. But against dozens of weaker
players, over hundreds of games, it is not going to happen. Similarly for
computers.

>Now please take my experiment for a moment into your reflections. You see
>what I mean with the nonsense in SSDF? In SSDF you can't differentiate
>what is strength and what is chess, because you have no validity. Know
>what I mean? If yes, will you please explain it to Bertil and SSDF?

I am afraid that you aren't going to get it. I would suggest an elementary
statistics text.

>>On the other, other hand, if the strength differs by 100 Elo or so, a
>>pattern will quickly form. This is an excellent way to rapidly gain
>>information.
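The 100 Elo error claim can be checked by inverting the formula: the gap
implied by a score fraction p is D = 400 * log10(1/p - 1). A sketch,
assuming the pure logistic model with no separate draw term:

    import math

    def implied_gap(points, games):
        """Rating gap implied by scoring 'points' out of 'games' (loser)."""
        p = points / games
        return 400.0 * math.log10(1.0 / p - 1.0)

    for points in (2, 3, 5):
        print(f"{points}/1000 implies a gap of about "
              f"{implied_gap(points, 1000):.0f} Elo")
    # 2/1000 -> ~1079, 3/1000 -> ~1009, 5/1000 -> ~920 Elo: a point or two
    # of score moves the estimate by 70-160 Elo, and 0/1000 is undefined
    # (division by zero), which is why extreme mismatches measure so poorly.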
>>>That has nothing to do with correct testing. First we must ensure that
>>>everyone is treated equally, with equal chances. Each program must have
>>>the same chance to play _exactly_ the same other programs, under the
>>>same conditions, etc.
>>
>>I agree that this is the ideal experimental design. But we must be very
>>careful not to move the knobs on our experiment, or the data becomes
>>worthless.
>
>Apart from the game scores, I'm afraid, the whole SSDF rankings have no
>meaning at all, Dann!

They have meaning for the data set involved, under the precise conditions
of the tests. If they lack meaning for you, that simply means that you do
not understand them.

>>For instance, you mentioned that the books have been updated, so why
>>don't we use the new books with the old programs? The reason is that it
>>would completely invalidate our old data! We would have changed a
>>significant variable, and we could then draw no conclusions whatsoever
>>about the strength of our new combination. We would have to calibrate
>>over again from scratch. In addition, it would be necessary for an end
>>user to have both the new and the old version of the program in order to
>>replicate the tests. Furthermore, there is a huge number of possible
>>hybrid combinations. Who is going to spend the centuries to test them
>>all?
>
>Hopefully nobody! It doesn't make sense. The only valid calibration is the
>one against human chess. Otherwise your Elo has no meaning in SSDF. The
>actual calibrating is more like homeopathy.

Are you going to pay the millions of dollars it would take for GMs to play
tens of thousands of games at 40/2 against computers?

>>>But I must apologize if this sounded as if I wanted to teach you stats.
>>>You know that yourself. No?
>>
>>I'm afraid I don't understand this last sentence at all.
>
>I thought that you studied statistics. So I don't want to be teaching you.

My degree is in Numerical Analysis. If you can teach me something about
statistics, I will be only too happy to learn it.
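On the "centuries" point above, a rough back-of-the-envelope sketch; every
count below is a hypothetical assumption, not an SSDF figure:

    from math import comb

    programs = 40        # hypothetical pool size
    books_each = 3       # hypothetical book versions per program
    games_per_pair = 100 # hypothetical sample needed per pairing
    games_per_day = 20   # hypothetical throughput of the whole test group

    hybrids = programs * books_each
    total_games = comb(hybrids, 2) * games_per_pair
    years = total_games / games_per_day / 365
    print(f"{hybrids} program/book hybrids -> {total_games:,} games, "
          f"roughly {years:.0f} years")
    # 120 hybrids -> 714,000 games, roughly 98 years: all-pairs testing of
    # every hybrid combination is out of reach even with modest assumptions.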