Computer Chess Club Archives



Subject: Re: Comments of latest SSDF list - Nine basic questions

Author: Dann Corbit

Date: 17:39:35 06/03/02

On June 03, 2002 at 19:53:02, Rolf Tueschen wrote:

>On June 03, 2002 at 17:42:53, Dann Corbit wrote:
>
>>On June 01, 2002 at 15:49:24, Rolf Tueschen wrote:
>>
>>>On May 31, 2002 at 21:07:17, Dann Corbit wrote:
>>>
>>>>On May 31, 2002 at 21:00:44, Rolf Tueschen wrote:
>>>>
>>>>>On May 31, 2002 at 20:35:38, Dann Corbit wrote:
>>>>>
>>>>>>On May 31, 2002 at 20:24:35, Rolf Tueschen wrote:
>>>>>>
>>>>>>>On May 31, 2002 at 20:02:37, Dann Corbit wrote:
>>>>>>>
>>>>>>>>On May 31, 2002 at 19:22:27, Rolf Tueschen wrote:
>>>>>>>>
>>>>>>>>>On May 31, 2002 at 19:01:53, Dann Corbit wrote:
>>>>>>>>>
>>>>>>>>>>Since people are so often confused about it, it seems a good idea to write a
>>>>>>>>>>FAQ.
>>>>>>>>>>Rolf's questions could be added, and a search through the CCC archives could
>>>>>>>>>>find some more.
>>>>>>>>>>
>>>>>>>>>>Certainly the games against the old opponents are always a puzzle to newcomers
>>>>>>>>>>who do not understand why calibration against an opponent of precisely known
>>>>>>>>>>strength is of great value.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>No pun intended, but excuse me, you can't mean it this way! Are we caught in a
>>>>>>>>>new circle? How can the older program be precisely known in its strength?
>>>>>>>>>Of course it isn't! Because it had the same status the new ones have today...
>>>>>>>>>
>>>>>>>>>And all the answers from Bertil follow that same fallacious line. It's a
>>>>>>>>>pity!
>>>>>>>>>
>>>>>>>>>Also, what is calibration in SSDF? Comparing the new unknown with the old
>>>>>>>>>unknown? No pun intended.
>>>>>>>>>
>>>>>>>>>Before making such a FAQ let's please find some practical solutions for SSDF.
>>>>>>>>
>>>>>>>>The older programs have been carefully calibrated by playing many hundreds of
>>>>>>>>games.  Hence, their strength in relation to each other and to the other members
>>>>>>>>of the pool is very precisely known.
>>>>>>>>
>>>>>>>>The best possible test you can make is to play an unknown program against the
>>>>>>>>best known programs.  This will arrive at an accurate ELO score faster than any
>>>>>>>>other way.  Programs that are evenly matched are not as good for this purpose as
>>>>>>>>programs that are somewhat mismatched.  Programs that are terribly mismatched are
>>>>>>>>also not as good as programs that are somewhat mismatched.
>>>>>>>>
>>>>>>>>If I have two programs of exactly equal ability, it will take a huge number of
>>>>>>>>games to get a good reading on their strength in relation to one another.  On
>>>>>>>>the other hand, if one program is 1000 ELO better than another, then one or two
>>>>>>>>fluke wins will drastically skew the score.  An ELO difference of 100 to 150 is
>>>>>>>>probably just about ideal.
>>>>>>>
>>>>>>>I don't follow that at all. Perhaps it's too difficult, but I fear that you are
>>>>>>>mixing things up. You're arguing as if you _knew_ already that the one program
>>>>>>>is 1000 points better. Therefore 2 games are ok for you. But how could you know
>>>>>>>this in SSDF? And also, why do you test at all, if it's that simple?
>>>>>>
>>>>>>No.  You have a group of programs of very well known strength.  The ones that
>>>>>>have played the most games are the ones where the strength is precisely known.
>>>>>
>>>>>I can't accept that.
>>>>
>>>>Mathematics cares nothing about your feelings.
>>>
>>>Dann Corbit, will you please realize that maths won't fill the validity hole!
>>>;-)
>>>
>>>>
>>>>>>Here is a little table:
>>>>>>
>>>>>>Win expectancy for a difference of 0 points is 0.5
>>>>>>Win expectancy for a difference of 100 points is 0.359935
>>>>>>Win expectancy for a difference of 200 points is 0.240253
>>>>>>Win expectancy for a difference of 300 points is 0.15098
>>>>>>Win expectancy for a difference of 400 points is 0.0909091
>>>>>>Win expectancy for a difference of 500 points is 0.0532402
>>>>>>Win expectancy for a difference of 600 points is 0.0306534
>>>>>>Win expectancy for a difference of 700 points is 0.0174721
>>>>>>Win expectancy for a difference of 800 points is 0.00990099
>>>>>>Win expectancy for a difference of 900 points is 0.00559197
>>>>>>Win expectancy for a difference of 1000 points is 0.00315231
>>>>>>
>>>>>>Notice that for a 1000 ELO difference the win expectancy is only 0.3%.
>>>>>
>>>>>I see. So, that is Elo's calculation for human chess, right? What is
>>>>>giving you the confidence that it works for computers the same way?
>>>>
>>>>The math does not care at all about the players.  Human, machine, hybrid,
>>>>monkey.
>>>
>>>Sure? And what if strength does _not_ follow the so-called normal
>>>distribution for machines, hybrids and an awful lot of monkeys? Maths is one thing
>>>and reflection (whether and how the maths should be applied) is another.
>>>
>>>
>>>>
>>>>>>Therefore, if one thousand games are played between two engines with 1000 ELO
>>>>>>difference, any tiny discrepancy will be multiplied.  So if in 1000 games the
>>>>>>weaker program won 5 points or no points, instead of the expected 3 (against
>>>>>>997 for the better program), it would be a 100 ELO error!
>>>>>>
>>>>>>Hence, if the program we are testing against is exactly 1000 ELO worse than the
>>>>>>one of known strength, we will have problems with accuracy.  The upshot is that
>>>>>>it is a tremendous waste of time to play them against each other because very
>>>>>>little information is gleaned.
>>>>>
>>>>>This is all ok.
>>>>>
>>>>>
>>>>>>
>>>>>>On the other hand, when the programs are exactly matched, then the win
>>>>>>expectancy is that they will be exactly even.  However, because of randomness
>>>>>>this is another area of great trouble.  Imagine a coin toss.  It is unlikely
>>>>>>that you will get ten heads in a row, but sometimes it happens.  So with exactly
>>>>>>matched programs, random walks can cause big inconsistencies.  Therefore, with
>>>>>>evenly matched engines it is hard to get an excellent figure for strengths.
>>>>>
>>>>>
>>>>>I see. Here it goes again. You want to get validity through matching tricks. But
>>>>>excuse me again, that won't work! I never heard of such tricky magic.
>>>>>You don't have any numbers for strength out of the SSDF so far.
>>>>>Just a little thought experiment from my side. I would suppose that the better program
>>>>>in computer chess will always have a 100% winning chance or expectancy. The only
>>>>>thing that'll disturb this is chess itself. The better prog could have bad luck.
>>>>>But it is the better prog nevertheless.
>>>>
>>>>Kasparov can lose to a much weaker player.  But against dozens of weaker players
>>>>with hundreds of games it is not going to happen.  Similarly for computers.
>>>
>>>
>>>Sure? And what if the behaviour of machines is deterministic?
>>>
>>>21, 22, 23... Bingo!
>>>
>>>That is why I'm talking about the fallacies in SSDF! It's uninteresting. In
>>>human chess, however, we could still wait for the exceptions. But Kasparov still
>>>won't lose against a _much_ weaker human player. Against a machine, yes,
>>>perhaps. :)
>>
>>If the behaviour of machines were perfectly deterministic, then they would
>>always play the same games, over and over.  Program 'A' (if it won a single
>>time) would win *every* game against program 'B' because they would be in a
>>lock-step identical dance on every game.
>
>You must not expect identity on each appearance when determinism exists. It
>is chess. And even with learning it still remains deterministic. Ok, Bob once
>told me that DB2 was not deterministic because you could not repeat a certain
>move because of the multi-parallelism, but it's very probable that it would
>still be deterministic - only we had no time to do our research on the machine.
>:((
>
>
>
>>
>>To solve this problem, computer programmers introduce randomness.
>>
>>Usually, it starts like this...
>>
>>...
>>srand((unsigned)time(NULL));
>>...
>>
>>The above call uses the system clock to get a new seed for the random number
>>generator.  This means that a different sequence of random numbers will be used
>>each time someone runs the program (actually, probably only 4 billion different
>>sequences, but the hard drives will run out long before they are exhausted).
>>
>>Then, as the program plays, decisions will be made with calls to the rand()
>>function, with probability weights assigned as a function of goodness.
>>Hence, from a group of (perhaps) the top 5 moves, the one that looks best will
>>happen 75% of the time, but the others will happen sometimes too.  Now, as we go
>>from move to move, if the other moves have even a 5% chance of happening, then
>>the chance of repeating a long game is basically zero.
>
>And you mean that then determinism couldn't exist?

No.  I mean that the program is not going to play the same way each time.  There
will be patterns and deficits and all sorts of interesting quirks with its play.
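
To make the idea concrete, here is a tiny sketch in C of the weighted choice
described in the quoted text above.  The moves, the weights and the fixed
five-entry table are invented for illustration only; no claim is made that
any real engine does it exactly this way.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical candidate moves with probability weights as a function of
   "goodness".  The move that looks best gets 75% of the mass; the other
   four share the remaining 25%. */
struct candidate {
    const char *move;
    double weight;
};

int main(void)
{
    struct candidate top5[5] = {
        {"e2e4", 0.75}, {"d2d4", 0.10}, {"c2c4", 0.07},
        {"g1f3", 0.05}, {"b1c3", 0.03}
    };
    double r, sum = 0.0;
    int i;

    srand((unsigned)time(NULL));    /* new seed on every run, as described  */
    r = rand() / (RAND_MAX + 1.0);  /* uniform number in [0,1)              */

    for (i = 0; i < 5; i++) {
        sum += top5[i].weight;      /* walk the cumulative distribution     */
        if (r < sum)
            break;
    }
    if (i == 5)
        i = 4;                      /* guard against rounding at the edge   */
    printf("chosen move: %s\n", top5[i].move);
    return 0;
}

Run it a few times and the chosen move changes from run to run, which is the
whole point: even a deterministic search wrapped in this kind of choice will
not repeat the same game over and over.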

>>In addition, computer chess programs often learn as they play.  Not smart
>>learning of principles like humans do, but rote memorization of mistakes.  So
>>they won't make the same mistakes over and over.  That is one reason why GMs'
>>access to computers won't mean instant destruction for the computers.
>
>Depends on what you call a mistake. The minute GMs developed
>a really deep anti-computer strategy, the bells would be ringing for
>deterministic computer chess.
>As I wrote, not because of some opening lines, but because of the lack of
>understanding of deeper strategies.
>And you will agree that you can't just add a few learning features. Because this
>would be a contradiction.
>But don't worry, GMs won't come running to participate.

GMs' performances against computers have recently been rather disappointing.
Quite frankly, I am not sure who is exposing the weaknesses of the other more.

>>
>>>>>Now please take my experiment into your reflections for a
>>>>>moment. You see what I mean about the nonsense in the SSDF? In the
>>>>>SSDF you can't differentiate what is strength and what is chess because you have
>>>>>no validity. Know what I mean? If yes, will you please explain it to Bertil and
>>>>>the SSDF?
>>>>
>>>>I am afraid that you aren't going to get it.  I would suggest an elementary
>>>>statistics text.
>>>
>>>Ok, we can stop it this instant. Just say the word, please. You have the
>>>power! But until then I'll claim that you haven't got what I'm talking about
>>>regarding validity! Either tell me please where you see validity in the SSDF or
>>>please stop the continual applause for the SSDF. It's not justified.
>>
>>The model is mathematically valid.  You don't understand that.  Fine.  A model
>>is valid if it can predict outcomes.  In fact, the model accurately predicts
>>outcomes on a broad basis.  In other words, if one program is 100 ELO above some
>>other programs, it will win about 64% of the points in a very long match against
>>the weaker groups.  For this purpose, the SSDF results are (in fact) most
>>excellent.
>
>But what is excellent in your eyes? And where is a program 100 points higher
>than another in the SSDF? I fear I do not get it.
>The 64% is _not_ saying anything. And you can never say +100 in SSDF. Guess why?
>(Calibration!)

The SSDF data is carefully calibrated against itself.  There is no other meaning
to calibration in the ELO system.  You simply do not understand how it works.
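
To be concrete about what "calibrated against itself" means: ratings in such
a pool come from nothing more than repeated application of the standard Elo
update after each result.  A minimal sketch follows; the K-factor of 16 and
the two ratings are illustrative choices of mine, not the SSDF's actual
parameters.

#include <stdio.h>
#include <math.h>

/* Expected score of a player rated r against an opponent rated r_opp. */
static double expected(double r, double r_opp)
{
    return 1.0 / (1.0 + pow(10.0, (r_opp - r) / 400.0));
}

int main(void)
{
    double k = 16.0;            /* illustrative K-factor                    */
    double a = 2500.0, b = 2400.0;
    double score_a = 1.0;       /* suppose A wins one game against B        */
    double e = expected(a, b);  /* about 0.64 for a 100 point difference    */

    a += k * (score_a - e);                 /* A gains less than K: the win */
    b += k * ((1.0 - score_a) - (1.0 - e)); /* was already the likely result */

    printf("new ratings: A = %.1f, B = %.1f\n", a, b);
    return 0;
}

Nothing in that update ever refers to a rating outside the pool, which is
why the absolute numbers float freely while the differences are pinned down
by the results.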

>>The SSDF results do not predict how the machines will do against
>>people.
>
>And why? *g*
>Because the results are not valid, they have no meaning.

If you think that the results are invalid, then you don't understand what the
SSDF is or what they are trying to do.  The math is (without question) far
better than most attempts at this sort of thing and the data is much more robust
than (for instance) that of FIDE.  The error bars are dependable and the results
will be repeatable.  I think perhaps your difficulty with the data is that they
do not represent what you would like them to mean.  That is not an error in the
experiment but (rather) a design choice.  Of course, you can always make your
own experiment and gather the dozens of volunteers needed to complete the
experiments.

>> On the other hand, we can deduce that there will be *some* sort of
>>correlation between computer/computer strength and computer/human strength.
>
>Are you sure, or are you just talking about the actual situation with the show
>events?
>Dann, I beg you, this is all bogus. It has no meaning _at all_. I mean, are
>we talking about PR or about what really could be said?

It has no meaning to those who do not understand the data.  It has excellent
meaning to those who do understand the data.

>>Unfortunately, we cannot say what that correlation is without experimentation
>>and data.  We can only guess.
>
>Yes, and show events are no experiments.
>
>
>>
>>The SSDF list produces this:
>>If you play a sequence of computer programs taken from the SSDF list under the
>>exact conditions of the SSDF matches, you will get similar behavior.
>
>If you are saying that we would get the same irrelevant data, then I can agree
>without effort.

It is irrelevant if it is used to try to show something that is not inferred by
the data.  I would not use it for that purpose.  For my needs, the data is
highly relevant and interesting.

>>If anyone thinks it produces more than that, they are mistaken.  We can suppose
>>that correlations of computer strength translate to games against humans, but
>>that is an untested hypothesis.
>
>And that the rank 1 on the SSDF is a real rank number 1 is a fallacy, yes; we have
>known that since I wrote about it.

It means exactly what the data says.  Your problem with the data is you want it
to mean something else (I think) than it really does.  Therefore, you do not
like the results.  The data does not show [for instance] which program will be
strongest against a GM or even how the programs will run on any sort of system
besides the exact computers used and besides the exact conditions of play
(auto-232).

>>
>>>
>>>>
>>>>>>On the other, other hand, if the strength differs by 100 ELO or so, a pattern
>>>>>>will quickly form.  This is an excellent way to rapidly gain information.
>>>>>>
>>>>>>>That has nothing to do with correct testing. First we must ensure that
>>>>>>>everyone is treated equally, with equal chances. Each program must have the same
>>>>>>>chances to play _exactly_ the same other programs, under the same conditions,
>>>>>>>etc.
>>>>>>
>>>>>>I agree that this is the ideal experimental design.  But we must be very careful
>>>>>>not to move the knobs on our experiment or the data becomes worthless.
>>>>>
>>>>>Apart from the game scores, I'm afraid, the whole SSDF rankings have no meaning
>>>>>at all, Dann!
>>>>
>>>>They have meaning for the data set involved under the precise conditions of the
>>>>tests.  If they lack meaning for you that simply means that you do not
>>>>understand it.
>>>
>>>Again, this is not _my_ invention. Without validity you have nothing at all but
>>>fine rankings - yes, they look nice.
>>
>>You do not understand what you are saying.  Or the mathematics escapes you.  In
>>any case, there are no difficulties with the validity of the SSDF data beyond
>>what is normally seen in experimental setups.
>
>Really, Dann. Did you ever begin such a statistical experiment? What was
>the first thing you did??

I have been published in scientific papers, if that is what you are asking.

>Statistics isn't just the final calculation! So you have two possibilities. a)
>you did it already - then you necessarily know what you did before the
>beginning. Uhmm, I'm not talking about living room championships,
>I'm talking about statistics. Also I'm not talking about the final results of
>statistics in newspapers. (etc.)

There are lies, damn lies, and statistics.  The big problem with statistics is
that 99% of the world's population has no idea what they might possibly mean.
Therefore, when they see them, they draw all sorts of incorrect conclusions from
them.

>>>>>>  For
>>>>>>instance, you mentioned that the books have been updated so why don't we use the
>>>>>>new books with the old programs?  The reason is that it completely
>>>>>>invalidates our old data!  We have changed a significant variable and now we can
>>>>>>draw no conclusions whatsoever about the strength of our new combination.  We
>>>>>>will have to recalibrate from scratch.  In addition, it would be necessary
>>>>>>for an end-user to have both the new and old version of the program in order to
>>>>>>replicate the tests.  Furthermore, there are a huge number of possible hybrid
>>>>>>combinations.  Who is going to spend the centuries to test them all?
>>>>>
>>>>>Hopefully nobody! It doesn't make sense. The only calibration is the one with
>>>>>human chess. Otherwise your Elo has no meaning in the SSDF. The actual calibration
>>>>>is more like homeopathy.
>>>>
>>>>Are you going to pay the millions of dollars to pay the GM's to play tens of
>>>>thousands of games at 40/2 against computers?
>>>
>>>This is, excuse me, nonsense. BTW the companies will do what they can, but they
>>>won't pay for the revelation that their progs are just average masters, neither
>>>IM nor GM. (Please read my contribution to Andrew Dados' joke. I'm talking
>>>about a real group fight of GMs vs. computers to develop real anti-computer chess, not
>>>just a few cooked lines.)
>>
>>This is an interesting hypothesis and it certainly has merit.  However, there is
>>no connection whatever to my statement.  If they *DID* run the experiments then
>>we would get good data.  I make no statements, guesses or extrapolations as to
>>whether anyone ever will actually do it.
>
>Of course not. That won't happen in our life time.
>
>>
>>>But here is how I would advise SSDF to proceed. Invite a good human expert with
>>>comp experience. He has a defined Elo. Let him play the new progs, but I'm
>>>talking about the "hard" version of play.
>>
>>Why only anti-comp experts and not a broad field?  Don't you think this will
>>skew the result?
>
>Excuse me, yes. But we must define skew. It would skew the skewed results, yes.
>:)
>What the SSDF is getting, that is skewed although it looks so nice. That is what
>I'm trying to explain.
>I say that if we could carry out my thought experiment, we would get 2350 and the
>SSDF would still dream of 2650. Now, what is skewed? You must decide.

Certainly not.  There is no connection between the experiments unless you create
one.  Therefore, the absolute numbers have no meaning.  That is the very nature
of the ELO system and if you do not understand that then you do not know how the
ELO system works.  For instance, if you understood the ELO system, then you
would immediately understand that these two lists are identical:


    Program            Elo    +   -   Games   Score   Av.Op.  Draws

  1 LG2000V3         : 2589   97 197    31    77.4 %   2375    6.5 %
  2 Yace 0.99.50     : 2586   31 104   188    86.2 %   2268   12.8 %
  3 MAD-005          : 2583   37 110   145    83.1 %   2306   10.3 %
  4 Crafty-18.10     : 2580  102 138    29    75.9 %   2381   27.6 %
  5 Comet-B37        : 2554  103 129    31    71.0 %   2398   25.8 %
  6 TCBishop-4601    : 2542  103 165    30    73.3 %   2367   13.3 %
  7 Gromit3          : 2527   99 100    36    66.7 %   2407   33.3 %
  8 Nejmet-260       : 2526  109 163    28    71.4 %   2367   14.3 %
  9 Phalanx-xxii     : 2522  102 153    33    68.2 %   2390    9.1 %
 10 AnMon-509        : 2518   33  83   195    81.0 %   2266   14.4 %
 11 Amy-07           : 2507   36 106   157    82.5 %   2238    9.6 %
 12 TCBishop-0045    : 2503  112 130    29    67.2 %   2378   24.1 %
 13 AnMon-510        : 2497   36  98   162    81.5 %   2240   11.1 %
 14 ZChess-222       : 2492   31  75   230    78.9 %   2263   12.6 %
 15 GLC-213          : 2469  112 120    33    60.6 %   2395   18.2 %
 16 ZChess-120       : 2452   35  70   194    74.7 %   2264   14.4 %
 17 Gromit2          : 2434  125  81    32    53.1 %   2412   43.8 %
 18 Pepito-121       : 2432  126 117    28    58.9 %   2369   25.0 %
 19 Ant-606          : 2429  110 110    37    56.8 %   2382   16.2 %
 20 FranWB-090       : 2427   36  63   202    71.5 %   2267   15.3 %

    Program            Elo    +   -   Games   Score   Av.Op.  Draws

  1 LG2000V3         :  289   97 197    31    77.4 %     75    6.5 %
  2 Yace 0.99.50     :  286   31 104   188    86.2 %    -32   12.8 %
  3 MAD-005          :  283   37 110   145    83.1 %      6   10.3 %
  4 Crafty-18.10     :  280  102 138    29    75.9 %     81   27.6 %
  5 Comet-B37        :  254  103 129    31    71.0 %     98   25.8 %
  6 TCBishop-4601    :  242  103 165    30    73.3 %     67   13.3 %
  7 Gromit3          :  227   99 100    36    66.7 %    107   33.3 %
  8 Nejmet-260       :  226  109 163    28    71.4 %     67   14.3 %
  9 Phalanx-xxii     :  222  102 153    33    68.2 %     90    9.1 %
 10 AnMon-509        :  218   33  83   195    81.0 %    -34   14.4 %
 11 Amy-07           :  207   36 106   157    82.5 %    -62    9.6 %
 12 TCBishop-0045    :  203  112 130    29    67.2 %     78   24.1 %
 13 AnMon-510        :  197   36  98   162    81.5 %    -60   11.1 %
 14 ZChess-222       :  192   31  75   230    78.9 %    -37   12.6 %
 15 GLC-213          :  169  112 120    33    60.6 %     95   18.2 %
 16 ZChess-120       :  152   35  70   194    74.7 %    -36   14.4 %
 17 Gromit2          :  134  125  81    32    53.1 %    112   43.8 %
 18 Pepito-121       :  132  126 117    28    58.9 %     69   25.0 %
 19 Ant-606          :  129  110 110    37    56.8 %     82   16.2 %
 20 FranWB-090       :  127   36  63   202    71.5 %    -33   15.3 %

Notice (however) that the highest ELO is 2589 in the first list and 289 in the
second.  Yet this is irrelevant.  The only things that matter are the
differences.
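
If you prefer numbers to words, here is a tiny sketch using the same
win-expectancy formula as the table quoted earlier in this thread.  The two
ratings are taken from the first list above; the point is only that shifting
every rating in a pool by the same constant changes nothing.

#include <stdio.h>
#include <math.h>

/* Win expectancy of the lower-rated side for a rating difference d. */
static double win_expectancy(double d)
{
    return 1.0 / (1.0 + pow(10.0, d / 400.0));
}

int main(void)
{
    double a = 2589.0, b = 2427.0;  /* ratings of entries 1 and 20 above   */
    double shift = -2300.0;         /* the shift between the two lists     */

    printf("expectancy, original ratings: %f\n", win_expectancy(a - b));
    printf("expectancy, shifted ratings : %f\n",
           win_expectancy((a + shift) - (b + shift)));
    return 0;
}

Both lines print the same number, which is exactly why the 2589 list and the
289 list carry the same information.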

>>>Not PR bogus! Then we would have their Elo, plus or
>>>minus something. So, all players all over Sweden are invited to take part.
>>>Comp vs comp is bogus and won't produce valid Elo numbers, BTW no matter how
>>>much genius you might put into your broadness debate about pools...
>>
>>They produce perfectly valid comp/comp numbers.  They do not produce comp/human
>>correlations.
>
>No, they do _not_. Take their 2650. What does it mean? Did you forget the problem of
>calibrating?

It means that if you take identical machines and identical programs and run them
under identical circumstances, you will get similar results (within the confines
of experimental uncertainty).  What else would it mean?  The number 2650 is as
good or bad as any other.  Draw a number out of a hat and use it if you like.
Do you expect it to correspond to FIDE or USCF or BCF or FICS or any other
rating system that ranks a different pool of players?  It would be silly to
think so.  {Of course, there is probably some connection between strength in
different pools, but experiments would have to be carried out to find out what
the correlation is [if any]}

>>>>
>>>>>>>But I must apologize if this sounded as if I wanted to teach you stats. You know
>>>>>>>that yourself. No?
>>>>>>
>>>>>>I'm afraid I don't understand this last sentence at all.
>>>>>
>>>>>I thought that you studied statistics. So I don't want to teach you.
>>>>
>>>>My degree is in Numerical Analysis.  If you can teach me something about
>>>>statistics I will be only too happy to learn it.
>>>
>>>I already tried. I tried to explain how important the reflection is prior to
>>>the mere calculating in stats. I'm talking about statistical design. If you
>>>begin there with nonsense, then the best maths (and maths itself is always GOOD!) won't
>>>help you out of the mess. That is the important thing to understand if you want
>>>to learn something about stats. The collection of hundreds of formulas is not
>>>really important. But most people confuse the topics. Once someone wrote
>>>to me in a posting: but maths is maths, and if it works in human chess then
>>>it also works in computer chess. Well, this is simply wrong. Of course
>>>it works, but what is the meaning of the results? That is the question.
>>>
>>>Please do not take me for arrogant if it sounds that way. For me this is so trivial
>>>that I can't explain it in the didactically best way.
>>
>>I find the discussions with you both interesting and fruitful.
>
>Me too. But I honestly fear each time that I might be too saucy. Then please
>think of my limited vocabulary.

If I get offended because you are too saucy, then that would be silly of me.  I
tend to get saucy myself.  There is an expression, "He can dish it out but he
can't take it," and it isn't a flattering one.  I should hope it would not be
applied to me.


