Computer Chess Club Archives

Subject: Re: Is the SSDF taking a break from testing?

Author: Sune Fischer
Date: 17:38:04 07/15/05

On July 15, 2005 at 20:07:46, Dann Corbit wrote:

>On July 15, 2005 at 19:27:56, Sune Fischer wrote:
>
>>On July 15, 2005 at 18:29:28, Dann Corbit wrote:
>>>
>>>The time difference is really about 4:1
>>
>>Yes but they use faster machines.
>>
>>Anyway the point is simply that what we call quality today is what we will call
>>crap in 3 years. Just like what we called quality 3 years ago is what we call
>>crap today.
>>
>>Am I the only one who can see how ridiculous that is?
>
>In 1905 you would have been happy with a car that had a top speed of 25 MPH.
>Today you won't.  To me, it is like saying, "There is no need to make that any
>more beautiful because it is already beautiful enough."

I fear you're missing my point.
I want longer games too, but first and foremost I want enough games to make an
accurate rating list; without that we have nothing worthwhile anyway.

And if that can be achieved by playing 2 or 4 times faster, then that is what
should be done.
It seems folks insist on pushing the time control to such lengths that making
accurate rating lists isn't possible.
It is silly to sacrifice accuracy of rating for something so elusive as
"quality".
People will never be happy with the quality anyway, so why chase this ghost?
Let's make it accurate instead, we can do that at least.
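
To put rough numbers on the trade-off above: in the same wall-clock time, playing 4 times faster gives about 4 times as many games, and the error bar shrinks roughly with the square root of the game count. The following is only a minimal sketch under assumed values (two equal engines, a 35% draw rate, a normal approximation); elo_margin and its parameters are made up for illustration and are not how any rating list actually computes its margins.

```python
import math

def elo_from_score(p):
    """Elo difference implied by an expected score p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_margin(n_games, score=0.5, draw_ratio=0.35, z=1.96):
    """Approximate half-width, in Elo, of a 95% interval after n_games.

    score and draw_ratio are assumed values, not taken from any rating list.
    """
    win = score - draw_ratio / 2.0
    loss = 1.0 - win - draw_ratio
    # Variance of a single game result (1, 0.5 or 0) around the mean score.
    var = (win * (1.0 - score) ** 2
           + draw_ratio * (0.5 - score) ** 2
           + loss * (0.0 - score) ** 2)
    se = math.sqrt(var / n_games)
    upper = elo_from_score(min(score + z * se, 0.999))
    lower = elo_from_score(max(score - z * se, 0.001))
    return (upper - lower) / 2.0

if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, "games: about +/-", round(elo_margin(n)), "Elo")
```

Under these assumptions each fourfold increase in games roughly halves the margin, which is the whole argument for faster time controls in a nutshell.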

>No matter how beautiful it is, it can always become more beautiful.  And I will
>like it better when it becomes such.  Eventually, I suppose, I will no longer be
>able to appreciate the moves.  Then it will be time to look at something else, I
>suppose.
>
>>>>How many people actually go over these tons and tons of automated games anyway?
>>>
>>>Not many.  I also find the games between the best programs very useful for book
>>>building.  I would not trust the CEGT games for that purpose.
>>
>>I would prefer to use GM games, still.
>>Needs another few years for "the quality" to be there ;)
>
>Mostly, the opening books for the computer programs already came from there.
>And I think that at 40/2 with ponder on, on a 1 GHz machine, the computer will make fewer
>mistakes than any GM.  Yes, they will miss some brilliant positional moves.  But
>on average, they will make excellent choices.  Actually, I think a mix of SSDF +
>Correspondence + OTB makes the best books (as far as auto-generated books).  The
>"real" best books will be made by experts.
>
>>>>In order to construct a usable and interesting rating list priority number 1 is
>>>>to have enough games for a reliable rating, otherwise it _is_ going to be
>>>>statistical garbage.
>>>
>>>Controlling the environment of the test so that it is reproducible is probably
>>>in the same range of importance as the large number of games.
>>
>>I guess saving the logs should be enough, who is going to reproduce a long
>>tournament anyway? :)
>>
>>But actually I agree with you, which is why I _don't_ like that the SSDF use
>>books and learning. Fixed start positions give full control every single time.
>>
>>>The AEGT and CEGT games do not seem to be held at a consistent time control.
>>
>>That's not so good obviously, but probably the price you have to pay when making
>>a rating list in a big distributed manner.
>
>It may be that the different time controls are really what is wanted, if the
>machines have been calibrated to some certain number of nodes or something.  At
>any rate, I think the CEGT stuff is very good data.

That was going to be my defense, but then I noted some games had 0 increment and
others not.

>>>The AEGT and CEGT contests assume the NUNN positions as openings, and so they do
>>>not exercise the opening book of the program being tested.  That is fine to
>>>measure engine strength, but it will not tell you about book+program and it will
>>>not help you to prepare for that opponent (if it is a goal).
>>
>>People tend to make their own books and in tournaments the authors always use
>>special (handcrafted) books, so in general it's a good idea to keep engine and
>>book separated when measuring.
>
>Depends on what you want to measure.  If it is engine strength, then I agree
>with you.  If it is system strength, then I think the results will be wrong.

Well that's true. :)

>>>The older programs have more games against them and therefore are more accurate
>>>as measuring tools.  But a lot of people get hot under the collar about running
>>>games on 450 MHz computers when they do that.
>>
>>It doesn't matter if you have stable engines. What you do is run elostat on
>>the whole database every time, so ratings will automatically rescale.
>>
>>At least it seems foolish to play with an old engine if a newer version has
>>been released. Remember you will still be playing with the old engine indirectly
>>when you play against other engines that have played against it.
>
>If I have ten games with a new engine of 3000 Elo and I have 10,000 games with
>an old engine of 2300 Elo, the old engine will give me much better data by
>playing against it than against the new one.  The new engine will have huge
>error bars in the confidence interval, and these must necessarily translate to
>the engine for which the calibrated engine is used as a reference.  I exaggerate
>the numbers to make the meaning obvious, but the message is plain enough -- you
>get better numbers from the measuring sticks with the finest graduations on
>them.

Here you simply end up with an ever-growing list of engines you have to keep
playing against, and because they play more and more games it will be harder and
harder to take them out. How do you break this circle?

Take out the old engines and just play on against their opponents; there must be
many of those.
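
The measuring-stick argument quoted above can be sketched numerically. If a new engine's rating is derived from a match against a single reference engine, and the two error sources are treated as independent (an assumption; a joint fit over the whole database, as elostat does, is more involved), the margins combine roughly in quadrature. The function name and the numbers below are hypothetical.

```python
import math

def combined_margin(match_margin, reference_margin):
    """Rough error bar on the tested engine, assuming independent errors."""
    return math.sqrt(match_margin ** 2 + reference_margin ** 2)

if __name__ == "__main__":
    # Same hypothetical match margin, two hypothetical reference engines.
    print(round(combined_margin(30, 15), 1))   # well-measured reference
    print(round(combined_margin(30, 130), 1))  # poorly measured reference
```

With a poorly measured reference, the reference's own margin dominates the result no matter how many games the new match contains.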

>>>There is some inconsistent naming of the program names in CEGT and AEGT,
>>
>>Such as?
>
>AnMon 5.50                : 2556   81 114    38    47.4 %   2575   36.8 %
>AnMon5.50                 : 2543   17  18  1088    41.2 %   2605   31.6 %
>
>GLC 3.01.2.2              : 2490   41  30   265    52.5 %   2473   35.5 %
>GLChess 3.01.2.2          : 2561  128 219    13    46.2 %   2588   46.2 %
>GLChess 3.0122            : 2534   86 114    38    47.4 %   2552   31.6 %
>Green Light Chess 3.01.2.2: 2561   24  32   423    46.0 %   2589   37.6 %
>Green Light Chess 301.2.2 : 2547  126 126    22    50.0 %   2547   36.4 %
>GreenLightChess 30122     : 2547  210 210    11    50.0 %   2547   27.3 %
>
>Knight Dreamer 3.3        : 2372  102  73    51    24.5 %   2568   29.4 %
>KnightDreamer 3.3         : 2367   31  23   497    50.8 %   2361   30.0 %
>
>Naum 1.8-b1               : 2609  110  65    38    55.3 %   2572   52.6 %
>Naum 1.8b1                : 2578   95  70    51    53.9 %   2551   37.3 %
>
>Pepito 1.59               : 2574   93  93    38    50.0 %   2574   36.8 %
>Pepito v1.59              : 2434   32  27   417    52.9 %   2414   28.1 %
>
>Ruffian 2.1.0             : 2644   16  12  1715    52.3 %   2628   34.2 %
>Ruffian 2.10              : 2573  147 453     5    30.0 %   2720   60.0 %
>
>There are several others I am not sure of like this one:
>Shredder 9                : 2756   10  14  2666    69.1 %   2616   29.4 %
>Shredder 9 UCI            : 2713  128 104    26    61.5 %   2631   38.5 %

Hmm, annoying. Consistent naming should be possible, one would think.
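
One way to catch duplicates like those listed above is to reduce every name to a canonical key before running the rating tool. This is a minimal sketch only: normalize, find_duplicates and the ALIASES table are made up for illustration, the alias entries cover just the Green Light Chess spellings from the list, and stripping the dots means "2.1.0" and "2.10" are treated as the same version, which would need checking by hand.

```python
import re
from collections import defaultdict

# Hypothetical alias table: short spellings mapped to one program name.
ALIASES = {"glc": "greenlightchess", "glchess": "greenlightchess"}

def normalize(name):
    """Reduce an engine name to a 'program version' key."""
    s = name.lower()
    s = re.sub(r"\bv(?=\d)", "", s)      # "v1.59" -> "1.59"
    m = re.match(r"([^\d]*)(.*)", s)     # split at the first digit
    prog = re.sub(r"[^a-z]", "", m.group(1))
    ver = re.sub(r"[^a-z0-9]", "", m.group(2))
    return ALIASES.get(prog, prog) + " " + ver

def find_duplicates(names):
    """Return {key: [spellings]} for keys that appear under several spellings."""
    groups = defaultdict(list)
    for n in names:
        groups[normalize(n)].append(n)
    return {k: v for k, v in groups.items() if len(v) > 1}

if __name__ == "__main__":
    names = ["AnMon 5.50", "AnMon5.50", "GLC 3.01.2.2", "GLChess 3.0122",
             "Green Light Chess 301.2.2", "Pepito 1.59", "Pepito v1.59",
             "Shredder 9", "Shredder 9 UCI"]
    for key, spellings in find_duplicates(names).items():
        print(key, "<-", spellings)
```

"Shredder 9" and "Shredder 9 UCI" are deliberately left as separate entries, since it is not clear from the list whether they are the same program.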

-S