Computer Chess Club Archives

Subject: Re: the guy who said cm6000 is stronger than 8000 is right!

Author: Dann Corbit

Date: 12:25:08 05/22/01

On May 22, 2001 at 14:50:00, James T. Walker wrote:

>On May 22, 2001 at 13:39:22, Dann Corbit wrote:
>
>>On May 22, 2001 at 13:27:09, stuart taylor wrote:
>>[snip]
>>>Yes, Crafty is number 4, which I overlooked. Sorry!
>>>But I didn't overlook the commercial list. That was very few games, though,
>>>which is good but says very little. And you can't just bundle all the amateur
>>>programs together with it to make CM8K look so great overall. It's nowhere
>>>near the same category as tests against recent commercial programs.
>>
>>In what way?
>>
>>According to the last WMCCC, the world champion was Shredder.  The runner-up
>>was Ferret, an amateur program.
>>
>>In the previous CCT contest, in which several commercial programs participated,
>>the winner was...
>>Crafty -- an amateur program.
>>
>>Look at the recent Leiden contest.  Some amateur programs were near the top and
>>triumphed over some commercial entries.  There used to be a large gap between
>>the amateur and commercial programs.  I believe that the gap was mostly due to
>>superior opening books of the commercial programs.  That gap has narrowed, as
>>the amateur entries now operate with sophisticated opening books.
>>
>>I believe that the gap between the strongest amateur programs and the strongest
>>commercial programs is very small.  Of course, there is not enough empirical
>>data to back up my assertion, so it is only an opinion.
>
>What do you mean by "very small"?  The latest SSDF list has Deep Fritz at 2650
>after 470 games and Crafty 17.07 at 2487 after 857 games.  What would constitute
>"enough empirical data"?  On what empirical data do you base your "opinion"?

     Program (hardware)                    Rating    +     - Games Score  Opp.
   1 Deep Fritz  128MB K6-2 450 MHz          2650   34   -32   470   66%  2537
   2 Fritz 6.0  128MB K6-2 450 MHz           2626   24   -24   897   66%  2512
   3 Junior 6.0  128MB K6-2 450 MHz          2594   22   -21  1109   64%  2490
   4 Chess Tiger 12.0 DOS 128MB K6-2 450 MHz 2578   27   -27   691   62%  2492
   6 Fritz 5.32  128MB K6-2 450 MHz          2547   26   -26   741   59%  2485
   6 Nimzo 7.32  128MB K6-2 450 MHz          2547   24   -24   857   59%  2485
   8 Nimzo 8.0  128MB K6-2 450 MHz           2539   30   -30   546   58%  2486
   9 Gandalf 4.32f  128MB K6-2 450 MHz       2529   29   -29   584   52%  2518
  10 Junior 5.0  128MB K6-2 450 MHz          2528   26   -25   750   57%  2476
  11 Hiarcs 7.01  128MB K6-2 450 MHz         2526   37   -37   361   48%  2539
  12 Hiarcs 7.32  128MB K6-2 450 MHz         2525   27   -27   679   56%  2481
  13 SOS  128MB  K6-2 450 MHz                2524   23   -23   925   53%  2501
  14 Rebel Century 3.0  128MB K6-2 450 MHz   2514   31   -31   504   50%  2516
  15 Goliath Light  128MB K6-2 450 MHz       2496   30   -30   546   46%  2527
  16 Crafty 17.07/CB 128MB K6-2 450 MHz      2487   24   -24   857   47%  2505

In order for confidence to rise from roughly 2/3 (one standard deviation) to
95%, we must use two standard deviations.  Within that error bar, Deep Fritz is
between 2650 - (32*2) = 2586 and 2650 + (34*2) = 2718 Elo, and Crafty 17.07
(quite an old version) is between 2487 - (24*2) = 2439 and 2487 + (24*2) = 2535
Elo.

So, considering the error bars, (2439, 2535) for Crafty against (2586, 2718)
for Deep Fritz, the intervals do not overlap, so we *can* say Deep Fritz is a
little stronger with pretty good certainty.  But (of course) Crafty has gone
through a raft of versions since then.  Is the same difference still true?
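
A quick sketch of that interval arithmetic in Python (assuming, as above, that
the published +/- margins are one standard deviation each, so that doubling
them gives roughly 95% confidence):

# Rough 95% confidence intervals from the SSDF figures, assuming the
# published +/- margins are one standard deviation each.
def interval(rating, plus, minus, k=2):
    """Widen the published margins by k standard deviations."""
    return (rating - k * minus, rating + k * plus)

deep_fritz = interval(2650, plus=34, minus=32)   # -> (2586, 2718)
crafty     = interval(2487, plus=24, minus=24)   # -> (2439, 2535)

# The two intervals overlap only if each starts below the other's end.
overlap = deep_fritz[0] <= crafty[1] and crafty[0] <= deep_fritz[1]
print(deep_fritz, crafty, "overlap" if overlap else "no overlap")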

Yace is similar in strength.  If given opening books of equal quality, I suspect
that the best amateur programs (e.g. Yace, Crafty) are very close to even with
the strongest commercial programs.

I think that most people have no clue what the numbers in the SSDF list mean
(and I mean not a *single* one of the numbers), and that's too bad.  Not saying
that you don't, of course.  But even the Elo figure itself is widely
misunderstood.
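
To take one of those numbers: the rating difference between two programs maps
to an expected game score through the standard Elo formula,
E = 1 / (1 + 10^((Rb - Ra)/400)).  A minimal sketch in Python:

# Expected score of player A (rated r_a) against player B (rated r_b)
# under the standard Elo model.
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Deep Fritz (2650) against its average opposition (2537): the
# 113-point gap predicts about a 66% score, which matches the 66%
# shown in the list above.
print(expected_score(2650, 2537))  # ~0.66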

Does the autoplayer used to play these games still issue a reset between each
move to the engines (which the commercial programs are designed to ignore)?

Personally, I think the learning attribute of some programs is not a strike
against the autoplayer, since the programs will learn in actual use as well and
thereby improve.  But it raises an interesting question.  Have the learning
programs been playing longer than new entries, gaining constantly from their
learning files?  If so, is the test accurate?

In other words, I think:
1.  The SSDF list is definitely the best data we have available for estimating
engine-versus-engine strength.

2.  The data ages rapidly with new versions of programs [e.g. Tiger 12 -- aren't
we on Tiger 14 now? Crafty 17.07 -- aren't we on 18.9 now?]

3.  The data is valid ONLY for the machines actually used and the exact
conditions of the trials.  On different architectures, the engines will quite
likely play very differently.  I have observed this effect strongly with Intel
compiler builds running on older and on newer machines: up to a factor of 4
difference from what you would expect judging by CPU MHz alone (see the sketch
after this list).

4.  There may be flaws in the experiment (but I have yet to see a better
design).

5.  The error bands in the strength figures are widely misunderstood.
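
On point 3, a rough sketch of how a factor-of-4 speed difference could show up
in the ratings.  The 50-70 Elo per doubling of search speed is a common rule of
thumb, not an SSDF figure, and 60 is just an assumed midpoint:

import math

ELO_PER_DOUBLING = 60  # rule-of-thumb value, assumed here

def elo_shift(speed_ratio):
    """Approximate rating change from running speed_ratio times faster."""
    return ELO_PER_DOUBLING * math.log2(speed_ratio)

# A factor-of-4 architecture difference could shift effective strength
# by something on the order of 120 Elo.
print(elo_shift(4.0))  # ~120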

I don't know that you will agree with me, but I expect you can see what I am
driving at by now.


