Computer Chess Club Archives


Subject: Re: SSDF Rating List 2006-01-03 - no longer acceptable !

Author: Joseph Ciarrochi

Date: 12:49:03 01/05/06

Hi Dann,

re: testing the difference between differences.

The stats side gets a little tricky because the engines play many other engines
(which is fine), but they also play each other, which means there is some
dependency between the data points (this violates the independence assumption,
I think, although I doubt the violation would be serious with such large
samples of non-dependent data). So we could use only the data where the two
engines don't play each other.
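
To make the test concrete, here is a rough sketch in Python on the numbers
quoted further down. It ignores the dependency issue I just described, and it
assumes the published +/- margins are roughly 95% intervals (my assumption; the
two lists may define their margins differently):

from math import sqrt

def se(plus, minus):
    # Average the asymmetric margins and treat them as a 95% half-width.
    return ((plus + minus) / 2) / 1.96

# SSDF: Fruit 2852 +35 -33, Fritz 9 2819 +32 -30
d_ssdf = 2852 - 2819
se_ssdf = sqrt(se(35, 33) ** 2 + se(32, 30) ** 2)

# CEGT (bayeselo column): Fruit 2779 +16 -16, Fritz 9 2780 +14 -14
d_cegt = 2779 - 2780
se_cegt = sqrt(se(16, 16) ** 2 + se(14, 14) ** 2)

# z-test for the difference between the two differences
z = (d_ssdf - d_cegt) / sqrt(se_ssdf ** 2 + se_cegt ** 2)
print("SSDF diff %+d, CEGT diff %+d, z = %.2f" % (d_ssdf, d_cegt, z))

On these numbers z comes out around 1.3, which is nowhere near the usual 1.96
cutoff, so even ignoring the dependency problem the two lists don't clearly
disagree about the size of the Fruit-Fritz gap.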

Actually, the main reason I even emailed is that I have a hypothesis that
Fruit 2.2.1 gets better with longer time controls. This is consistent with the
pattern of findings between CEGT and SSDF (though the pattern may not be
significant :)).

If I could figure out a way to automate it, I could run Fritz versus Fruit at
five minutes per side, then six minutes, then seven, and so on, and we could
see if there is any trend.
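
Here is a rough sketch of how the automation might look in Python with the
python-chess package. Everything in it is my own assumption: the engine paths
are placeholders, "minutes per side" is crudely approximated as a fixed budget
per move rather than a real clock, and the trend check at the end is just an
ordinary least-squares regression of Fruit's score fraction on the time
control.

import chess
import chess.engine
from scipy.stats import linregress

ENGINES = {"fruit": "./fruit", "fritz": "./fritz"}  # placeholder paths

def play_game(white, black, seconds_per_move):
    # Play one game, both sides moving under a fixed per-move budget.
    board = chess.Board()
    players = {chess.WHITE: white, chess.BLACK: black}
    while not board.is_game_over():
        result = players[board.turn].play(
            board, chess.engine.Limit(time=seconds_per_move))
        board.push(result.move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

def fruit_score(minutes, games=20):
    # Fruit's score fraction in a small match at the given time control.
    fruit = chess.engine.SimpleEngine.popen_uci(ENGINES["fruit"])
    fritz = chess.engine.SimpleEngine.popen_uci(ENGINES["fritz"])
    per_move = minutes * 60 / 40       # spread the clock over ~40 moves
    score = 0.0
    for g in range(games):
        # Alternate colours so first-move advantage cancels out.
        white, black = (fruit, fritz) if g % 2 == 0 else (fritz, fruit)
        res = play_game(white, black, per_move)
        if res == "1/2-1/2":
            score += 0.5
        elif (res == "1-0") == (white is fruit):
            score += 1.0
    fruit.quit()
    fritz.quit()
    return score / games

controls = [5, 6, 7, 8, 9, 10]                     # minutes per side
scores = [fruit_score(m) for m in controls]
fit = linregress(controls, scores)
print("slope = %+.4f per minute, p = %.3f" % (fit.slope, fit.pvalue))

A positive slope with a small p-value would support the hypothesis; with only
twenty games per time control the noise would be large, though, so this would
be a starting point rather than a verdict.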

best
Joseph


On January 05, 2006 at 15:38:01, Dann Corbit wrote:

>On January 05, 2006 at 15:18:25, Joseph Ciarrochi wrote:
>
>>
>>>
>>>Statements like this come from a fundamental misunderstanding of the mathematics
>>>involved.
>>
>>
>>
>>Thank you for your comments, Dann. I should note that I have no fundamental
>>misunderstanding here; I teach statistics at the university level. However, I do
>>think it is good that you keep making the points you make. I should not toss
>>"significantly" around, even if this is just a fun hobby site.
>>
>>I suppose my main question is, "Is there a difference between the CEGT and SSDF
>>ratings?" To test this, you need to examine whether the difference between Fruit
>>and Fritz in the CEGT rating list is smaller than the difference between Fruit
>>and Fritz in the SSDF list (the complete-agreement hypothesis you state below).
>>This is a difference-between-differences test, not a direct test between means. I
>>could answer this question with some time, but, well, this is a hobby site and
>>I don't want it to look too much like what I do at work :) (though my
>>statistician geek side is pulling me to do this test, argh).
>
>I would be interested in the mathematics.  My major was Numerical Analysis, so
>you may even have me at a disadvantage here.
>
>My interpretation of both lists is:
>"Fritz 9 and Fruit 2.2.1 are of the same strength, within experimental
>certainty."
>
>Given that the experiments test different things (CEGT is at a much faster time
>control and uses standardized books; SSDF is at a slower time control and the
>engines use their own books), I do not think we should expect agreement (IOW,
>agreement or disagreement of the measurements would be equally unsurprising).
>
>I think it would be a mistake to test every program against the same opponents,
>unless you did a complete round-robin (with at least two games each so color bias
>is removed), which I think would be so tedious that nobody could conceivably
>attempt it.  Just the setup time would be mind-boggling.
>
>>Generally, I want to avoid emails that look like the results section of my
>>journal papers. I am definitely not casting aspersions at the SSDF site. I'm
>>just wondering, what are the key variables in which the sites differ?
>>
>>Anyway, what can I say? I think you do a nice job of explaining statistical
>>error, and I hope you keep doing it :)
>>
>>best
>>Joseph
>>
>>>
>>>>The current list has Fruit significantly better than Fritz 9, but the CEGT list
>>>>has them as similar, and all my (admittedly informal) tests have them as equal.
>>>>Maybe as the games keep coming in, we will see the gap between Fruit and Fritz
>>>>decrease?
>>>
>>>      THE SSDF RATING LIST 2006-01-03   1104075 games played by  274 computers
>>>                                           Rating   +     -  Games   Won  Oppo
>>>                                           ------  ---   --- -----   ---  ----
>>>   1 Fruit 2.2.1  256MB Athlon 1200 MHz      2852   35   -33   457   68%  2717
>>>   2 Fritz 9.0  256MB Athlon 1200 MHz        2819   32   -30   587   74%  2639
>>>
>>>2819 + 32 = 2851
>>>2852 - 33 = 2819
>>>
>>>Within experimental certainty, the SSDF list does not tell us which of
>>>these two programs is stronger.
>>>
>>>CEGT:
>>>All versions, adapted to Shredder 9 with 2750 ELO (2005-09-29)
>>>
>>>                  bayeselo         ELOstat 1.3
>>> #  Name         ELO    +    -    ELO    +    -   Score  Av. Op. Draws  Games
>>> 5  Fritz 9      2780  +14  -14   2768  +12  -12   63.8%  2674.3 30.0%   2236
>>> 7  Fruit 2.2.1  2779  +16  -16   2772  +14  -14   65.5%  2663.7 33.0%   1601
>>>
>>>2779 + 16 = 2795
>>>2780 - 14 = 2766
>>>
>>>Within experimental certainty, the CEGT list does not tell us which of
>>>these two programs is stronger.
>>>
>>>Given that the tests are under VERY different conditions (time control, books
>>>used, etc.) I find it quite interesting that the two placements are in complete
>>>agreement (Fritz 9 and Fruit 2.2.1 are of about the same strength).


