Computer Chess Club Archives



Subject: Re: None of these tests are truly scientific!

Author: KarinsDad

Date: 15:06:03 01/27/99


On January 27, 1999 at 14:31:51, Don Dailey wrote:

[snip]

>>
>>Statistically, neither sample set is large enough nor accurate enough (one game
>>run at more exacting times, or multiple games run at quick times) to be
>>considered scientific. Any results you get, no matter how you do it, have to be
>>taken with a grain of salt.
>
>I don't believe this is correct.  What makes something scientific is
>how you interpret the results and what you do with them.  My intent is
>to let the results guide me.  I never draw firm conclusions from
>anything other than an infinite amount of data.

In other words, you never draw firm conclusions. Either that, or you must be
really good, to be able to draw conclusions from infinite amounts of data. Just
kidding :)

>
>In the case where at least one other program gets a high match rate
>we have a result that is adequate for our needs.  In this case we
>should drop the discussion and consider the matter closed.

Another program such as Bionic (note: not Bionic Impakt)? Or did you mean a
different third-party program?

>   In an
>example like the one Bruce gives, where the best matching program is 60% and
>Crafty matches 95%, we have something that is significant.   In this
>case I don't consider Bionic guilty, I just don't consider the matter
>closed.  If you apply black and white to statistical data there is
>never enough data to be "scientific" but I don't intend to draw such a
>firm conclusion from this data.   You are only right if you thought
>we intended to prove something from this data.  Do you understand now?

Actually, I believe it is you who does not understand my position. Any test you
do has no bearing on whether something improper has been done. Why? Because the
code is not identical to Crafty's, because your tests are not run at tournament
time controls, etc. The testers are merely having fun testing and comparing
results. Any conclusions drawn from those results are, like the majority of
statistics, in the eye of the beholder.

Granted, I am as interested as the next person in what those test results are.
And a lot of conclusions can be drawn from those results. It's just that the
conclusions cannot confirm or deny impropriety.

The implication of your paragraph above is that if all programs (including
Crafty) match Bionic Impakt's results at a rate of 55% to 65%, then the authors
of Bionic Impakt have definitely modified the code enough that it can be
considered "unique" and not derived, and hence no impropriety occurred (i.e. the
case would be closed). Is this your position?

In order to even consider this position (and still be "scientific"), I contend
that you would have to take the games from the other competitors in that
tournament and run the same set of tests. What if Fritz came up with a 95%
match with the King results (which obviously will not happen)? Would this imply
that King was a clone of Fritz? Where does one draw the line?

My contention is that the test results are merely interesting: regardless of
what they show, they are conclusive of nothing beyond the fact that the tests
occurred and the following data was discovered.

BTW, I agree with the rest of your posting, in that I am all for making the
test cases as close as possible to the tournament conditions, and, if that is
not possible, as close as possible to one another. But it still wouldn't prove
anything.

KarinsDad

>
>Nevertheless, I am insisting on a more consistent testing methodology.
>We can't all run at different depths on different hardware with different
>matching rules and expect to learn much.  I am in favor of
>the most liberal matching rules and deeper runs (as you are), because
>they better match the actual playing conditions, and the more liberal
>matching rules give Bionic the benefit of the doubt, which I think is
>the fairer approach.  If the results are unclear even with liberal
>matching rules, we drop it.
>
>As far as matching unequal hardware, I have no problem with considering
>all Pentium-class machines equal by clock speed (even though they really
>are not) and overcompensating for the slower ones.   What I would like
>to see is that someone does a reference Crafty test (with the version
>that is alleged to be the Bionic version) and that all other programs
>and tests be guaranteed to be at least equal in hardware/time.  For
>instance, if the reference is done at Bruce's 1 minute level on a
>Pentium II at 400 MHz, we might require everyone else to run at 2 minutes
>adjusted by clock speed: 4 minutes for a Pentium 200, etc.   We might
>double the time for 486's, etc.
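
As a sketch, that normalization might look like the following (in Python). The
2-minute requirement, the Pentium 200 example, and the 486 doubling come
straight from the paragraph above; everything else is an illustrative
assumption.

    # Hypothetical sketch of the clock-speed adjustment described above:
    # a reference run at 1 minute/move on a 400 MHz Pentium II, with
    # everyone else required to run at 2 minutes scaled by clock speed.
    REF_MHZ = 400      # reference machine: Pentium II 400 MHz
    REF_SECONDS = 60   # reference level: 1 minute per move
    SAFETY = 2         # "run at 2 minutes adjusted by clock speed"

    def adjusted_seconds(test_mhz, is_486=False):
        t = REF_SECONDS * SAFETY * (REF_MHZ / test_mhz)
        if is_486:
            t *= 2     # "we might double the time for 486's"
        return t

    print(adjusted_seconds(400))        # 120 s: 2 minutes on the reference class
    print(adjusted_seconds(200))        # 240 s: the 4 minutes quoted for a Pentium 200
    print(adjusted_seconds(133, True))  # a 486/133 would get doubled again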
>
>This test by nature is inexact, but if you construct it correctly, it
>can be used effectively to draw some conclusions,  not the least of
>which might be to stop the discussion.
>
>
>- Don
>
>
>>My way would (rough guess) take 7 (?) games * 10 minutes per move per side (due
>>to a slower speed system) * 120 moves per game (for both sides combined; this is
>>an average, I did not look it up) * the number of programs tested (say 6), or
>>about 5 weeks if one person did it all. However, if you gave a different game
>>each to 7 individuals (all of whom had all 6 programs to test with), it would
>>take about 5 days. This would give you a better (but still not perfect) set of
>>data than 1 minute per move IMO.
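
That rough guess checks out, with all inputs taken from the paragraph above:

    # Back-of-the-envelope check (Python) of the estimate above.
    games, moves_per_game, minutes_per_move, programs = 7, 120, 10, 6
    total_minutes = games * moves_per_game * minutes_per_move * programs  # 50,400
    print(total_minutes / 60 / 24)      # 35.0 days for one person: about 5 weeks
    print(total_minutes / 60 / 24 / 7)  # 5.0 days split across 7 people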
>>
>>However, I am not doing the tests, so I'm not trying to tell you how to do it.
>>Just my opinion.
>>
>>>
>>>I am flexible about the processor because I didn't want to split hairs over
>>>whether a P5/133 is X% slower than a P6/200 or whatever.  I figured that a few
>>>people might run this on Crafty using different hardware, and that might
>>>show us what effect this had on match rate.
>>>
>>>This is a little too multivariate to make a good controlled experiment, but
>>>people will have reservations, possibly the same people, no matter what attempts
>>>are made to control the experiment better.  I don't think it is possible to
>>>control it perfectly, so if you try to do so, people will point out the flaws
>>>anyway.
>>
>>I agree. No matter what you do, people (like myself above :) ) will point out
>>the "flaws" (I prefer to think of them as alternatives).
>>
>>Good luck with your tests!
>>
>>KarinsDad
>>
>>>
>>>bruce


