Computer Chess Club Archives



Subject: Re: None of these tests are truly scientific!

Author: Don Dailey

Date: 11:31:51 01/27/99


On January 26, 1999 at 18:46:52, KarinsDad wrote:

>On January 26, 1999 at 17:52:25, Bruce Moreland wrote:
>
>>
>>On January 26, 1999 at 16:25:17, KarinsDad wrote:
>>
>>>I'm glad that you are running other programs against the control. At what times
>>>are you running the programs, on what type and speed processors, and what is
>>>your matching criteria?
>>
>>One minute per move, you choose the processor, and a match is scored if you'd
>>play the move at the end of the minute.
>
>I would prefer slower times. I think that the main indicator is nodes per second
>times number of seconds or average total nodes per move. I realize that this is
>difficult to estimate for Bionic, however, you guys have been doing this for a
>long time and I think you could come up with an "educated guess".
>
>I understand your practicality issue, however, I'd rather take one of the games
>that Robert checked and run as close to an approximate in number of nodes per
>move as I could (and yes, all of this is questionable due to the search changes
>of running a program with SMP vs. no SMP, different hash sizes, etc.), rather
>than run all of the games for very short durations.
>
>Statistically, neither sample set is large enough nor accurate enough (one game
>run at more exacting times, or multiple games run at quick times) to be
>considered scientific. Any results you get, no matter how you do it, have to be
>taken with a grain of salt.

I don't believe this is correct.  What makes something scientific is
how you interpret the results and what you do with them.  My intent is
to let the results guide me.  I never draw firm conclusions from
anything other than an infinite amount of data.

In the case where at least one other program gets a high match rate,
we have a result that is adequate for our needs.  In this case we
should drop the discussion and consider the matter closed.   In an
example like the one Bruce gives, where the best matching program is 60% and
Crafty matches 95%, we have something that is significant.   In this
case I don't consider Bionic guilty, I just don't consider the matter
closed.  If you apply black-and-white thinking to statistical data there is
never enough data to be "scientific", but I don't intend to draw such a
firm conclusion from this data.   You are only right if you thought
we intended to prove something from this data.  Do you understand now?
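
The 60%-versus-95% comparison can be made concrete with a simple binomial tail calculation.  This is only an illustration of the kind of significance check implied, not anything from the original discussion, and the position count (100) is hypothetical:

```python
from math import comb

def tail_prob(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    matching moves out of n positions if each position independently
    matches with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If the best "innocent" program matches ~60% of moves, the chance of an
# independent program matching 95 of 100 positions is astronomically small,
# which is why a 95% match rate is worth keeping the question open.
p_by_chance = tail_prob(100, 95, 0.60)
```

A vanishingly small tail probability does not prove anything by itself (moves are not independent, and strong programs agree more often in forced positions), which matches Don's point that the result guides the investigation rather than closing it.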

Nevertheless, I am insisting on a more consistent testing methodology.
We can't all run at different depths on different hardware with different
matching rules and expect to learn much.  I am in favor of
the most liberal matching rules and deeper runs (as you are) because
they better match the actual playing conditions, and the more liberal
matching rules give Bionic the benefit of the doubt, which I think is
the fairer approach.  If the results are unclear even with liberal
matching rules, we drop it.

As far as matching unequal hardware, I have no problem with considering
all Pentium-class machines equal by clock speed (even though they really
are not) and overcompensating for the slower ones.   What I would like
to see is that someone does a reference Crafty test (with the version
that is alleged to be the Bionic version) and that all other programs
and tests are guaranteed to be at least equal in hardware/time.  For
instance, if the reference is done at Bruce's 1-minute level on a
Pentium II 400 MHz, we might require everyone else to run at 2 minutes
adjusted by clock speed: 4 minutes for a Pentium 200, etc.   We might
double the time for 486s, etc.
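
The adjustment above amounts to scaling the reference budget by clock ratio and then overcompensating.  As a sketch (the function name and the explicit 2x safety factor are my own framing of the example: 1 minute on a Pentium II 400 becomes 2 minutes there, 4 minutes on a Pentium 200):

```python
def adjusted_seconds(ref_seconds, ref_mhz, test_mhz, safety=2.0):
    """Time budget per move on the test machine: scale the reference
    budget by the clock-speed ratio, then apply a safety factor that
    overcompensates for slower (or merely assumed-equal) hardware."""
    return ref_seconds * safety * (ref_mhz / test_mhz)

# Reference: 60 seconds/move on a Pentium II 400 MHz.
adjusted_seconds(60, 400, 400)  # 120.0 -> 2 minutes on the same class
adjusted_seconds(60, 400, 200)  # 240.0 -> 4 minutes on a Pentium 200
```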

This test by nature is inexact, but if you construct it correctly, it
can be used effectively to draw some conclusions,  not the least of
which might be to stop the discussion.


- Don


>My way would (rough guess) take 7 (?) games * 10 minutes per move per side (due
>to a slower speed system) * 120 moves per game (for both sides combined, this is
>an average, I did not look it up) * the number of programs tested (say 6) or
>about 5 weeks if one person did it all. However, if you gave a different game
>each to 7 individuals (all of whom had all 6 programs to test with), it would
>take about 5 days. This would give you a better (but still not perfect) set of
>data than 1 minute per move IMO.
>
>However, I am not doing the tests, so I'm not trying to tell you how to do it.
>Just my opinion.
>
>>
>>I am flexible about the processor because I didn't want to split hairs over
>>whether a P5/133 is X% slower than a P6/200 or whatever.  I figured that a few
>>people might run this on Crafty using different hardware, and that might
>>show us what effect this had on match rate.
>>
>>This is a little too multivariate to make a good controlled experiment, but
>>people will have reservations, possibly the same people, no matter what attempts
>>are made to control the experiment better.  I don't think it is possible to
>>control it perfectly, so if you try to do so, people will point out the flaws
>>anyway.
>
>I agree. No matter what you do, people (like myself above :) ) will point out
>the "flaws" (I prefer to think of them as alternatives).
>
>Good luck with your tests!
>
>KarinsDad
>
>>
>>bruce


