Computer Chess Club Archives



Subject: Statistics and Test results

Author: Chris Welty

Date: 01:56:51 10/07/04


Here's how to tell if test results in a head-to-head competition are
statistically meaningful or simply within normal error margins:

Ignore draws, since they don't tell us which engine is better. Just look at the
number of wins (W) and losses (L).

Calculate the net Score  S=W-L and the number of Results  R=W+L.
Now calculate T=abs(S/sqrt(R)).

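If two programs are of equal strength, each decisive game is effectively a coin
flip, so S has mean 0 and standard deviation sqrt(R), and T behaves roughly like
a standard normal value. About 95% of test runs will then have T<2, and about
99.7% of test runs will have T<3. So if T>3 it's unlikely to have happened by
chance. Even T>2 is pretty good. Less than 2 tells you that you don't have
enough games.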
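The calculation above can be sketched in a few lines of Python. This is only an illustration: the function names t_statistic and verdict are made up for this example, and the wording of the verdicts is a loose paraphrase of the thresholds of 2 and 3 given above.

```python
import math

def t_statistic(wins: int, losses: int) -> float:
    """Return T = |W - L| / sqrt(W + L). Draws are ignored entirely."""
    results = wins + losses          # R: number of decisive games
    if results == 0:
        raise ValueError("need at least one decisive game")
    score = wins - losses            # S: net score
    return abs(score) / math.sqrt(results)

def verdict(t: float) -> str:
    """Map T onto the rough thresholds of 2 and 3 (hypothetical wording)."""
    if t > 3:
        return "unlikely to be chance"
    if t > 2:
        return "pretty good evidence"
    return "not enough games"

# Example: 9 wins and 16 losses, draws left out.
t = t_statistic(9, 16)
print(round(t, 2), verdict(t))
```

Only the win and loss counts go in; the number of draws never enters the formula.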

If you've set up the engines wrong, the statistics are still valid but they
measure whether engine A (wrongly set up) is better than engine B (wrongly set
up).

Examples:
Junior (on a 233MHz PC) beats Arasan (3.2GHz PC) 7-0:
S=7, R=7, T=2.6, fairly likely Junior beats Arasan.

Pro Deo 1.0 (default) vs Shredder 8      09.0 - 31.0  (+ 4 =10 -26)
S=-22, R=30, T=4.0, Shredder beats Pro Deo with these settings.

Pro Deo 1.0 (GS2)     vs Shredder 8      16.5 - 23.5  (+ 9 =15 -16)
S=-7, R=25, T=1.4, need more testing to say which is better.
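
For short matches like the 7-0 one, the T rule is only a rough approximation, so it may be worth cross-checking against the exact coin-flip odds. The sketch below is not part of the method above; it is an ordinary two-sided binomial (sign) test under the same assumption that equal engines win each decisive game with probability 1/2. The function name exact_two_sided_p is invented for this example.

```python
from math import comb

def exact_two_sided_p(wins: int, losses: int) -> float:
    """Chance of a result at least this lopsided if every decisive game
    is a fair coin flip (draws ignored)."""
    n = wins + losses
    k = min(wins, losses)
    # Probability that the losing side scores k or fewer wins out of n.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double it, since either engine could have been the lopsided winner;
    # cap at 1 for balanced results.
    return min(1.0, 2 * tail)

# The three matches from the examples, as (label, wins, losses).
matches = [("Junior vs Arasan", 7, 0),
           ("Pro Deo (default) vs Shredder", 4, 26),
           ("Pro Deo (GS2) vs Shredder", 9, 16)]
for label, w, l in matches:
    print(f"{label}: p = {exact_two_sided_p(w, l):.4f}")
```

For the 7-0 result this comes to 2 x (1/2)^7, about 1.6%, which tells the same story as T=2.6: probably not chance, but not overwhelming either.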





Last modified: Thu, 15 Apr 21 08:11:13 -0700
