Author: Chris Welty
Date: 01:56:51 10/07/04
Here's how to tell whether test results in a head-to-head competition are statistically meaningful or simply within normal error margins.

Ignore draws, since they don't tell us which engine is better. Just look at the number of wins (W) and losses (L). Calculate the net score S = W - L and the number of decisive results R = W + L. Now calculate T = abs(S)/sqrt(R).

If two programs are of equal strength, then about 95% of test runs will have T < 2, and about 99.7% of test runs will have T < 3. So if T > 3, the result is unlikely to have happened by chance. Even T > 2 is pretty good. T < 2 tells you that you don't have enough games yet to say which engine is better.

If you've set up the engines wrong, the statistics are still valid, but they measure whether engine A (wrongly set up) is better than engine B (wrongly set up).

Examples:

Junior (on a 233MHz PC) beats Arasan (3.2GHz PC) 7-0: S=7, R=7, T=2.6, fairly likely Junior beats Arasan.

Pro Deo 1.0 (default) vs Shredder 8, 9.0 - 31.0 (+4 =10 -26): S=-22, R=30, T=4.0, Shredder beats Pro Deo with these settings.

Pro Deo 1.0 (GS2) vs Shredder 8, 16.5 - 23.5 (+9 =15 -16): S=-7, R=25, T=1.4, need more testing to say which is better.
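For anyone who wants to run the numbers without a calculator, here is a minimal Python sketch of the calculation described above. The function names t_statistic and verdict are my own and purely illustrative; the thresholds of 2 and 3 are the ones from the post, and the wording of the verdicts is a paraphrase.

```python
import math


def t_statistic(wins, losses):
    """Return T = abs(W - L) / sqrt(W + L). Draws are ignored entirely."""
    results = wins + losses  # R: number of decisive games
    if results == 0:
        raise ValueError("need at least one decisive game")
    return abs(wins - losses) / math.sqrt(results)


def verdict(t):
    """Map T onto the rough thresholds from the post (2 and 3)."""
    if t > 3:
        return "unlikely to be chance"
    if t > 2:
        return "fairly likely a real difference"
    return "need more games"


# The three matches from the post, as (label, wins, losses) for the first engine.
matches = [
    ("Junior vs Arasan", 7, 0),
    ("Pro Deo 1.0 (default) vs Shredder 8", 4, 26),
    ("Pro Deo 1.0 (GS2) vs Shredder 8", 9, 16),
]

for label, wins, losses in matches:
    t = t_statistic(wins, losses)
    print(f"{label}: T={t:.1f} -> {verdict(t)}")
```

This reproduces T=2.6, T=4.0 and T=1.4 for the three examples. Note that T only says how surprising the result is; the sign of S = W - L tells you which engine is ahead.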
Last modified: Thu, 15 Apr 21 08:11:13 -0700