Computer Chess Club Archives

Subject: Re: Value of playing different versions of a program against each other

Author: Richard Pijl
Date: 03:22:19 01/08/03

On January 07, 2003 at 10:17:03, Lieven Clarisse wrote:

>On January 06, 2003 at 18:24:31, Dann Corbit wrote:
>
>>On January 06, 2003 at 17:40:53, Lieven Clarisse wrote:
>>
>>>On January 06, 2003 at 16:56:35, Tom King wrote:
>>>
>>>>Hi all,
>>>>
>>>>What do people think about playing different versions of your program against
>>>>each other as a way of testing?
>>>>
>>>>I'm playing around with it right now, between v0.07 and a newer version of my
>>>>program. The newer version is winning handsomely: +24,=18,-10.
>>>>
>>>>This implies a reasonably impressive increase in strength, almost 100 ELO. Ok,
>>>>ok, it's a small sample, so the margin of error could be big.
>>>>
>>>>However, my gut feel is that playing different versions of your programs tends
>>>>to overstate the strength differences. What do people think?
>>>>
>>>
>>>The best way IMHO is to test it against engines with more or less equal
>>>strength. You can use the WBEC rating list to get an idea of the strengths of
>>>the different engines. Try to find a range where your program gets a 50% score
>>>(for instance range 40-50 from the rating list). When you change your program
>>>and see that you get >60% (over a sufficiently large number of games), it is
>>>time to play against the range 35-45, etc. Strength is best measured when
>>>playing equal opponents.
>>
>>I disagree.  I think you get better results from two sets:
>>1.  Programs that are about 100 ELO weaker.
>>2.  Programs that are about 100 ELO stronger.
>>
>>When the programs are about the same strength, you get too much coin toss
>>effect.
>
>I have to disagree: the larger the ELO difference, the larger the margin of
>error, i.e. the more games you have to play to know the ELO difference.
>
>Say your engine has 1500 ELO and you lose 10 games against Ruffian. You do not
>have any information about the engines' relative strengths: you know yours is
>significantly weaker, but by how much?
>
>At FICS, your RD decreases most if you play EQUAL opponents. I don't see why
>testing it against engines with +/- 100 ELO would be better; the further you go
>away from its ELO, the larger the amount of lottery involved. If you play 5
>games against an engine that has 100 points more than your engine, it can well
>be that you lose them all. Tiny improvements will be best reflected if you play
>opponents of your own strength: say, in 100 blitz games, going from a 50% to a
>55% score.
>If you play against stronger opponents, the difference will be smaller and
>harder to see...
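
The two numerical claims in the quoted exchange can be checked with a short sketch. It is not something any of the posters used: it assumes the standard logistic Elo model, a normal approximation for the score, and (in `elo_std_error`) games without draws. The function names are hypothetical. The first part estimates the Elo difference and a rough 95% interval for the +24,=18,-10 result; the second shows how the uncertainty of a measured Elo difference changes with the true rating gap for a fixed number of games.

```python
import math


def expected_score(diff):
    """Expected score for a player rated `diff` Elo above the opponent
    (logistic Elo model, an assumption of this sketch)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def elo_diff(score):
    """Elo difference implied by a fractional score, 0 < score < 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_estimate(wins, draws, losses, z=1.96):
    """Return (score, elo, elo_low, elo_high) for a match result.

    The interval uses a normal approximation on the score, so it is
    only a rough guide for small samples.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game yields 1, 0.5 or 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    # Clamp so the logarithm stays defined for lopsided results.
    low = elo_diff(max(score - z * se, 1e-6))
    high = elo_diff(min(score + z * se, 1.0 - 1e-6))
    return score, elo_diff(score), low, high


def elo_std_error(true_diff, games):
    """Approximate standard error (in Elo) of a measured difference.

    Assumes no draws and uses the delta method:
    d(elo)/d(score) = 400 / (ln(10) * p * (1 - p)).
    """
    p = expected_score(true_diff)
    se_score = math.sqrt(p * (1.0 - p) / games)
    return se_score * 400.0 / (math.log(10.0) * p * (1.0 - p))


if __name__ == "__main__":
    score, elo, low, high = match_estimate(24, 18, 10)
    print(f"score {score:.3f}, Elo {elo:+.0f}, "
          f"rough 95% interval {low:+.0f} to {high:+.0f}")

    for gap in (0, 100, 200, 400):
        print(f"gap {gap:3d} Elo, 100 games: "
              f"std error about {elo_std_error(gap, 100):.0f} Elo")
```

Under these assumptions the 52-game result comes out a little under 100 ELO, with an interval many tens of ELO wide on either side. The standard error is smallest against an equal opponent, grows only slightly at a 100 ELO gap, and grows much faster at gaps of several hundred ELO.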

If you have an error it is more likely that you will notice it when playing
against a weaker program. I always include both stronger and weaker programs (up
to 300-400 ELO weaker) when testing the Baron. When Baron loses against a weak
program, I know that it is either a bad opening line (which should be corrected)
or a bug/weakness in my program.

For this it is of course important to choose stable programs (in performance) to
test against.
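
As a rough illustration of why a loss against a much weaker program is worth a closer look, here is a minimal sketch. It assumes the standard logistic Elo model; the function names, the 300 ELO threshold and the sample results are all made up for the example and are not taken from my own test setup.

```python
def expected_score(rating_gap):
    """Expected score against an opponent rated `rating_gap` Elo lower
    (logistic Elo model, an assumption of this sketch)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def suspicious_losses(results, min_gap=300):
    """Return the losses against opponents at least `min_gap` Elo weaker.

    `results` is a list of (opponent, rating_gap, score) tuples, where
    rating_gap is own rating minus the opponent's and score is 1, 0.5 or 0.
    """
    return [(opponent, gap) for opponent, gap, score in results
            if gap >= min_gap and score == 0]


if __name__ == "__main__":
    for gap in (300, 400):
        print(f"{gap} Elo weaker: expected score {expected_score(gap):.2f}")

    # Hypothetical results, invented purely for illustration.
    results = [
        ("EngineA", 350, 1),
        ("EngineB", 320, 0),    # loss against a much weaker program
        ("EngineC", -100, 0),   # loss against a stronger program
        ("EngineD", 400, 0.5),
    ]
    for opponent, gap in suspicious_losses(results):
        print(f"check the game against {opponent} ({gap} Elo weaker)")
```

Each game flagged this way is a candidate for replaying by hand to see whether the opening line or the engine itself was at fault.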

Richard.
Last modified: Thu, 15 Apr 21 08:11:13 -0700