Computer Chess Club Archives



Subject: Re: nullmove and tactics

Author: Sune Fischer

Date: 16:17:27 03/27/04



>>One can prove that draws are not important if one is only interested in knowing
>>which one is better.
>
>Now this is an interesting point. My statistical analyses have assumed decisive
>games, because in my testing I've come across very few draws in C to C games. Is
>your assertion "draws are not important" because (for instance) your 20-10
>result is really a 15-5-10 result (which I think would reach my binomial
>threshold), or is there some sort of "trinomial" distribution out there that I
>should be aware of?

There might be :)
Have a look at Rémi Coulom's paper, I think it is called "who is better", from
November 24, 2002.

It explains, to those who can follow the math, why draws do not count.
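
One way to see why draws can drop out (a sketch under my own assumptions, not necessarily the derivation in that paper): put a uniform prior on the chance p that engine A wins a decisive game, then compute the posterior probability that p > 0.5 from the wins and losses alone. The function name and the choice of prior are assumptions for illustration.

```python
from math import comb

def likelihood_of_superiority(wins, losses, draws=0):
    """Posterior P(p > 0.5) with a uniform prior on p, the chance of
    winning a decisive game. `draws` is accepted but deliberately unused."""
    n = wins + losses + 1
    # For a Beta(wins+1, losses+1) posterior, P(p > 0.5) equals
    # P(X <= wins) where X ~ Binomial(wins + losses + 1, 0.5).
    return sum(comb(n, k) for k in range(wins + 1)) / 2 ** n

# The 15-5-10 example from the quoted message: draws do not change the answer.
print(likelihood_of_superiority(15, 5, draws=10))
print(likelihood_of_superiority(15, 5, draws=0))
```

Both calls print the same value, because only the decisive games enter the calculation in this model.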

>>If I suspect something might be wrong I will stop the match and investigate, but
>>one can easily imagine 10-0 or similar under proper conditions.
>
>Like newbie engines with no book (repeating the same test 10 times).
>Like, perhaps, a very small or very bad book.
>Like, what else?

First of all, I have a good imagination :)
Secondly, I've seen plenty of matches where one engine leads by 10 points, then
gets overtaken and is behind 5 points and comes back to win by 5 points.

Such things happen in long matches, so I wouldn't put any trust in 10 games,
personally.
Just try flipping a coin 10 times and see how often you get a 5-5 result;
getting 6-4/4-6 has a higher probability, and 7-3/3-7 is nearly as likely.
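
The coin-flip numbers are easy to check exactly. A minimal sketch, assuming a fair coin and counting a split in either direction (6-4 or 4-6) as the same outcome; the helper name is made up for illustration.

```python
from math import comb

def split_probability(heads, flips=10):
    """Chance of a heads/tails split in either direction with a fair coin."""
    tails = flips - heads
    ways = comb(flips, heads)
    if heads != tails:
        ways *= 2  # count the mirrored result (e.g. 4-6 as well as 6-4)
    return ways / 2 ** flips

for heads in (5, 6, 7):
    print(f"{heads}-{10 - heads}: {split_probability(heads):.3f}")
# prints, one per line: 5-5: 0.246, 6-4: 0.410, 7-3: 0.234
```

So an exact tie turns up in only about a quarter of 10-flip runs, and a 6-4 split one way or the other is the single most likely outcome.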

>>
>>In fact yesterday I played a match where the score after 15 games was 13.5-1.5
>>in favor of the new version.
>>It actually ended up losing the match by 49-51 :(
>
>I would argue, based on your final result, that you cannot conclude anything
>about the difference between your two programs. I certainly wouldn't throw out
>your new version on that result. That you got to a 12 game difference and then
>ended up with an inconclusive result is highly unusual, but of course not
>impossible. I most likely would have stopped testing at or before the 13.5-1.5
>game point, and at the 100 game point, I don't think you can prove I would have
>made a mistake.

You would have concluded the new version was clearly stronger instead of
concluding that they are probably very close.

I would call that a mistake.

>When you're playing 100 game matches, and therefore have seen 1000s of games,
>I'm not surprised you've seen 6-8 game streaks. But I would argue that you
>haven't learned anything by playing such long matches. Yes, those streaks might
>make me accept a bad new version. But after 100 games, you can only still say
>that I "might" have made a mistake.
>
>One weakness to my testing method (go until you get a "significant" result (I
>use 95% confidence) or until I get bored), is that it smacks of self-selection.

The problem is often that the differences are very small; we may be talking about
5-10 Elo, in which case it is expected that the results will be close.

I've seen these confidence tables; it's usually something like: if you lead by 30
points after 100 games, you have 95% confidence.

However we never get a 30 point lead in a 100 game match between almost equal
engines, so getting to 95% is usually not possible.
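
To put rough numbers on this, here is a sketch using the standard logistic Elo expectation formula. Treating the expected score as exact and ignoring variance are simplifying assumptions, and the function names are just for illustration.

```python
def expected_score(elo_diff):
    """Expected score per game for the stronger side (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def expected_lead(elo_diff, games=100):
    """Expected point lead (own score minus opponent's) over a match."""
    s = expected_score(elo_diff)
    return games * (2.0 * s - 1.0)

for diff in (5, 10):
    print(f"{diff} Elo: {expected_score(diff):.4f} per game, "
          f"lead of about {expected_lead(diff):.1f} points in 100 games")
```

Under this model a 10 Elo edge is worth roughly a 3-point lead over 100 games, nowhere near 30.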

>If one waits long enough, chance will ensure the answer one wants. So picking a
>30 or 100 game limit seems a reasonable safeguard against this.
>
>What to do with your inconclusive result in such cases is another matter. If
>your 49-51 result were testing your implementation of null move, I'd be worried.
>If you were only messing around with some eval weights, I'd be reassured that I
>hadn't broken anything TOO badly.
>
>Bottom line, this stuff is hard :)

Definitely.

If you want to revolutionize computer chess don't try and invent a new
algorithm, rather find a way to quickly test for small improvements! :)

-S.
>Pat




Last modified: Thu, 15 Apr 21 08:11:13 -0700
