Computer Chess Club Archives


Subject: Re: nullmove and tactics

Author: Dann Corbit

Date: 15:56:53 03/27/04


On March 27, 2004 at 18:21:56, Pat King wrote:

>On March 26, 2004 at 20:14:15, Sune Fischer wrote:
>
>>On March 26, 2004 at 04:54:57, Uri Blass wrote:
>>
>>>On March 24, 2004 at 17:31:35, Dann Corbit wrote:
>>>
>>>>On March 24, 2004 at 16:53:08, Uri Blass wrote:
>>>>[snip]
>>>>>The difference is more important and 10-0 is clearly more telling than 19-11
>>>>
>>>>It is stronger, but less reliable.
>>>
>>>No 10-0 is clearly more reliable than 19-11
>>
>>The interesting question is if 10-0 is more "reliable" than 20-10, and it isn't.
>
>From a statistical viewpoint, it is. 10-0 far exceeds 99% confidence, whereas
>20-10 doesn't quite reach 95% confidence (see my table of "significant" wins
>elsewhere in this thread).

By confidence, you mean "better or not" I assume, because it is statistically
invalid to compute confidence intervals with fewer than 30 measurements.
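
For anyone who wants to check the "better or not" figures quoted above, here is a minimal sketch of the binomial tail calculation. It assumes decisive games only and two equal programs (p = 0.5) as the null hypothesis; the function name binomial_tail is made up for illustration, and whether one reads the result one-sided or two-sided is a convention the thread does not pin down.

```python
from math import comb

def binomial_tail(wins: int, games: int, p: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(games, p), i.e. at least this many wins."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

# Scores mentioned in this thread, draws ignored (assumption).
for wins, losses in [(10, 0), (19, 11), (20, 10)]:
    games = wins + losses
    one_sided = binomial_tail(wins, games)
    two_sided = min(1.0, 2 * one_sided)  # symmetric because p = 0.5
    print(f"{wins}-{losses}: one-sided p = {one_sided:.4f}, "
          f"two-sided p = {two_sided:.4f}")
```

Under these assumptions 10-0 is far below 1% either way, while 20-10 comes out just under 5% one-sided and just under 10% two-sided, so whether it "reaches 95%" depends on which of the two is meant.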

>>One can prove that draws are not important if one is only interested in knowing
>>which one is better.
>
>Now this is an interesting point. My statistical analyses have assumed decisive
>games, because in my testing I've come across very few draws in C to C games. Is
>your assertion "draws are not important" because (for instance) your 20-10
>result is really a 15-5-10 result (which I think would reach my binomial
>threshold), or is there some sort of "trinomial" distribution out there that I
>should be aware of?

500 games and 100 draws with 280 wins for me and 120 wins for the opponent.

500 games and 400 draws with 70 wins for me and 30 wins for the opponent.

500 games and 490 draws with 7 wins for me and 3 wins for the opponent.

The ratio of wins is the same.
The scoring ratio (wins plus half the draws, over games played) is:

(50 + 280)/500 = 0.66

(200 + 70)/500 = 0.54
 versus
(245 + 7)/500 = 0.504

Not a dominating effect, but I think it is a mistake to ignore it.
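
A small sketch for checking score fractions like the ones above, counting a draw as half a point. The function name score_fraction is hypothetical, and the three records are taken as 500 games each, which for the third one means 490 draws (an assumption on my part).

```python
def score_fraction(wins: int, losses: int, draws: int) -> float:
    """Points scored per game, with a draw worth half a point."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games

# (wins, losses, draws): same 7:3 win ratio, increasing share of draws.
records = [(280, 120, 100), (70, 30, 400), (7, 3, 490)]

for wins, losses, draws in records:
    ratio = wins / (wins + losses)
    score = score_fraction(wins, losses, draws)
    print(f"+{wins} -{losses} ={draws}: win ratio {ratio:.2f}, score {score:.3f}")
```

The win ratio stays at 0.70 in every row, while the score fraction slides toward 0.5 as draws take over.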

>>Note this is not to be confused with the question of how much difference in
>>strength there is.
>>It's two very different questions.
>
>A point I granted Dann elsewhere in the thread.
>>
>>>It usually will not happen, but that does not mean it is less reliable when it
>>>happens (you may suspect that something in the conditions is wrong when you see
>>>10-0, but if you see that no program was significantly slower in nps during the
>>>match, then you can safely stop the match after 10-0 and say that the new program
>>>is better).
>>
>>If I suspect something might be wrong I will stop the match and investigate, but
>>one can easily imagine 10-0 or similar under proper conditions.
>
>Like newbie engines with no book (repeating the same test 10 times).
>Like, perhaps, a very small or very bad book.
>Like, what else?
>>
>>In fact yesterday I played a match where the score after 15 games was 13.5-1.5
>>in favor of the new version.
>>It actually ended up losing the match by 49-51 :(
>
>I would argue, based on your final result, that you cannot conclude anything
>about the difference between your two programs. I certainly wouldn't throw out
>your new version on that result. That you got to a 12 game difference and then
>ended up with an inconclusive result is highly unusual, but of course not
>impossible. I most likely would have stopped testing at or before the 13.5-1.5
>game point, and at the 100 game point, I don't think you can prove I would have
>made a mistake.

After ten games, I would immediately run 20 more.  Then I would have a
reasonable idea about how much better the change really is.

>>Honestly I do not remember having seen such a drastic score difference before,
>>but I do regularly see a sequence of 6-8 straight wins by one of the engines in
>>a 100 game match, so it's not impossible to imagine this might occur at the
>>beginning of the match.
>>
>>-S.
>When you're playing 100 game matches, and therefore have seen 1000s of games,
>I'm not surprised you've seen 6-8 game streaks. But I would argue that you
>haven't learned anything by playing such long matches. Yes, those streaks might
>make me accept a bad new version. But after 100 games, you can only still say
>that I "might" have made a mistake.

The more games you play, the more likely you are to see long streaks (obviously).  I
think that there was an SSDF streak that started off 10/0 and ended up about
even, but I am recalling this from memory and I might be wrong about it.

We would expect a 10/0 or an 0/10 result between two even programs (crudely,
ignoring draws) about once every 512 ten-game matches: 2 * (1/2)^10 = 1/512.
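
The crude 1-in-512 figure can be reproduced exactly with a few lines. This is only a sketch under the same simplification (no draws, each game an independent coin flip); sweep_probability is a name invented here.

```python
from fractions import Fraction

def sweep_probability(games: int, p_win: Fraction = Fraction(1, 2)) -> Fraction:
    """Chance that either side wins every game of a drawless match."""
    return p_win**games + (1 - p_win)**games

# Ten-game match between two even programs.
print(sweep_probability(10))
```

Using Fraction keeps the result exact, so it can be compared directly against 1/512 rather than a rounded decimal.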

>One weakness to my testing method (go until you get a "significant" result (I
>use 95% confidence) or until I get bored), is that it smacks of self-selection.
>If one waits long enough, chance will ensure the answer one wants. So picking a
>30 or 100 game limit seems a reasonable safeguard against this.
>
>What to do with your inconclusive result in such cases is another matter. If
>your 49-51 result were testing your implementation of null move, I'd be worried.
>If you were only messing around with some eval weights, I'd be reassured that I
>hadn't broken anything TOO badly.
>
>Bottom line, this stuff is hard :)

No argument there.
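
The self-selection worry quoted above (keep playing until the score looks "significant") can be put into numbers. Below is a rough sketch, not anybody's actual testing procedure: it assumes two exactly equal engines, no draws, and a two-sided exact binomial test at 95%, and it computes the chance of wrongly declaring a winner when you check after every game versus only once at the end. The names two_sided_p and false_alarm_rate are made up for this example.

```python
from math import comb

def two_sided_p(wins: int, games: int) -> float:
    """Two-sided binomial p-value against a fair coin (draws ignored)."""
    extreme = max(wins, games - wins)
    tail = sum(comb(games, k) for k in range(extreme, games + 1)) / 2**games
    return min(1.0, 2 * tail)

def false_alarm_rate(max_games: int, alpha: float = 0.05,
                     peek: bool = True) -> float:
    """Exact chance that two EQUAL engines get declared different.

    peek=True  : test after every game, stop at the first "significant" score.
    peek=False : test once, after max_games.
    """
    alive = {0: 1.0}  # wins so far -> probability the match is still running
    stopped = 0.0
    for games in range(1, max_games + 1):
        nxt = {}
        for wins, prob in alive.items():
            for w in (wins, wins + 1):  # lose or win the next game, 50/50
                nxt[w] = nxt.get(w, 0.0) + prob / 2
        alive = {}
        for wins, prob in nxt.items():
            check = peek or games == max_games
            if check and two_sided_p(wins, games) < alpha:
                stopped += prob  # a "winner" is declared by pure chance
            else:
                alive[wins] = prob
    return stopped

print("test once at 100 games :", round(false_alarm_rate(100, peek=False), 4))
print("test after every game  :", round(false_alarm_rate(100, peek=True), 4))
```

With a single test at the end the false alarm rate stays under the nominal 5%; checking after every game pushes it above that, which is the reason a fixed game limit decided in advance is a sensible safeguard.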



Last modified: Thu, 15 Apr 21 08:11:13 -0700
