Computer Chess Club Archives


Subject: Re: about statistics, and junior in bilbao, for graham laight

Author: Vasik Rajlich

Date: 08:01:53 10/16/04


On October 16, 2004 at 10:13:53, martin fierz wrote:

>On October 16, 2004 at 09:44:13, Vasik Rajlich wrote:
>
>>On October 16, 2004 at 04:18:06, martin fierz wrote:
>>
>>>hi graham!
>>>
>>>in the last days you suggested that junior seriously underperformed in bilbao
>>>and even wrote a small program to prove your point. you were quite undeterred by
>>>all the people saying "too little games" because you were looking at the results
>>>your simulator gave you. i'd like to explain why your argument is flawed, and i
>>>will use your little program to do it :-)
>>>
>>>let's see, i will take the probabilities to win, draw and lose for the average
>>>computer player to be 50%, 40% and 10% (that is my 'best estimate' based on the
>>>actual results).
>>>
>>>what do i get:
>>>DJ won 0 points in 0.02% of the tournaments
>>>DJ won 0.5 points in 0.18% of the tournaments
>>>DJ won 1 points in 1.16% of the tournaments
>>>DJ won 1.5 points in 5.23% of the tournaments
>>>DJ won 2 points in 14.09% of the tournaments
>>>DJ won 2.5 points in 24.73% of the tournaments
>>>DJ won 3 points in 29.12% of the tournaments
>>>DJ won 3.5 points in 19.21% of the tournaments
>>>DJ won 4 points in 6.26% of the tournaments
>>>
>>>now, the disagreement begins as to what these numbers mean. you are implying
>>>that the above numbers indicate that DJ has a very, very low probability of
>>>scoring only 1.5 points. that in itself is quite true, but *every* single result
>>>is rather unlikely. what you really need to do is compare the most likely
>>>outcome (scoring 3 points) against the actual outcome (if you believe that the
>>>underlying winning probabilities are the truth). and NOT compare every single
>>>result vs 100%!
>>>
>>>so: most probable outcome would be all computers score 3 points, with a joint
>>>probability of this happening being (0.2912)^3 = 0.0247 = 2.5%
>>>the actual outcome had a probability of (0.1921)^2*(0.0523) = 0.0019 = 0.2%.
>>>
>>>these numbers show: the probability of any SINGLE result is very low - even the
>>>most probable result only happens in 2.5% of all cases. the probability of the
>>>actual result happening is 13 times smaller. in this sense, if you want to stick
>>>to your hypothesis that all computers were of similar strength, then this was a
>>>slightly unusual result. but it was most definitely NOT an improbable result.
>>>your mistake seems to be that you take the probability of a result occurring,
>>>and compare it to 1 ("0.2% is very unlikely - 1 in 500"). instead, you have to
>>>compare it with the probability of the most likely result occurring, and then
>>>things don't look improbable at all (0.2% vs 2.5% - 1 in 13). did i make this
>>>point clear enough?
>>>
>>>
>>>now, with all this said and done, the result gets even more likely if you factor
>>>in the playing strength of the humans. the match was very weird in the sense
>>>that they had 4 rounds for 3 players each, so one program had to play one human
>>>twice. bad luck for junior, it had to play topalov twice. he was the
>>>highest-rated human of the lot, and he just came back from a stunning
>>>performance at the fide world chess championship. david levy writes
>>>(http://www.chessbase.com/newsdetail.asp?newsid=1956)
>>>
>>>"But whatever the level of preparation of team GM it did not show itself to good
>>>effect in most of the games, although Topalov appeared to have a much better
>>>understanding of how computers play chess than did either of his team-mates."
>>>
>>>so topalov was the highest-rated + best prepared for this competition according
>>>to levy (and he knows a bit something about both chess and computer chess). if i
>>>take the 1-in-13 chance of the actual result happening, and add that topalov was
>>>the strongest player on the human side, that will make the actual result more
>>>probable of course, at least 1-in-10 i would guess compared to a "most likely"
>>>result. now i don't call that unlikely. do you?
>>>
>>>cheers
>>>  martin
>>
>>You're right that the argument that "the chance of this result is only 5.23%" is
>>bogus. The right form of that argument is: the chance of this result (1.5/4) or
>>less is: 0.02% + 0.18% + 1.16% + 5.23%. That's still a pretty low number.
>
>no, no, that is not what i mean. the argument is not about this at all.
>
>[snip]
>
>>If you consider Junior to be able to stand humans as well as Fritz and Hydra,
>>what happened was an extremely anomalous result.
>
>no, it's not, that was what that post was all about. it's a 1-in-10 chance or
>so. which i would definitely not call "extremely anomalous". would you?
>
>the whole argument is about the following: take a dice. roll it 3 times. you
>get, for argument's sake, the sequence 1,1,1. you stop and wonder - "wow, this
>was really unlikely" - you know some maths, and you go and calculate that the
>chance of this thing happening was 1 in 216. so you say "wow, this was really
>unlikely, 1 in 216, i proved it with numbers". that is graham's argument
>translated to a dice, and it is totally wrong, since you cannot do postmortem
>analysis of probabilities. ANY 3-number-sequence would have been exactly as
>unlikely as 1,1,1. and that is the point.
>even if the likelihood of 1,1,1 occurring was only 1 in 216, it was EQUALLY
>likely to appear as any other sequence.
>
>so, even if you assume that the machines all had equal winning chances in bilbao
>(wrong, since junior faced the strongest opposition), then what happened in
>bilbao had a decent chance to happen, because COMPARED TO THE MOST LIKELY
>OUTCOME it was still quite a likely outcome (10 times less likely than the most
>likely outcome).
>
>i can't make it any clearer than this, i'm afraid. but at least i tried, and
>didn't just say "not enough games", which is of course quite correct :-)
>
>cheers
>  martin

Martin,

first of all, statistics is a difficult topic: you can interpret the numbers
many ways, and you can argue about competing interpretations for a long time
without reaching a conclusion. Since you are a physicist I know that you know
this. I worked for more than 5 years doing radar simulations - it was pure
statistics - so I know this too.

So, let's start with your tables. We see that the chance of Junior scoring 1.5
or less is around 7%. We also see that the chance of Fritz and Hydra scoring 7/8
or more is - as an eyeball figure - let's say 3%. The chance of both events
happening together is around 1/500, give or take. So far, so good.
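
For anyone who wants to check the eyeball figures, here is a minimal Python sketch that computes the score distribution exactly instead of simulating it. It assumes the per-game model from the quoted post (win 50%, draw 40%, loss 10%, identical for every program and independent from game to game); the function and variable names are made up for illustration.

```python
# Assumed per-game model from the quoted post, keyed by half-points scored:
# 2 = win (1 point), 1 = draw (0.5 points), 0 = loss (0 points).
GAME = {2: 0.5, 1: 0.4, 0: 0.1}

def score_distribution(n_games, game=GAME):
    """Exact distribution of the total score, in half-points, over n_games."""
    dist = {0: 1.0}
    for _ in range(n_games):
        nxt = {}
        for total, p in dist.items():
            for pts, q in game.items():
                nxt[total + pts] = nxt.get(total + pts, 0.0) + p * q
        dist = nxt
    return dist

def prob_at_most(dist, half_points):
    return sum(p for s, p in dist.items() if s <= half_points)

def prob_at_least(dist, half_points):
    return sum(p for s, p in dist.items() if s >= half_points)

four = score_distribution(4)    # one program, 4 games
eight = score_distribution(8)   # two programs pooled, 8 games

p_junior_low = prob_at_most(four, 3)       # 1.5 points or less out of 4
p_others_high = prob_at_least(eight, 14)   # 7 points or more out of 8
p_both = p_junior_low * p_others_high      # assumes the two events are independent

for half, p in sorted(four.items()):
    print(f"{half / 2:>3} points: {100 * p:6.2f}%")
print(f"P(Junior <= 1.5/4)      = {p_junior_low:.4f}")
print(f"P(Fritz+Hydra >= 7/8)   = {p_others_high:.4f}")
print(f"P(both, if independent) = {p_both:.4f}")
```

The four-game rows should land close to the simulated table in the quoted post. The pooled figure depends on exactly which event is meant (7/8 combined, or 3.5/4 for each program), so it is worth stating that before comparing numbers.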

We can also convert this number to a confidence level that this result did not
come from just a random fluctuation, but this will give us just another number
which we have to interpret. A lot of people consider 95% confidence as holy - but
really that's just another number. As an eyeball guess I doubt that we have 95%
here, but anyway ..

The big question is - why is this unlikely event different than the unlikely
event of rolling 1,1,1? The reason is that there is nothing intrinsically
special about rolling 1,1,1. There is no hypothesis which is supported by it.

In fact, that's not quite right. We could make a hypothesis that a 1 is more
likely than other numbers, and we have some evidence for it. We could make a
hypothesis that whatever number we roll is more likely to arise the next time -
and we also have some evidence for it.

This is where common sense (or previous knowledge) comes in. We "know" that a 1
is as likely as a 2, we "know" that dice have no memory, so we reject any such
hypotheses and call it luck.
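
A small Python sketch of how that prior knowledge enters. It compares a fair die with a hypothetical loaded die that shows a 1 half the time, and applies Bayes' rule after seeing 1,1,1. The 0.5 for the loaded die and the three prior values are illustrative assumptions only, not anything from this thread.

```python
def posterior_loaded(prior_loaded, p_one_loaded=0.5, n_ones=3):
    """Posterior probability that the die is loaded after n_ones ones in a row."""
    like_fair = (1 / 6) ** n_ones          # chance of the sequence with a fair die
    like_loaded = p_one_loaded ** n_ones   # chance with the hypothetical loaded die
    evidence = prior_loaded * like_loaded + (1 - prior_loaded) * like_fair
    return prior_loaded * like_loaded / evidence

# How strongly 1,1,1 favors "loaded" over "fair", before any prior is applied.
likelihood_ratio = 0.5 ** 3 / (1 / 6) ** 3

print(f"likelihood ratio: {likelihood_ratio:.1f}")
for prior in (0.5, 0.01, 0.0001):
    print(f"prior {prior:>7}: posterior {posterior_loaded(prior):.4f}")
```

The sequence favors the loaded-die hypothesis by the same factor whatever the prior is; only the prior decides whether that is enough to believe it.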

In the case of Junior, if you "know" that Junior is not weaker against humans,
then you also reject any such hypothesis, and consider the result as you would
1, 1, 1. Ditto if you "know" that it cannot be weaker, according to your
understanding of computer chess, etc.

If somebody comes to you and tells you that because of this performance he
thinks it's likely that Junior is weaker than Fritz against humans, you can only
tell him what you would tell somebody who tried to draw conclusions from 1,1,1 -
that you "know" that this is impossible and therefore that the unusual result
was in fact luck. It's only from previous knowledge, though - that's the only
defense.

Vas

ps. Personally, I think that Junior and Fritz have played enough previous
games against humans that we can be quite sure that they are of comparable
level. Hence, I also believe that the result was pure luck. (Although if we were
told that the Junior version playing was experimental and turned out to be 150
points weaker, that would be no huge surprise either.)



Last modified: Thu, 15 Apr 21 08:11:13 -0700
