Computer Chess Club Archives

Subject: Re: Bilbao and statistically meaningful results

Author: Roger D Davis
Date: 10:01:16 10/10/04
On October 10, 2004 at 04:52:32, GuyHaworth wrote:

>
>See http://www.talkchess.com/forums/1/message.html?390944
>
>Kurt Utzinger has calculated the 'tournament performance ratings', i.e. most
>likely ELOs.  According to him:
>
>  FRITZ, rated at 2700, comes out +308 at 3008
>  HYDRA, rated at 2700, comes out +341 at 3041
>      even though it comes 2nd on a countback of opponents' game scores here
>  Topalov, FIDE ELO 2757, comes out -149 at 2608
>  DEEP JUNIOR, rated at 2700, comes out -092 at 2608
>  Ponomariov, FIDE ELO 2710, comes out -209 at 2501
>  Karjakin, FIDE ELO 2576, comes out -075 at 2501
>
>A player's TPR is not affected by the ELO they go in at.  In fact, the engines'
>TPRs are 'FIDE ELOs' as all their opponents have FIDE ELOs, whereas the humans'
>TPRs are 'SSDF ELOs', not to be compared with their original FIDE ELOs.
>
>Ponomariov and Karjakin both scored 1, and both TPR at 2501 because all their
>opponents were rated at 2700 before the tournament.
>
>Topalov and DEEP JUNIOR both scored 1.5, and both TPR at 2608 though the reason
>is more interesting.  The average ELO of DEEP JUNIOR's opponents was 2700,
>obviously a coincidence.
>
>The machines have an average TPR of 2853.
>
>
>But these TPRs are only the 'most likely' TPRs and I don't have a way of saying
>how likely.  SSDF give a +- band for its 'SSDF ELO' ratings, so, e.g.,
>
>  SHREDDER 8.0 CB is 2818 (+34, -32) with 70% from 481 games
>  ... against opponents averaging an 'SSDF ELO' rating of 2673
>
>  SHREDDER 7.04 UCI is 2809 (+24, -23) with 71% from 967 games
>  ... against opponents averaging an 'SSDF ELO' rating of 2648
>
>
>The interval is defined so that the actual 'SSDF ELO' of the engines has a
>probability of 0.95, or is 95% likely, to fall in the given band.  It would be
>good if FIDE would do the same.  Confidence limits for lower confidence levels
>can be calculated by standard maths.
>
>Note that 2x the games (as above) divides the width of the band by
>~square_root(2).  It would need 4x the games to divide the width of the band by
>~2.
>
>In this tournament we have 4 games rather than ~512, i.e. 2^7 times fewer.
>
>So - not at all rigorously - I would expect the 95%-confidence-interval band for
>the TPR to be square_root(2^7) = 11.314x wider than +-33, i.e. +-373.
>
>i.e. One might say that:
>
>  FRITZ is 95% likely to have a 'FIDE ELO' of 3008 +- 373, (2635, 3381)
>  HYDRA is 95% likely to have a 'FIDE ELO' of 3041 +- 373, (2668, 3414)
>  DJ    is 95% likely to have a 'FIDE ELO' of 2608 +- 373, (2235, 2981)
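The square-root scaling used above can be checked with a short Python sketch. It only reproduces the post's rough arithmetic (a reference band of +-33 at ~512 games, rescaled to 4 games); it is not a proper Elo confidence-interval calculation, and the function and variable names are made up for illustration.

```python
import math

def scaled_half_width(ref_half_width, ref_games, games):
    """Rescale a confidence half-width, assuming it shrinks as 1/sqrt(games)."""
    return ref_half_width * math.sqrt(ref_games / games)

# Figures taken from the post: +-33 at ~512 games, 4 games at Bilbao.
half_width = scaled_half_width(33, 512, 4)

# Tournament performance ratings quoted in the post.
tprs = {"FRITZ": 3008, "HYDRA": 3041, "DEEP JUNIOR": 2608}

# Rough 95% band for each engine: TPR minus/plus the rescaled half-width.
intervals = {
    name: (round(tpr - half_width), round(tpr + half_width))
    for name, tpr in tprs.items()
}

for name, (low, high) in intervals.items():
    print(f"{name}: {tprs[name]} +- {round(half_width)} -> ({low}, {high})")
```

With these inputs the half-width rounds to 373 and the bands match the three listed above.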
>
>The TPRs are much less significant than might at first seem.
>
>Maybe if someone has the proper ELO-calculating software, these 95% confidence
>intervals can be superseded by the real ones.
>
>I'm interested in the likelihood that the engines have a 'FIDE ELO' >= 2700. I
>think we can say, that on the evidence, and against carbon rather than silicon
>competition:
>
>  HYDRA and FRITZ are each at least 90% likely to be over 2700
>  DEEP JUNIOR is [clearly] over 50% likely to be less than 2700
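One way to put rough numbers on these two statements is sketched below in Python. It assumes, purely for illustration, that the uncertainty in each TPR is normally distributed and that the +-373 band is a 95% interval (about 1.96 standard deviations each side). Neither assumption comes from the post, and the helper name is hypothetical.

```python
import math

def prob_above(threshold, mean, half_width_95):
    """P(rating > threshold) under a normal model (an assumption, see lead-in)."""
    sigma = half_width_95 / 1.96          # 95% band is about +-1.96 sigma
    z = (threshold - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

HALF_WIDTH = 373   # rough 95% half-width from the post
THRESHOLD = 2700

# Tournament performance ratings quoted in the post.
tprs = {"FRITZ": 3008, "HYDRA": 3041, "DEEP JUNIOR": 2608}

probs = {name: prob_above(THRESHOLD, tpr, HALF_WIDTH) for name, tpr in tprs.items()}

for name, p in probs.items():
    print(f"{name}: P(rating > {THRESHOLD}) = {p:.2f}")
```

Under those assumptions the probabilities come out at roughly 0.95 for FRITZ, 0.96 for HYDRA and 0.31 for DEEP JUNIOR, which is consistent with the two statements above.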
>
>
>These numbers change a lot on a half-point just missed or just gained.  Machines
>don't blunder by missing a tactic in the same way as humans, and maybe the
>humans did so in this tournament: I don't know.
>
>To see how well the machines were prepared to play the opponent as well as the
>game, at least in the opening, one has to look at how they seem to emerge at say
>move 15.
>
>I haven't studied the games to see precisely what happened.  DEEP JUNIOR had the
>only machine loss, to Karjakin, so I'd ask why that was.  Amir/Shay work very
>hard to prepare for specific opponents, so their showing here with DEEP JUNIOR
>is a surprise to me.
>
>g

Thanks for the very detailed answer...much appreciated. :)

Roger