Computer Chess Club Archives



Subject: Re: Bilbao and statistically meaningful results

Author: GuyHaworth

Date: 01:52:32 10/10/04




See http://www.talkchess.com/forums/1/message.html?390944

Kurt Utzinger has calculated the 'tournament performance ratings', i.e. most
likely ELOs.  According to him:

  FRITZ, rated at 2700, comes out +308 at 3008
  HYDRA, rated at 2700, comes out +341 at 3041
      even though it comes 2nd on a countback of opponents' game scores here
  Topalov, FIDE ELO 2757, comes out -149 at 2608
  DEEP JUNIOR, rated at 2700, comes out -092 at 2608
  Ponomariov, FIDE ELO 2710, comes out -209 at 2501
  Karjakin, FIDE ELO 2576, comes out -075 at 2501

A player's TPR is not affected by the ELO they go in at.  In fact, the engines'
TPRs are 'FIDE ELOs' as all their opponents have FIDE ELOs, whereas the humans'
TPRs are 'SSDF ELOs', not to be compared with their original FIDE ELOs.

Ponomariov and Karjakin both scored 1, and both TPR at 2501 because all their
opponents were rated at 2700 before the tournament.

Topalov and DEEP JUNIOR both scored 1.5, and both TPR at 2608, though the reason
is more interesting.  Topalov's opponents were all rated at 2700, and the
average ELO of DEEP JUNIOR's opponents also happened to be 2700, obviously a
coincidence.
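
Why equal scores against equally-rated opposition give equal TPRs can be
sketched in a few lines.  This is only an illustration using the common
logistic approximation; the function name is made up here, and I don't know
which conversion table or program Kurt used, so the results (roughly 2509 and
2611) sit a few points away from his 2501 and 2608.

```python
import math

def performance_rating(avg_opponent_elo, points, games):
    """Logistic-curve estimate of a tournament performance rating.

    Assumption: rating difference = 400 * log10(p / (1 - p)), where p is the
    fraction of points scored. The player's own pre-tournament rating does
    not appear anywhere in the calculation.
    """
    fraction = points / games
    if not 0.0 < fraction < 1.0:
        raise ValueError("performance rating is undefined for 0% or 100% scores")
    return avg_opponent_elo + 400.0 * math.log10(fraction / (1.0 - fraction))

# Scores of 1/4 and 1.5/4 against opposition averaging 2700.
for points in (1.0, 1.5):
    print(points, round(performance_rating(2700, points, 4)))
```

The only inputs are the score and the opponents' average rating, which is why
two players with the same score and the same opposition average must tie.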

The machines have an average TPR of 2853.


But these TPRs are only the 'most likely' values, and I don't have a way of
saying how likely.  SSDF gives a +- band for its 'SSDF ELO' ratings, so, e.g.,

  SHREDDER 8.0 CB is 2818 (+34, -32) with 70% from 481 games
  ... against opponents averaging an 'SSDF ELO' rating of 2673

  SHREDDER 7.04 UCI is 2809 (+24, -23) with 71% from 967 games
  ... against opponents averaging an 'SSDF ELO' rating of 2648


The interval is defined so that the actual 'SSDF ELO' of the engine has a
probability of 0.95 of falling in the given band.  It would be good if FIDE
would do the same.  Confidence limits for lower confidence levels can be
calculated by standard maths.

Note that 2x the games (as above) divides the width of the band by
~square_root(2).  It would need 4x the games to divide the width of the band by
~2.
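
As a rough check of that square-root rule against the two SHREDDER entries
quoted above, here is a small sketch.  The function name is hypothetical, and
it takes the half-width as the average of the + and - figures, which is my
simplification.

```python
import math

def expected_width_ratio(games_a, games_b):
    """Ratio width_a / width_b if the band width scales as 1/sqrt(games)."""
    return math.sqrt(games_b / games_a)

# Half-widths taken as the mean of the quoted + and - figures (an assumption).
half_width_481 = (34 + 32) / 2  # SHREDDER 8.0 CB, 481 games
half_width_967 = (24 + 23) / 2  # SHREDDER 7.04 UCI, 967 games

observed = half_width_481 / half_width_967
predicted = expected_width_ratio(481, 967)
print(f"observed ratio {observed:.3f}, predicted ratio {predicted:.3f}")
```

The two ratios come out at about 1.40 and 1.42, so the rule holds up well for
these two entries, bearing in mind that their score percentages differ
slightly.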

In this tournament we have 4 games rather than ~512, i.e. 2^7 = 128 times fewer.

So - not at all rigorously - I would expect the 95%-confidence-interval band for
the TPR to be square_root(128) = ~11.314x wider than +-33, i.e. +-373.

That is, one might say that:

  FRITZ is 95% likely to have a 'FIDE ELO' of 3008 +- 373, (2635, 3381)
  HYDRA is 95% likely to have a 'FIDE ELO' of 3041 +- 373, (2668, 3414)
  DJ    is 95% likely to have a 'FIDE ELO' of 2608 +- 373, (2235, 2981)

The TPRs are much less significant than might at first seem.
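
The arithmetic behind those bands, as a minimal sketch.  It assumes, as the
estimate above does, that the band width scales as 1/square_root(games) from a
reference of +-33 at ~512 games; the function and variable names are mine.

```python
import math

def scaled_half_width(ref_half_width, ref_games, games):
    """Scale a confidence half-width assuming width ~ 1/sqrt(games)."""
    return ref_half_width * math.sqrt(ref_games / games)

# +-33 at ~512 games, rescaled to a 4-game tournament and rounded to whole ELO.
half_width = round(scaled_half_width(33, 512, 4))

tprs = {"FRITZ": 3008, "HYDRA": 3041, "DEEP JUNIOR": 2608}
intervals = {name: (tpr - half_width, tpr + half_width) for name, tpr in tprs.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {tprs[name]} +- {half_width} -> ({low}, {high})")
```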

Maybe if someone has the proper ELO-calculating software, these 95% confidence
intervals can be superseded by the real ones.

I'm interested in the likelihood that the engines have a 'FIDE ELO' >= 2700. I
think we can say that, on the evidence, and against carbon rather than silicon
competition:

  HYDRA and FRITZ are each at least 90% likely to be over 2700
  DEEP JUNIOR is [clearly] over 50% likely to be less than 2700
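
One way to put rough numbers on these two statements is sketched below.  It
rests on assumptions that go beyond the estimate itself: that the uncertainty
in the TPR is normally distributed, and that the +-373 band corresponds to
1.96 standard deviations.  With only 4 games that is crude, so treat the output
as indicative only.  The function name is hypothetical.

```python
import math

def prob_above(threshold, tpr, half_width_95):
    """P(true rating > threshold) under a normal approximation.

    Assumption: the 95% half-width equals 1.96 standard deviations.
    """
    sigma = half_width_95 / 1.96
    z = (threshold - tpr) / sigma
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for name, tpr in [("FRITZ", 3008), ("HYDRA", 3041), ("DEEP JUNIOR", 2608)]:
    print(f"{name}: P(ELO > 2700) ~ {prob_above(2700, tpr, 373):.2f}")
```

Under these assumptions FRITZ and HYDRA both come out above 0.9 and DEEP JUNIOR
comes out below 0.5, which agrees with the two statements above.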


These numbers change a lot on a half-point just missed or just gained.  Machines
don't blunder by missing a tactic in the same way as humans, and maybe the
humans did so in this tournament: I don't know.

To see how well the machines were prepared to play the opponent as well as the
game, at least in the opening, one has to look at how they seem to emerge at,
say, move 15.

I haven't studied the games to see precisely what happened.  DEEP JUNIOR had the
only machine loss, to Karjakin, so I'd ask why that was.  Amir/Shay work very
hard to prepare for specific opponents, so their showing here with DEEP JUNIOR
is a surprise to me.

g







