Computer Chess Club Archives



Subject: Re: About head or tail (was Upon scientific truth - the nature of informati

Author: Enrique Irazoqui

Date: 04:33:41 07/17/00



On July 17, 2000 at 07:15:45, Ed Schröder wrote:

>On July 17, 2000 at 06:18:38, Harald Faber wrote:
>
>>On July 16, 2000 at 17:56:22, Ed Schröder wrote:
>>
>>>On July 16, 2000 at 05:30:48, Harald Faber wrote:
>>>
>>>>On July 16, 2000 at 03:34:45, Ed Schröder wrote:
>>>>
>>>>>>posted by Dann Corbit on July 15, 2000 at 20:21:54:
>>>>>
>>>>>>Simplifying.  I have a penny.
>>>>>>I toss it twice.
>>>>>>Heads, heads.
>>>>>>I toss it twice
>>>>>>Heads, heads.
>>>>>>I toss it twice
>>>>>>Tails, heads.
>>>>>>I toss it twice
>>>>>>Heads, tails.
>>>>>
>>>>>>I count them up.
>>>>>
>>>>>>Heads are stronger than tails.
>>>>>
>>>>>>My conclusion is faulty.  Why?  Because I did not gather enough data.
>>>>>
>>>>>Right.
>>>>>
>>>>>A few months ago Christophe posted some interesting stuff here regarding
>>>>>this topic and nobody really was in agreement with him (me included) until
>>>>>I did an experiment which worked as an eye opener for me. The story is not
>>>>>funny and goes like this...
>>>>>
>>>>>In Rebel Century's Personalities you have the option [Strength of Play=100]
>>>>>The value may vary from 1 to 100 and 100 is (of course) the default value.
>>>>>
>>>>>Lowering this value causes Rebel to lower its NPS (nodes per second). This
>>>>>makes it possible to create engines that are 100% identical, the only
>>>>>difference being that they run SLOWER.
>>>>>
>>>>>I was interested to know HOW MANY games were needed to show that a 10%
>>>>>faster version could beat a 10% slower version, and by what margin. So
>>>>>I created two personalities:
>>>>>
>>>>>FAST.ENG (default settings) [Strength of Play=100]
>>>>>SLOW.ENG (default settings) [Strength of Play=80]
>>>>>
>>>>>and started to play 600 eng-eng games with Rebel's built-in autoplayer,
>>>>>using pre-defined fixed opening lines that both engines had to play as
>>>>>White and as Black.
>>>>>
>>>>>The personality whose only change was [Strength of Play=80] caused Rebel to
>>>>>slow down by exactly 10% on the machine where the marathon match took place.
>>>>>Note that this value (80) may differ on other PCs in case you want to do
>>>>>similar experiments.
>>>>>
>>>>>Here are the results of the 600 games played between the FAST and SLOW
>>>>>personalities. The first 300 games were played on a time control of "5
>>>>>seconds average". The second 300 games were played on a time control of
>>>>>"10 seconds average".
>>>>>
>>>>>FAST - SLOW   162.5 - 137.5   [ 0:05 ]
>>>>>FAST - SLOW   147.0 - 153.0   [ 0:10 ]
>>>>>
>>>>>The first match of 300 games at 5 secs looks convincing. A 54.2% score
>>>>>because of the 10% more speed seems a value one might expect.
>>>>>
>>>>>But what about the crazy result of match 2? Apparently 300 games are
>>>>>still not enough to prove that the 10% faster version is superior (of
>>>>>course it is); the match score indicates both versions are equal,
>>>>>which is not true.
>>>>>
>>>>>So how many games are needed to prove that version X is better than Y?
>>>>>
>>>>>I am sure I am trying to reinvent the wheel. The casino guys who make
>>>>>themselves a good living (with red and black) have figured it all out
>>>>>centuries ago. Perhaps there is a FAQ somewhere on the Internet that
>>>>>explains how many times you have to turn the wheel to get an exact
>>>>>50.0% division between red and black. 1000? 2000?
>>>>>
>>>>>To answer this question I wrote a little program that randomly simulates
>>>>>chess matches. It shows that 100 games is nothing; too often scores like
>>>>>60-40 appear on the screen. 500 games (and more) seem to do well, as
>>>>>most of the time match scores fall within the 49.0 - 51.0 area.
>>>>>
>>>>>The bad news (in any case for me) is that it hardly makes any sense to
>>>>>test candidate program improvements using (even) long matches. Back to
>>>>>common sense: 10% = 10% = better. Oh well...
>>>>>
>>>>>Ed
>>>>
>>>>This is exactly what I have been preaching for ages.
>>>>500 games show a tendency. If you get a 70-30 result by playing 500 games it is
>>>>unlikely that the 30-program is stronger than the 70-program. But the other
>>>>question is if the 70-program is really stronger or will it decrease to the
>>>>50%-area? Or even worse, you get a 55-45 result... Finally in computer matches
>>>>there are wide opening books. So your first 10 games might never be repeated.
>>>>Or you play another 10-game match and get a completely different result than in
>>>>your first 10 game match because of different opening lines...
>>>
>>>>So what to do to verify improvements or to get an idea if program a is stronger
>>>>than program b? I don't know.
>>>
>>>In the early days of a chess programmer it is easy but when your program
>>>is over 2300-2400 it becomes very difficult to judge a candidate program
>>>improvement. Personally I use a main set of 70-100 positions (frequently
>>>updated) which are tested manually first, then a large set of more than 500
>>>positions that runs automatically and produces a detailed report and database
>>>of every difference with regard to the previous version. If results are good
>>>then an engine-engine 300 game match is done as described above. At a
>>>later stage (after a couple of program changes) some auto232 matches
>>>are played. The latter is of minor importance (in respect to the changes)
>>>as too much randomness is involved (book, learning). In the end my feeling
>>>on a program change is the decisive factor.
>>
>>
>>Anyway this is a very time-consuming task.
>
>That's why most of us need a full year if you know what I mean.
>
>
>>>>Playing 1000 games with tournament time control
>>>>takes much too much time. Test positions don't reflect practical play.
>>>>I really have no clue.
>>>
>>>>And that is why I always say that the top-10 (!) programs
>>>>play at equal strength.
>>>
>>>That's a bold statement.
>>>
>>>Ed
>>
>>I know. Prove me wrong. :-)
>
>How about a 10 game match....?

What for? What a waste... Comp-comp won't prove a thing no matter how many games
you play. Let's take a quick look:

1 - Programs are helpless against anti-computer strategy, like Fritz in
Frankfurt and Junior in Dortmund. Their performance is inversely proportional to
human awareness of this shortcoming, and search alone won't solve the problem,
or at least it won't solve it before we all become very bald. Oh yes: in
comp-comp search is everything.

2 - Programs are essentially polite social beings: they behave like GMs amongst
GMs, like 2300s amongst 2300s. For instance, look at Junior's performance in
Dortmund and in the Israeli league.

3 - If program A has extra code to avoid closed positions and program B does
not, comp-comp won't show the difference as an advantage for A. If B is a faster
searcher, the extra code will harm A when playing B.

4 - Comp-comp games show a partial and rather uninteresting picture; their
results don't necessarily correlate with human-comp results, and watching them
can even become a threat to one's mental health.

Now go figure the statistical certainty of 10, 100 or 1000 comp-comp games.

Enrique

>Ed
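
The "how many games" question in this thread can be made concrete with a small simulation in the spirit of the program Ed describes. This is only a sketch: Ed's actual program is not shown in the thread, and the per-game probabilities (30% win, 40% draw, 30% loss), the function names, and the 95% level (z = 1.96) are all assumptions chosen for illustration. It also treats games as independent, which ignores the book and learning effects mentioned above.

```python
import math
import random
import statistics

# Assumed per-game outcome probabilities for two equal engines
# (illustrative values, not taken from the thread).
P_WIN, P_DRAW = 0.30, 0.40  # the remaining 0.30 is a loss


def simulate_match(n_games, rng, p_win=P_WIN, p_draw=P_DRAW):
    """Play n_games random games; return the score as a fraction (0..1)."""
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0          # win
        elif r < p_win + p_draw:
            score += 0.5          # draw
    return score / n_games


def score_std_error(n_games, p_win=P_WIN, p_draw=P_DRAW):
    """Theoretical standard deviation of the match score fraction."""
    mean = p_win + 0.5 * p_draw
    variance = p_win + 0.25 * p_draw - mean ** 2   # E[x^2] - E[x]^2 per game
    return math.sqrt(variance / n_games)


def games_needed(edge, z=1.96, p_win=P_WIN, p_draw=P_DRAW):
    """Games needed before a true edge (0.01 means a 51% expected score)
    exceeds z standard errors. Uses the equal-engine variance as an
    approximation."""
    per_game_sd = score_std_error(1, p_win, p_draw)
    return math.ceil((z * per_game_sd / edge) ** 2)


def observed_spread(n_games, n_matches=2000, seed=1):
    """Standard deviation of the score over many simulated matches."""
    rng = random.Random(seed)
    return statistics.pstdev(
        simulate_match(n_games, rng) for _ in range(n_matches)
    )


if __name__ == "__main__":
    for n in (10, 100, 300, 1000):
        print(n, round(score_std_error(n), 4), round(observed_spread(n), 4))
    print("games for a 1% edge:", games_needed(0.01))
```

Under these assumed rates the standard error of a 300-game score is roughly 2.2 percentage points, so both of Ed's results (162.5 and 147.0 out of 300) lie within about two standard errors of an even score. A different draw rate changes the numbers but not the square-root scaling: four times as many games are needed to halve the spread.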



