Author: Harald Faber
Date: 03:18:38 07/17/00
On July 16, 2000 at 17:56:22, Ed Schröder wrote:

>On July 16, 2000 at 05:30:48, Harald Faber wrote:
>
>>On July 16, 2000 at 03:34:45, Ed Schröder wrote:
>>
>>>>posted by Dann Corbit on July 15, 2000 at 20:21:54:
>>>
>>>>Simplifying. I have a penny.
>>>>I toss it twice.
>>>>Heads, heads.
>>>>I toss it twice.
>>>>Heads, heads.
>>>>I toss it twice.
>>>>Tails, heads.
>>>>I toss it twice.
>>>>Heads, tails.
>>>
>>>>I count them up.
>>>
>>>>Heads are stronger than tails.
>>>
>>>>My conclusion is faulty. Why? Because I did not gather enough data.
>>>
>>>Right.
>>>
>>>A few months ago Christophe posted some interesting stuff here regarding
>>>this topic and nobody really was in agreement with him (me included) until
>>>I did an experiment which worked as an eye opener for me. The story is not
>>>funny and goes like this...
>>>
>>>In Rebel Century's Personalities you have the option [Strength of Play=100].
>>>The value may vary from 1 to 100 and 100 is (of course) the default value.
>>>
>>>Lowering this value will cause Rebel to lower its NPS. This opens the
>>>possibility to create (100% equal!) engines whose only difference is
>>>that they run SLOWER.
>>>
>>>I was interested to know HOW MANY games were needed to show that a 10%
>>>faster version could beat a 10% slower version and with which numbers. So
>>>I created two personalities:
>>>
>>>FAST.ENG (default settings) [Strength of Play=100]
>>>SLOW.ENG (default settings) [Strength of Play=80]
>>>
>>>and started to play 600 eng-eng games with Rebel's built-in autoplayer
>>>with pre-defined fixed opening lines both engines had to play with white
>>>and black.
>>>
>>>The personality whose only change was [Strength of Play=80] caused Rebel to
>>>slow down by exactly 10% on the machine the marathon match took place on.
>>>Note that this value (80) may differ on other PCs in case you want to do
>>>similar experiments.
>>>
>>>Here are the results of the 600 games played between the FAST and SLOW
>>>personalities.
>>>The first 300 games were played on a time control of "5
>>>seconds average". The second 300 games were played on a time control of
>>>"10 seconds average".
>>>
>>>FAST - SLOW 162.5 - 137.5 [ 0:05 ]
>>>FAST - SLOW 147.0 - 153.0 [ 0:10 ]
>>>
>>>The first match of 300 games at 5-secs looks convincing. A 54.1% score
>>>because of the 10% more speed seems a value one might expect.
>>>
>>>But what about the crazy result of match 2? Apparently even 300 games are
>>>still not enough to prove that the 10% faster version is superior (of
>>>course it is) but the match score indicates both versions are equal
>>>which is not true.
>>>
>>>So how many games are needed to prove that version X is better than Y?
>>>
>>>I am sure I am trying to reinvent the wheel. The casino guys who make
>>>themselves a good living (with red and black) figured it all out
>>>centuries ago. Perhaps there is a FAQ somewhere on the Internet that
>>>explains how many times you have to turn the wheel to get an exact
>>>50.0% division between red and black. 1000? 2000?
>>>
>>>To answer this question I wrote a little program that randomly emulates
>>>chess matches. It shows that 100 games is nothing, too often scores like
>>>60-40 appear on the screen. 500 games (and higher) seems to do well as
>>>most of the time match scores fall within the 49.0 - 51.0 area.
>>>
>>>The bad news (in any case for me) is that it hardly makes any sense to
>>>test candidate program improvements using (even) long matches. Back to
>>>common sense: 10% = 10% = better. Oh well...
>>>
>>>Ed
>>
>>This is exactly what I have been preaching for ages.
>>500 games show a tendency. If you get a 70-30 result by playing 500 games it is
>>unlikely that the 30-program is stronger than the 70-program. But the other
>>question is if the 70-program is really stronger or will it decrease to the
>>50%-area? Or even worse, you get a 55-45 result... Finally in computer matches
>>there are wide opening books. So your first 10 games might never be repeated.
>>Or you play another 10-game match and get a completely different result than in
>>your first 10-game match because of different opening lines...
>
>>So what to do to verify improvements or to get an idea if program a is stronger
>>than program b? I don't know.
>
>In the early days of a chess programmer it is easy but when your program
>is over 2300-2400 it becomes very difficult to judge a candidate program
>improvement. Personally I use a main set of 70-100 positions (frequently
>updated) which are tested manually first, then a large set of >500 positions
>that runs automatically and produces a detailed report and database of
>every difference in regard to the previous version. If results are good
>then an engine-engine 300 game match is done as described above. At a
>later stage (after a couple of program changes) some auto232 matches
>are played. The latter is of minor importance (in respect to the changes)
>as too much randomness is involved (book, learning). In the end my feeling
>on a program change is the decisive factor.

Anyway this is a very time-consuming task.

>>Playing 1000 games with tournament time control
>>takes far too much time. Test positions don't reflect practical play.
>>I really have no clue.
>
>>And that is why I always say that the top-10 (!) programs
>>play at equal strength.
>
>That's a bold statement.
>
>Ed

I know. Prove me wrong. :-)
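
For anyone who wants to try this at home: the "little program that randomly emulates chess matches" from the quoted post is not shown in the thread, so what follows is only a minimal Python sketch of the idea, not Ed's program. The 30% win / 40% draw / 30% loss split for two equal engines is an assumption picked for illustration, and the function names (simulate_match, share_within_band, score_std_error) are made up here. The sketch plays many random matches of a given length, counts how often the score lands in the 49-51% band, and prints the standard error of the match score, which for independent games shrinks with the square root of the number of games.

```python
import math
import random


def simulate_match(games, p_win, p_draw, rng):
    """Play one simulated match; return engine A's score as a fraction of games."""
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < p_win:
            points += 1.0          # A wins
        elif r < p_win + p_draw:
            points += 0.5          # draw
        # otherwise A loses and scores nothing
    return points / games


def share_within_band(games, matches, p_win, p_draw, low, high, seed=0):
    """Fraction of simulated matches whose score lands in [low, high]."""
    rng = random.Random(seed)
    hits = sum(
        low <= simulate_match(games, p_win, p_draw, rng) <= high
        for _ in range(matches)
    )
    return hits / matches


def score_std_error(games, p_win, p_draw):
    """Standard deviation of the match score fraction for independent games."""
    p_loss = 1.0 - p_win - p_draw
    mean = p_win + 0.5 * p_draw
    # Variance of a single game's result (1, 0.5 or 0 points) around its mean.
    var_per_game = (
        p_win * (1.0 - mean) ** 2
        + p_draw * (0.5 - mean) ** 2
        + p_loss * mean ** 2
    )
    return math.sqrt(var_per_game / games)


if __name__ == "__main__":
    # ASSUMPTION: two equal engines with 30% wins, 40% draws, 30% losses.
    for n in (100, 300, 500, 2000):
        share = share_within_band(n, 1000, 0.30, 0.40, 0.49, 0.51)
        se = score_std_error(n, 0.30, 0.40)
        print(f"{n:5d} games: {share:.0%} of matches in 49-51%, std error {se:.1%}")
```

Under these assumed rates the standard error of a 300-game match works out to roughly 2.2 percentage points, so the exact numbers depend heavily on the draw rate you plug in. Change p_win and p_draw to model a slightly stronger engine and see how often it still loses the match.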
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.