Author: Ed Schröder
Date: 04:15:45 07/17/00
On July 17, 2000 at 06:18:38, Harald Faber wrote:

>On July 16, 2000 at 17:56:22, Ed Schröder wrote:
>
>>On July 16, 2000 at 05:30:48, Harald Faber wrote:
>>
>>>On July 16, 2000 at 03:34:45, Ed Schröder wrote:
>>>
>>>>>posted by Dann Corbit on July 15, 2000 at 20:21:54:
>>>>
>>>>>Simplifying. I have a penny.
>>>>>I toss it twice.
>>>>>Heads, heads.
>>>>>I toss it twice
>>>>>Heads, heads.
>>>>>I toss it twice
>>>>>Tails, heads.
>>>>>I toss it twice
>>>>>Heads, tails.
>>>>
>>>>>I count them up.
>>>>
>>>>>Heads are stronger than tails.
>>>>
>>>>>My conclusion is faulty. Why? Because I did not gather enough data.
>>>>
>>>>Right.
>>>>
>>>>A few months ago Christophe posted some interesting stuff here regarding
>>>>this topic and nobody really was in agreement with him (me included) until
>>>>I did an experiment which worked as an eye opener for me. The story is not
>>>>funny and goes like this...
>>>>
>>>>In Rebel Century's Personalities you have the option [Strength of Play=100]
>>>>The value may vary from 1 to 100 and 100 is (of course) the default value.
>>>>
>>>>Lowering this value will cause Rebel to lower its NPS. This opens the
>>>>possibility to create (100% equal!) engines whose only difference is
>>>>that they run SLOWER.
>>>>
>>>>I was interested to know HOW MANY games are needed to show that a 10%
>>>>faster version can beat a 10% slower version, and by what margin. So
>>>>I created two personalities:
>>>>
>>>>FAST.ENG (default settings) [Strength of Play=100]
>>>>SLOW.ENG (default settings) [Strength of Play=80]
>>>>
>>>>and started to play 600 eng-eng games with Rebel's built-in autoplayer
>>>>with pre-defined fixed opening lines both engines had to play with white
>>>>and black.
>>>>
>>>>The personality with [Strength of Play=80] as its only change caused Rebel
>>>>to slow down by exactly 10% on the machine where the marathon match took
>>>>place. Note that this value (80) may differ on other PCs in case you want
>>>>to do similar experiments.
>>>>
>>>>Here are the results of the 600 games played between the FAST and SLOW
>>>>personalities. The first 300 games were played on a time control of "5
>>>>seconds average". The second 300 games were played on a time control of
>>>>"10 seconds average".
>>>>
>>>>FAST - SLOW 162.5 - 137.5 [ 0:05 ]
>>>>FAST - SLOW 147.0 - 153.0 [ 0:10 ]
>>>>
>>>>The first match of 300 games at 5 secs looks convincing. A 54.2% score
>>>>because of the 10% more speed seems a value one might expect.
>>>>
>>>>But what about the crazy result of match 2? Apparently even 300 games are
>>>>still not enough to prove that the 10% faster version is superior (of
>>>>course it is), but the match score indicates both versions are equal,
>>>>which is not true.
>>>>
>>>>So how many games are needed to prove that version X is better than Y?
>>>>
>>>>I am sure I am trying to reinvent the wheel. The casino guys who make
>>>>themselves a good living (with red and black) have figured it all out
>>>>centuries ago. Perhaps there is a FAQ somewhere on the Internet that
>>>>explains how many times you have to turn the wheel to get an exact
>>>>50.0% division between red and black. 1000? 2000?
>>>>
>>>>To answer this question I wrote a little program that randomly emulates
>>>>chess matches. It shows that 100 games is nothing, too often scores like
>>>>60-40 appear on the screen. 500 games (and higher) seems to do well as
>>>>most of the time match scores fall within the 49.0 - 51.0 area.
>>>>
>>>>The bad news (in any case for me) is that it hardly makes any sense to
>>>>test candidate program improvements using (even) long matches. Back to
>>>>common sense: 10% = 10% = better. Oh well...
>>>>
>>>>Ed
>>>
>>>This is exactly what I have been preaching for ages.
>>>500 games show a tendency. If you get a 70-30 result by playing 500 games it is
>>>unlikely that the 30-program is stronger than the 70-program. But the other
>>>question is if the 70-program is really stronger or will it decrease to the
>>>50%-area? Or even worse, you get a 55-45 result... Finally, in computer matches
>>>there are wide opening books. So your first 10 games might never be repeated.
>>>Or you play another 10-game match and get a completely different result than in
>>>your first 10-game match because of different opening lines...
>>
>>>So what to do to verify improvements or to get an idea if program a is stronger
>>>than program b? I don't know.
>>
>>In the early days of a chess programmer it is easy but when your program
>>is over 2300-2400 it becomes very difficult to judge a candidate program
>>improvement. Personally I use a main set of 70-100 positions (frequently
>>updated) which are tested manually first, then a large set of more than 500
>>positions that runs automatically and produces a detailed report and
>>database of every difference with regard to the previous version. If results
>>are good then an engine-engine 300 game match is done as described above.
>>At a later stage (after a couple of program changes) some auto232 matches
>>are played. The latter is of minor importance (in respect to the changes)
>>as too much randomness is involved (book, learning). In the end my feeling
>>on a program change is the decisive factor.
>
>
>Anyway this is a very time-consuming task.

That's why most of us need a full year if you know what I mean.

>>>Playing 1000 games with tournament time control
>>>takes much too much time. Test positions don't reflect practical play.
>>>I really have no clue.
>>
>>>And that is why I always say that the top-10 (!) programs
>>>play at equal strength.
>>
>>That's a bold statement.
>>
>>Ed
>
>I know. Prove me wrong. :-)

How about a 10 game match....?

Ed
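
The "little program that randomly emulates chess matches" quoted above is not shown in the thread. What follows is only a minimal sketch of that kind of simulation, not the original program. It assumes two equal engines, and the per-game win and draw probabilities (30% and 40%) are made-up illustration values, as are the function names. It measures how widely match scores scatter for a given number of games and compares that with the standard deviation computed directly from the per-game score variance.

```python
import math
import random


def simulate_match(n_games, p_win=0.3, p_draw=0.4, rng=random):
    """Return engine A's points over n_games (win = 1, draw = 0.5, loss = 0).

    p_win and p_draw are assumed illustration values, not measured rates.
    """
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score


def score_spread(n_games, n_matches=1000, seed=1, p_win=0.3, p_draw=0.4):
    """Simulate many matches; return (mean, std dev) of the score in percent."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pcts = [
        100.0 * simulate_match(n_games, p_win, p_draw, rng) / n_games
        for _ in range(n_matches)
    ]
    mean = sum(pcts) / len(pcts)
    sd = math.sqrt(sum((p - mean) ** 2 for p in pcts) / len(pcts))
    return mean, sd


def theoretical_sd(n_games, p_win=0.3, p_draw=0.4):
    """Std dev of the match score in percent, from the per-game variance."""
    mean = p_win + 0.5 * p_draw            # expected points per game
    var = p_win + 0.25 * p_draw - mean ** 2  # E[X^2] - E[X]^2
    return 100.0 * math.sqrt(var / n_games)


if __name__ == "__main__":
    for n in (100, 300, 500):
        mean, sd = score_spread(n)
        print(f"{n:4d} games: mean {mean:.1f}%, simulated sd {sd:.2f}, "
              f"theoretical sd {theoretical_sd(n):.2f}")
```

Under these assumed rates the standard deviation of a 300-game match score is about 2.2 percentage points, so results of 54.2% and 49.0% for the same pair of engines are both within the scatter one would expect around a small true edge. A different draw rate changes the numbers, but the spread still shrinks only with the square root of the number of games.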
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.