Computer Chess Club Archives


Subject: Re: a number and two questions for bob

Author: martin fierz

Date: 14:29:00 05/05/04


On May 05, 2004 at 11:18:52, Robert Hyatt wrote:

>On May 05, 2004 at 05:10:36, martin fierz wrote:
>
>>hi bob,
>>
>>rereading your DTS paper (you sent me a copy once), i took the 24 speedup
>>numbers you reported for 4 processors (listed at the end, for anybody interested).
>>
>>i get (using a black box):
>>
>>av. speedup: 3.65
>>standard deviation of sample: 0.31
>>standard error of average 0.064
>>
>>so: average speedup(N=4) = 3.65 +- 0.07 would be a nice way to put this.
>
>Where does the standard error come from?

as GCP already mentioned, std. err. = std. dev. / sqrt(N).

>Are you looking at the speedups for
>two positions and using the difference (summed over all positions) as the error?

so no, just the simple formula above.
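
in case it helps, here's a minimal sketch of that computation in python - the
speedup values below are just placeholders, not the actual numbers from the
DTS paper:

  import math

  # hypothetical per-position speedups - placeholders, not the DTS data
  speedups = [3.2, 3.9, 3.6, 3.4, 3.8, 3.7]

  n = len(speedups)
  mean = sum(speedups) / n
  # sample standard deviation (n - 1 in the denominator)
  std_dev = math.sqrt(sum((x - mean) ** 2 for x in speedups) / (n - 1))
  # standard error of the average: std. dev. / sqrt(N)
  std_err = std_dev / math.sqrt(n)

  print("average speedup: %.2f +/- %.2f (std. dev. %.2f)" % (mean, std_err, std_dev))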

> That's one kind of error.  The other is the non-repeatable error, which is the
>real problem that needs addressing.  If I run the _same_ test again, rather than
>3.65 it might produce 3.3 or 3.9, which is the real problem...

this is not a real problem. when i measure something in physics, this happens
ALL THE TIME. that's why we physicists are rather comfortable with repeating
experiments many times, and using statistics to find "true" values - it's the
only way we can measure something. i understand that computer scientists mostly
work with deterministic stuff, and therefore feel a little uncomfortable with
this, but i can assure you, it's just fine :-)
all you need is enough measurements (personally, i find 24 or 30 positions
rather few). you can also repeat the same test a number of times, and see
whether the results really vary wildly...
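
to illustrate the "repeat the same test" point, here's a little sketch - the
measurement function is only a stand-in that fakes a noisy 4-cpu speedup with
random numbers, it doesn't search anything:

  import random
  import statistics

  def measure_speedup_once():
      # stand-in for one complete run of the test set on 4 cpus;
      # just draws a noisy value around an assumed true speedup
      return random.gauss(3.65, 0.3)

  runs = [measure_speedup_once() for _ in range(10)]
  mean = statistics.mean(runs)
  std_err = statistics.stdev(runs) / len(runs) ** 0.5
  print("over %d repeats: %.2f +/- %.2f" % (len(runs), mean, std_err))

if the repeated runs scatter far outside the quoted error bar, the error
estimate was too optimistic - that's all i mean by "vary wildly".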


>The 2.8 number came from the same test set used in the DTS paper, as for some
>reason Vincent thought that Crafty would produce _zero_ speedup on those.  GCP
>ran the test on a quad 550mhz machine of mine.  The 3.1 was produced by my
>running the _same_ test set on my quad 700mhz box.  I sent both log files so
>they can confirm both my 3.1 and GCP's 2.8.  That just shows the variability.  I
>have seen one 3.4 on that test set BTW, whether it might do even better is just
>a guess.  And whether today's Crafty will do better or worse on that particular
>problem set is also unknown although I should probably run it to see, since so
>much has changed (evaluation, extensions, etc) in the past couple of years.
>
>I _believe_ there were 30 positions, but if you are looking at the DTS paper, we
>used exactly those positions so it will give you the right number of FEN
>strings...
>
>>2) can you give a similar error estimate for the 3.1 number (both std. dev and
>>std. error)? or even better, a full set of numbers so that i can do with them
>>whatever i want, since you seem so reluctant to compute std/ste? :-)
>
>What I can do is run a 1, 2 and 4 cpu run and either post the entire log, or
>just the "time line" grepped from each log to give the time and total nodes
>searched...
>
>If you want just the grepped info, the next step would be for me to give you one
>set of data for 1 cpu, and maybe 4 sets of data for 2 and 4, so that you can see
>the error between positions as well as the overall error or variance...

yes, that would be nice!
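
just to show what i have in mind, assuming the grepped time lines boil down to
one search time per position per cpu count (the times below are made up, not
from any crafty log):

  import math

  # hypothetical search times in seconds, one entry per position
  times_1cpu = [412.0, 388.5, 501.2, 296.7, 633.1]
  times_4cpu = [118.3, 102.9, 144.6,  79.5, 171.8]

  speedups = [t1 / t4 for t1, t4 in zip(times_1cpu, times_4cpu)]
  n = len(speedups)
  mean = sum(speedups) / n
  std_dev = math.sqrt(sum((s - mean) ** 2 for s in speedups) / (n - 1))
  std_err = std_dev / math.sqrt(n)
  print("speedup(N=4) = %.2f +/- %.2f" % (mean, std_err))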


>>3) right, question 3 of 2 :-): you claimed somewhere deep down in the other
>>thread that it matters whether you look at related or unrelated positions. you
>>could prove/disprove this experimentally with a set of related positions (eg
>>from games of crafty on ICC) vs. a large test set (e.g. WAC).
>
>
>Yes, although I think the basic proof is trivial.  On related positions you
>simply search deeper due to the hash table effects.  Schaeffer and others have
>repeatedly found (myself included, I failed to add) that deeper searches make the
>search more efficient.  But doing this test is harder.  IE it isn't reasonable
>to search to "fixed depth" for each position as that is not how it works in a
>real game and it can skew the times somewhat...  adding yet another bit of
>variability...

i see - admittedly that makes it a bit more problematic. still, you could run it
to a fixed depth; that's better than nothing. and even if you are right, and the
result is as trivial as you say: aren't you at all interested to see how big the
difference is? :-)


>If you want me to run it and post the grepped numbers, you will see why I don't
>do it often.  There is a _lot_ of variability.  IE for four processors I feel
>perfectly comfortable claiming 3.0 +/- .3 for example.  That +/- .3 is a pretty
>big spread but within reason.  I am also certain that testing on problem sets
>produces different results than testing on a real game.  But using a real game
>makes it difficult for us to compare program A with B, since they wouldn't play
>the same game, and testing on different test sets can easily produce different
>numbers...

the results in the DTS paper had a very small variability compared to the 0.3
you just quoted. of course, i'm talking about the standard error of the average
here, not the variance.

cheers
  martin


