Computer Chess Club Archives


Subject: Re: a number and two questions for bob

Author: Robert Hyatt

Date: 08:18:52 05/05/04



On May 05, 2004 at 05:10:36, martin fierz wrote:

>hi bob,
>
>rereading your DTS paper (you sent me a copy once), you reported 24 speedup
>numbers for 4 processors (given in the end, for anybody interested).
>
>i get (using a black box):
>
>av. speedup: 3.65
>standard deviation of sample: 0.31
>standard error of average 0.064
>
>so: average speedup(N=4) = 3.65 +- 0.07 would be a nice way to put this.

Where does the standard error come from?  Are you looking at the differences in
speedup from one position to the next (summed over all positions) as the error?
 That's one kind of error.  The other is the non-repeatable error, which is the
real problem that needs addressing.  If I run the _same_ test again, rather than
3.65 it might produce 3.3 or 3.9, which is the real problem...
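
For reference, a minimal Python sketch of the per-position calculation martin
appears to have done, using the 24 four-processor speedups from Table 4 that he
quotes at the bottom of his post.  Note that this only measures the spread
across positions within one run; it says nothing about the run-to-run
(non-repeatable) variation described above.

    # mean, sample standard deviation, and standard error of the 24
    # four-processor speedups from Table 4 of the DTS paper
    from math import sqrt

    speedups = [3.4, 3.6, 3.7, 3.9, 3.6, 3.7, 3.6, 3.7, 3.6, 3.8, 3.7, 3.8,
                3.8, 3.5, 3.7, 3.9, 2.6, 2.9, 3.8, 3.9, 4.0, 3.7, 3.8, 3.9]

    n = len(speedups)
    mean = sum(speedups) / n
    std_dev = sqrt(sum((s - mean) ** 2 for s in speedups) / (n - 1))  # sample std. dev.
    std_err = std_dev / sqrt(n)                                       # std. error of the mean

    print("average speedup : %.2f" % mean)      # 3.65
    print("std. deviation  : %.2f" % std_dev)   # 0.31
    print("std. error      : %.3f" % std_err)   # 0.064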


>
>for those who don't have the paper, this was done on a cray, so it's not
>comparable to crafty on an average N-way box you might have (and methinks this
>experiment was done with cray blitz).

Correct...

>
>this leads to two follow-up questions:
>1) where does the 3.1 for crafty come from you usually quote? did you ever
>publish a similar set of numbers for crafty? any .pdf / .ps to download for
>that? where do the numbers 2.8 / 3.0 of vincent+GCP come from? how many
>positions were in that test?


The 3.1 comes from running a large number of positions several years back.  I am
pretty sure that I posted the positions and the actual times/logs here, but I
won't try to guarantee that...

The 2.8 number came from the same test set used in the DTS paper, since for some
reason Vincent thought that Crafty would produce _zero_ speedup on those
positions.  GCP ran the test on a quad 550MHz machine of mine.  The 3.1 was
produced by my running the _same_ test set on my quad 700MHz box.  I sent both
log files so they could confirm both my 3.1 and GCP's 2.8.  That just shows the
variability.  I have seen one 3.4 on that test set, BTW; whether it might do
even better is just a guess.  And whether today's Crafty will do better or worse
on that particular problem set is also unknown, although I should probably run
it to see, since so much has changed (evaluation, extensions, etc.) in the past
couple of years.

I _believe_ there were 30 positions, but if you are looking at the DTS paper, we
used exactly those positions so it will give you the right number of FEN
strings...






>2) can you give a similar error estimate for the 3.1 number (both std. dev and
>std. error)? or even better, a full set of numbers so that i can do with them
>whatever i want, since you seem so reluctant to compute std/ste? :-)

What I can do is make 1-, 2-, and 4-cpu runs and either post the entire logs, or
just the "time" line grepped from each log to give the time and total nodes
searched...

If you want just the grepped info, the next step would be for me to give you one
set of data for 1 cpu, and maybe four sets of data each for 2 and 4 cpus, so that
you can see the error between positions as well as the overall error or variance...
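
As a rough sketch of what could then be done with such grepped data (the file
names and the one-time-per-position line format here are hypothetical, purely
for illustration):

    # sketch: per-position speedups from grepped "time" lines, plus the
    # run-to-run spread across repeated parallel runs
    # (file names and the "position seconds" line format are made up)
    from math import sqrt

    def read_times(filename):
        times = {}
        with open(filename) as f:
            for line in f:
                pos, secs = line.split()
                times[pos] = float(secs)
        return times

    base = read_times("cpu1.times")                 # the single 1-cpu reference run
    runs = [read_times("cpu4.run%d.times" % i)      # several repeated 4-cpu runs
            for i in range(1, 5)]

    means = []
    for i, run in enumerate(runs, 1):
        speedups = [base[p] / run[p] for p in base]
        n = len(speedups)
        mean = sum(speedups) / n
        sd = sqrt(sum((s - mean) ** 2 for s in speedups) / (n - 1))
        means.append(mean)
        print("run %d: mean speedup %.2f, std. dev %.2f across positions" % (i, mean, sd))

    # the spread of the per-run means is the run-to-run (non-repeatable) variability
    grand = sum(means) / len(means)
    run_sd = sqrt(sum((m - grand) ** 2 for m in means) / (len(means) - 1))
    print("run-to-run std. dev of the mean speedup: %.2f" % run_sd)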


>3) right, question 3 of 2 :-): you claimed somewhere deep down in the other
>thread that it matters whether you look at related or unrelated positions. you
>could prove/disprove this experimentally with a set of related positions (eg
>from games of crafty on ICC) vs. a large test set (e.g. WAC).


Yes, although I think the basic proof is trivial.  On related positions you
simply search deeper due to the hash table effects.  Schaeffer and others have
repeatedly found (myself included, I failed to add) that deeper searches make the
search more efficient.  But doing this test is harder.  I.e., it isn't reasonable
to search to a "fixed depth" for each position, as that is not how it works in a
real game, and it can skew the times somewhat...  adding yet another bit of
variability...




>
>why is this important? without error estimates, you can discuss forever whether
>2.8/3.0 are the same as 3.1. without hard data on 3) you can also discuss
>forever whether the issue in 3) matters or not, and if it does, in what way and
>how important it is.
>
>this is a simple experiment to do, and since my profession is about measuring
>numbers i don't understand that you don't do it ;-)


If you want me to run it and post the grepped numbers, you will see why I don't
do it often.  There is a _lot_ of variability.  I.e., for four processors I feel
perfectly comfortable claiming 3.0 +/- .3, for example.  That +/- .3 is a pretty
big spread, but within reason.  I am also certain that testing on problem sets
produces different results than testing on a real game.  But using a real game
makes it difficult for us to compare program A with program B, since they wouldn't
play the same game, and testing on different test sets can easily produce different
numbers...




>
>cheers
>  martin
>
>
>results in table 4 for 4 processors:
>3.4
>3.6
>3.7
>3.9
>3.6
>3.7
>3.6
>3.7
>3.6
>3.8
>3.7
>3.8
>3.8
>3.5
>3.7
>3.9
>2.6
>2.9
>3.8
>3.9
>4.0
>3.7
>3.8
>3.9


