Author: Robert Hyatt
Date: 08:18:52 05/05/04
On May 05, 2004 at 05:10:36, martin fierz wrote:

>hi bob,
>
>rereading your DTS paper (you sent me a copy once), you reported 24 speedup
>numbers for 4 processors (given in the end, for anybody interested).
>
>i get (using a black box):
>
>av. speedup: 3.65
>standard deviation of sample: 0.31
>standard error of average: 0.064
>
>so: average speedup(N=4) = 3.65 +- 0.07 would be a nice way to put this.

Where does the standard error come from? Are you looking at the speedups for
two positions and using the difference (summed over all positions) as the
error? That's one kind of error. The other is the non-repeatable error, which
is the real problem that needs addressing. If I run the _same_ test again,
rather than 3.65 it might produce 3.3 or 3.9, which is the real problem...

>for those who don't have the paper, this was done on a cray, so it's not
>comparable to crafty on an average N-way box you might have (and methinks this
>experiment was done with cray blitz).

Correct...

>this leads to two follow-up questions:
>1) where does the 3.1 for crafty come from you usually quote? did you ever
>publish a similar set of numbers for crafty? any .pdf / .ps to download for
>that? where do the numbers 2.8 / 3.0 of vincent+GCP come from? how many
>positions were in that test?

The 3.1 comes from running a large number of positions several years back. I
am pretty sure that I posted the positions and the actual times/logs here, but
I won't try to guarantee that... The 2.8 number came from the same test set
used in the DTS paper, as for some reason Vincent thought that Crafty would
produce _zero_ speedup on those. GCP ran the test on a quad 550mhz machine of
mine. The 3.1 was produced by my running the _same_ test set on my quad
700mhz box. I sent both the log files so they can confirm both my 3.1 and
GCP's 2.8. That just shows the variability. I have seen one 3.4 on that test
set, BTW; whether it might do even better is just a guess. And whether today's
Crafty will do better or worse on that particular problem set is also unknown,
although I should probably run it to see, since so much has changed
(evaluation, extensions, etc.) in the past couple of years. I _believe_ there
were 30 positions, but if you are looking at the DTS paper, we used exactly
those positions, so it will give you the right number of FEN strings...

>2) can you give a similar error estimate for the 3.1 number (both std. dev and
>std. error)? or even better, a full set of numbers so that i can do with them
>whatever i want, since you seem so reluctant to compute std/ste? :-)

What I can do is run a 1, 2 and 4 cpu run and either post the entire log, or
just the "time line" grepped from each log to give the time and total nodes
searched... If you want just the grepped info, the next step would be for me
to give you one set of data for 1 cpu, and maybe 4 sets of data for 2 and 4,
so that you can see the error between positions as well as the overall error
or variance...
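[Editor's note: a minimal Python sketch of how such grepped per-position times
could be turned into speedups and a spread estimate. The time values and the
idea of one time per position per log are hypothetical placeholders, not
Crafty's actual log format, which is not shown in this post.]

```python
# Sketch: per-position speedups from a 1-cpu run and a 4-cpu run.
# The lists below are hypothetical seconds per position; in practice
# they would be grepped from the corresponding log files.
import statistics

times_1cpu = [112.4, 87.9, 203.1]   # hypothetical 1-cpu times per position
times_4cpu = [31.2, 24.8, 55.0]     # hypothetical 4-cpu times, same positions

speedups = [t1 / t4 for t1, t4 in zip(times_1cpu, times_4cpu)]

mean = statistics.mean(speedups)
stdev = statistics.stdev(speedups)        # sample standard deviation
sterr = stdev / len(speedups) ** 0.5      # standard error of the mean

print("per-position speedups:", [round(s, 2) for s in speedups])
print(f"average speedup {mean:.2f} +/- {sterr:.2f} (std. dev. {stdev:.2f})")
```

Repeating the 2-cpu and 4-cpu runs several times, as suggested above, would
also expose the run-to-run (non-repeatable) part of the error, not just the
position-to-position spread.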
>3) right, question 3 of 2 :-): you claimed somewhere deep down in the other
>thread that it matters whether you look at related or unrelated positions. you
>could prove/disprove this experimentally with a set of related positions (eg
>from games of crafty on ICC) vs. a large test set (e.g. WAC).

Yes, although I think the basic proof is trivial. On related positions you
simply search deeper due to the hash table effects. Schaeffer and others have
repeatedly found (myself included, I failed to add) that deeper searches make
the search more efficient. But doing this test is harder. I.e., it isn't
reasonable to search to "fixed depth" for each position, as that is not how it
works in a real game, and it can skew the times somewhat... adding yet another
bit of variability...

>why is this important? without error estimates, you can discuss forever whether
>2.8/3.0 are the same as 3.1. without hard data on 3) you can also discuss
>forever whether the issue in 3) matters or not, and if it does, in what way and
>how important it is.
>
>this is a simple experiment to do, and since my profession is about measuring
>numbers i don't understand that you don't do it ;-)

If you want me to run it and post the grepped numbers, you will see why I
don't do it often. There is a _lot_ of variability. I.e., for four processors
I feel perfectly comfortable claiming 3.0 +/- .3, for example. That +/- .3 is
a pretty big spread, but within reason. I am also certain that testing on
problem sets produces different results than testing on a real game. But using
a real game makes it difficult for us to compare program A with B, since they
wouldn't play the same game, and testing on different test sets can easily
produce different numbers...

>cheers
>  martin
>
>
>results in table 4 for 4 processors:
>3.4
>3.6
>3.7
>3.9
>3.6
>3.7
>3.6
>3.7
>3.6
>3.8
>3.7
>3.8
>3.8
>3.5
>3.7
>3.9
>2.6
>2.9
>3.8
>3.9
>4.0
>3.7
>3.8
>3.9
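[Editor's note: as a check on the summary figures quoted at the top of the
post, a short Python sketch that recomputes the average, sample standard
deviation, and standard error from those 24 table-4 values, assuming the usual
sample formulas.]

```python
# Recompute the summary statistics from the 24 four-processor speedups
# listed above (table 4 of the DTS paper).
import statistics

speedups = [3.4, 3.6, 3.7, 3.9, 3.6, 3.7, 3.6, 3.7, 3.6, 3.8, 3.7, 3.8,
            3.8, 3.5, 3.7, 3.9, 2.6, 2.9, 3.8, 3.9, 4.0, 3.7, 3.8, 3.9]

mean = statistics.mean(speedups)            # 3.65
stdev = statistics.stdev(speedups)          # ~0.31, sample standard deviation
sterr = stdev / len(speedups) ** 0.5        # ~0.064, standard error of the mean

print(f"average speedup(N=4) = {mean:.2f} +- {sterr:.2f} (std. dev. {stdev:.2f})")
```

This reproduces the 3.65, 0.31, and 0.064 figures quoted in the post; it says
nothing, however, about the run-to-run variability discussed above.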