Author: Dann Corbit
Date: 12:25:08 05/22/01
On May 22, 2001 at 14:50:00, James T. Walker wrote:

>On May 22, 2001 at 13:39:22, Dann Corbit wrote:
>
>>On May 22, 2001 at 13:27:09, stuart taylor wrote:
>>[snip]
>>>Yes, Crafty is number 4, which I overlooked. Sorry!
>>>But I didn't overlook the commercial list. But that was very few games,
>>>which is good, but says very little. But you can't just bundle all the
>>>amateur programs together with it to make CM8K look so great overall.
>>>It's nowhere near the same category as tests against recent commercial
>>>programs.
>>
>>In what way?
>>
>>According to the last WMCCC, the world champion was Shredder. The runner-up
>>was Ferret, an amateur program.
>>
>>In the previous CCT contest, in which several commercial programs
>>participated, the winner was...
>>Crafty -- an amateur program.
>>
>>Look at the recent Leiden contest. Some amateur programs were near the top
>>and triumphed over some commercial entries. There used to be a large gap
>>between the amateur and commercial programs. I believe that the gap was
>>mostly due to the superior opening books of the commercial programs. That
>>gap has narrowed, as the amateur entries now operate with sophisticated
>>opening books.
>>
>>I believe that the gap between the strongest amateur programs and the
>>strongest commercial programs is very small. Of course, there is not enough
>>empirical data to back up my assertion, so it is only an opinion.
>
>What do you mean by "very small??" The latest SSDF list has Deep Fritz at
>2650 after 470 games and Crafty 17.07 at 2487 after 857 games. What would
>constitute "enough empirical data??" On what empirical data do you base
>your "opinion?"

     Program               Hardware             Rating   +    -  Games  Score   Opp
  1  Deep Fritz            128MB K6-2 450 MHz     2650  34  -32    470    66%  2537
  2  Fritz 6.0             128MB K6-2 450 MHz     2626  24  -24    897    66%  2512
  3  Junior 6.0            128MB K6-2 450 MHz     2594  22  -21   1109    64%  2490
  4  Chess Tiger 12.0 DOS  128MB K6-2 450 MHz     2578  27  -27    691    62%  2492
  6  Fritz 5.32            128MB K6-2 450 MHz     2547  26  -26    741    59%  2485
  6  Nimzo 7.32            128MB K6-2 450 MHz     2547  24  -24    857    59%  2485
  8  Nimzo 8.0             128MB K6-2 450 MHz     2539  30  -30    546    58%  2486
  9  Gandalf 4.32f         128MB K6-2 450 MHz     2529  29  -29    584    52%  2518
 10  Junior 5.0            128MB K6-2 450 MHz     2528  26  -25    750    57%  2476
 11  Hiarcs 7.01           128MB K6-2 450 MHz     2526  37  -37    361    48%  2539
 12  Hiarcs 7.32           128MB K6-2 450 MHz     2525  27  -27    679    56%  2481
 13  SOS                   128MB K6-2 450 MHz     2524  23  -23    925    53%  2501
 14  Rebel Century 3.0     128MB K6-2 450 MHz     2514  31  -31    504    50%  2516
 15  Goliath Light         128MB K6-2 450 MHz     2496  30  -30    546    46%  2527
 16  Crafty 17.07/CB       128MB K6-2 450 MHz     2487  24  -24    857    47%  2505

In order for confidence to rise from 2/3 to 95%, we must use two standard
deviations. Within that error bar, Deep Fritz is:

2650 - (32*2) = 2586 ELO
2650 + (34*2) = 2718 ELO

and Crafty 17.07 (quite an old version) is:

2487 - (24*2) = 2439 ELO
2487 + (24*2) = 2535 ELO

So, considering the error bars (2439, 2535) : (2586, 2718), we *can* say that
Deep Fritz is a little stronger, with pretty good certainty. But (of course)
Crafty has gone through a raft of versions since then. Is the same difference
still true? Yace is similar in strength. If given opening books of equal
quality, I suspect that the best amateur programs (e.g. Yace, Crafty) are very
close to even with the strongest commercial programs.

I think that most people have no clue what the numbers in the SSDF list mean
(and I mean not a *single* one of the numbers), and that's too bad. Not saying
that you don't, of course. But even the ELO figure is widely misunderstood.
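To make the arithmetic concrete, here is a minimal Python sketch (my own
illustration, not anything the SSDF publishes) that widens the published
margins to two standard deviations, assuming -- as argued above -- that the
listed +/- figures are one standard deviation (about 2/3 confidence). The
standard Elo expected-score formula is included to show what the 163-point
gap means in percentage terms; all numbers come from the list above.

# Sketch: two-standard-deviation (about 95%) intervals from the SSDF margins.
# Assumes the published +/- figures are one standard deviation, as argued above.

def interval_95(rating, plus, minus):
    """Widen the one-sigma margins to two sigma (about 95% confidence)."""
    return (rating - 2 * minus, rating + 2 * plus)

def expected_score(r_player, r_opponent):
    """Standard Elo expected score for r_player against r_opponent."""
    return 1.0 / (1.0 + 10.0 ** ((r_opponent - r_player) / 400.0))

deep_fritz = interval_95(2650, plus=34, minus=32)   # -> (2586, 2718)
crafty     = interval_95(2487, plus=24, minus=24)   # -> (2439, 2535)

print("Deep Fritz 95%% interval:   %d-%d" % deep_fritz)
print("Crafty 17.07 95%% interval: %d-%d" % crafty)

# The intervals do not overlap (2535 < 2586), so at this confidence level the
# difference is real -- which is exactly the conclusion reached above.
print("Intervals overlap:", crafty[1] >= deep_fritz[0])

# What a 2650 vs. 2487 gap means in score terms: roughly 72% for Deep Fritz.
print("Expected score, 2650 vs 2487: %.2f" % expected_score(2650, 2487))

The expected-score line is the real content of an ELO difference: the ratings
themselves are only meaningful relative to the pool of opponents, which is
part of what makes the list so easy to misread.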
Does the autoplayer used to play these games still issue a reset between each
move to the engines (which the commercial programs are designed to ignore)?
Personally, I think the learning attribute of some programs is not a strike
against the autoplayer, since the programs will learn in actual use also and
thereby improve. But it raises an interesting question. Have the learning
programs been playing longer than new entries, gaining constantly through
their learning files? If so, is the test accurate?

In other words, I think:

1. The SSDF is definitely the best data we have available for determining
   engine/engine strength estimates.
2. The data ages rapidly as new versions of the programs appear [e.g. Tiger 12
   -- aren't we on Tiger 14 now? Crafty 17.07 -- aren't we on 18.9 now?].
3. The data is valid ONLY for the machines in actual use and the exact
   conditions of the trials. On different architectures, the engines will
   quite likely play very differently. I have observed this effect very
   strongly with Intel compiler builds running on older and on newer machines
   -- up to a factor of 4 difference from what you would expect judging by CPU
   MHz alone (see the sketch below).
4. There may be flaws in the experiment (but I have yet to see a better
   design).
5. The error bands in the strength figures are widely misunderstood.

I don't know that you will agree with me, but I expect you can see what I am
driving at by now.
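On point 3, a commonly quoted rule of thumb in computer chess (a rough
approximation, not anything measured by the SSDF) is that each doubling of
search speed is worth somewhere around 50-100 ELO. Taking 60 per doubling as
an assumed middle value, a quick Python sketch of what a factor-of-4 hardware
difference could mean:

import math

# Rule-of-thumb sketch: approximate ELO change from a raw speed factor,
# assuming 60 ELO per doubling of search speed (estimates vary, roughly
# 50-100; the exact constant here is an assumption).
ELO_PER_DOUBLING = 60

def elo_from_speed(speed_factor, per_doubling=ELO_PER_DOUBLING):
    """Approximate rating change for a given speed multiplier."""
    return per_doubling * math.log2(speed_factor)

print("4x speed:  %+.0f ELO" % elo_from_speed(4.0))   # about +120
print("1/4 speed: %+.0f ELO" % elo_from_speed(0.25))  # about -120

If that rule of thumb is anywhere near right, two "identical" test setups
differing by a factor of 4 in effective speed could sit a hundred rating
points or more apart -- larger than any of the published error bars in the
list above.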