Computer Chess Club Archives


Subject: Re: Knee jerk reaction!

Author: Sune Fischer

Date: 11:18:01 09/13/04

On September 13, 2004 at 10:34:55, Robert Hyatt wrote:

>>But Bob, I _want_ to force into positions it doesn't normally play, that is the
>>_idea_.
>
>That is the _flawed_ idea.  Because the _opponent_ is forced into similar
>positions.  Suppose the opponent does worse.  Does that mean this program is
>better?

Correct.

You'd need more than one position to say for sure which engine is better under
those circumstances, though; the usual laws of statistics apply.
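
For what those laws give you in practice, here is a minimal sketch (Python,
with made-up game counts) of how I'd put an error bar on a match score:

  import math

  def match_score_with_error(wins, draws, losses):
      # Score fraction of a match plus its one-sigma standard error.
      n = wins + draws + losses
      score = (wins + 0.5 * draws) / n
      # Sample variance of the per-game result (1, 0.5 or 0).
      var = (wins * (1.0 - score) ** 2
             + draws * (0.5 - score) ** 2
             + losses * (0.0 - score) ** 2) / n
      return score, math.sqrt(var / n)

  score, err = match_score_with_error(wins=220, draws=310, losses=170)
  print("score = %.3f +/- %.3f" % (score, err))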

>Not if you play them normally.

Playing them normally is just testing a different setup.

All in all I don't think there is much information to be extracted from only
one experiment; several sub-experiments with different setups are needed to
resolve the finer details.

>Put two programs into positions their authors didn't intend, and the results
>won't mean much, which was my point...

Define "mean much"?

The result is the result, nothing more and nothing less; it speaks for itself.

Can Ruffian on a single CPU beat Crafty on a dual?
Some might find that answer interesting and others might not; there is no more
to it than that.

The test is not flawed; only the interpretation of the test can be flawed.

>If you think a program should be able to play all positions well, I agree so
>long as "should" is included.  But replace it by "does" and it becomes false.

I'd be surprised if you can quote me saying "does" in that context :)

>>When you use a specially designed very narrow book it might never ever play d4,
>>but lots of people are interested in d4 openings and want to see how the engine
>>does there.
>
>
>How about convincing Korchnoi to not play 1. d4 in an important game?

If I were to purchase either Korchnoi or Kramnik for my private analysis
sessions, I'd demand to see them break out 1.e4!

Imagine if they played like a 2200 with 1.e4; that's important to know if you
are going to use them yourself. :)

>>In analysis the engine cannot pick and choose its own narrow set of test
>>positions, you can kick and scream all you want but it will _have_ to be good on
>>a wide range of very different type positions.
>
>No it doesn't, and that is the flaw in your assumption.  You might _want_ it to
>be good in a wide range of positions, but that won't make it so, for any program
>around.

Actually I think it is your basic assumption that is flawed.

You seem to think that the only way engines are being used is to play games from
beginning to end with everything enabled.

This isn't so, however; they are used for analysis of really weird positions,
with and without books, with and without pondering, at short and long time
controls, etc.

Doing well under various circumstances is important, perhaps not for you but for
all those using Crafty under those conditions.

I too have my favorite setup and I don't test much else, but to each his own.

Say for instance that I test Crafty my way and find that the Crafty engine
rates 2500 Elo (on my private list) with ponder off, book off, etc.

Then I test Crafty with book and Crafty scores 2530.

Then I test Crafty with book and learning, and Crafty scores 2680 (it will
really kill some amateurs with this).

Then I also enable table bases and Crafty scores 2690.

That is how I would do it. You would just run a single big experiment, which
would probably also produce 2690, and then what?

The result is that I know a little more about Crafty's behavior than you do :-)
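
The Elo numbers come out of my own tooling, of course, but the conversion from
a measured score to a rating difference is just the standard logistic formula.
A sketch, with hypothetical scores for the same engine against the same fixed
opposition:

  import math

  def elo_diff(score):
      # Rating difference implied by a score fraction (0 < score < 1).
      return -400.0 * math.log10(1.0 / score - 1.0)

  for setup, s in [("engine only", 0.50),
                   ("with book", 0.54),
                   ("book + learning", 0.73),
                   ("+ tablebases", 0.74)]:
      print("%-16s %+5.0f Elo vs. opposition" % (setup, elo_diff(s)))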

>That is an impossible condition (roughly equal).  What is "equal" depends on the
>program.

No it does not :)
If you flip the positions you have symmetry, and symmetry is equal.
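
That is, play every test position twice with colors reversed, so neither
engine gets the better side of the draw. A sketch (Python; play_game is a
hypothetical helper returning 1, 0.5 or 0 from White's point of view):

  def symmetric_match_score(positions, engine_a, engine_b, play_game):
      # engine_a's overall score fraction over color-reversed game pairs.
      total = 0.0
      for fen in positions:
          total += play_game(engine_a, engine_b, fen)        # A plays White
          total += 1.0 - play_game(engine_b, engine_a, fen)  # A plays Black
      return total / (2 * len(positions))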

>>
>>Because I don't intend to use it in tournaments (only the author is allowed to
>>do that), my purpose is to use it for analysis!
>
>Again, to the man who has a hammer, _everything_ looks like a nail.  Chess
>programs are not particularly good "general solutions" to the chess problem...

That's a theory, now let's go out and measure that and see how true it is.

>>I use games because I don't believe it is possible to generate a representative
>>set of test positions.
>
>If you can't produce a set of positions, then how is it possible to do the same
>by choosing random openings instead???

Test suites are usually tactical, and you have to find positions that have
only one solution; it must also be a solution that cannot be found for the
wrong reasons.

>>By playing from lots of equal but complicated endgame positions you will be able
>>to tell who is the better endgame player.
>
>How, if one is tactically weaker but much stronger in endgames?  You won't be
>reaching many endgames...

Yes, that's why playing games from the beginning won't work if you need a good
engine for endgame analysis :)

I have in my system a rating for endgame performance and a rating for middle
games.

Sometimes you can improve the endgame Elo without it showing against strong
engines; you simply never survive to the endgame with an equal position.

It's also a bit of a waste of time to test endgame knowledge by playing all
the way from the beginning, but that's another matter.

>>>What user is qualified to figure that out?  When the programs are so much
>>>stronger than 99.9% of the humans that are trying to figure this out...
>>>
>>>IE bozos can't really decide which brain surgeon is the best...
>>
>>Here is how this bozon would do it, he would test it, look at the result,
>>perhaps grind some statistics and draw his conclusions.
>>Standard procedure really. :)
>
>So you count 1-0, 0-1 and 1/2-1/2???

Yep, just need a heck of a lot of games :)
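
How many is a heck of a lot? A rough back-of-the-envelope sketch, assuming
independent games and a per-game standard deviation of about 0.4 (an
assumption, typical when draws are common):

  import math

  def games_needed(elo_edge, sigma_game=0.4, z=1.96):
      # Rough game count to resolve a small Elo edge at ~95% confidence.
      # The logistic Elo curve has slope ln(10)/1600 per Elo at a 50% score.
      delta_score = math.log(10) / 1600.0 * elo_edge
      return int(math.ceil((z * sigma_game / delta_score) ** 2))

  for d in (5, 10, 20, 50):
      print("%3d Elo edge: ~%6d games" % (d, games_needed(d)))

A 5 Elo edge needs on the order of 10000 games; 50 Elo only about a hundred.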

>Not always the best system when trying to see which program is better
>positionally...

I find that separating tactical and positional strength is hard to do,
unfortunately.
A lot of "tactics" get found for evaluation reasons sometimes.

>>Why you don't find these experiments interesting I have no idea.
>
>There is a difference between _ME_ doing experiments to figure out what to do to
>make my program better, and an end-user doing such experiments and simply
>drawing conclusions from them.  I tailor experiments to explore specific things.
> I don't just run random tests to see what happens...

If it's user-modified or using a special "personality", that is usually
stated, so I don't see any problem there.

It's just people having a bit of fun.

>>
>>>>My job is easy, I just need one single counter example :)
>>>
>>>No, you have it backward.  You are saying something is OK.  I have given more
>>>than one counter-example of why it is _not_ ok...  You can't give one example of
>>>why it _is_ ok and then conclude "it is ok."
>>
>>I can certainly conclude that "it is ok" in that special case, the case we
>>happen to be talking about in fact.
>
>But if it isn't ok in _all_ circumstances, it is flawed.

I just gave some examples of when it wasn't flawed, so you can't say it's always
flawed.

>>If you are telling me that you cannot conclude anything from that, then yes I
>>will certainly claim full disagreement.
>
>Then we disagree.

> Was the book bad and learning helped?

It was the official book :)

>Was the book good but bad luck with randomness hurt?
> Was the program bad but got good openings?  Was
>the program bad and got bad openings?

Randomness is not a big issue with 10000 games. :)
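
To put a number on that: with 10000 independent games and a per-game standard
deviation of roughly 0.4 (again an assumption, typical when draws are common),
the standard error of the score is about 0.4/sqrt(10000) = 0.004, which is on
the order of +/-3 Elo at one sigma.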

>I don't see how to conclude anything from
>the results.

Then I guess I am just exceptionally gifted :)

>If I look at the games, I would learn far more.

Not necessarily. Doing well or badly in one position can be balanced out in
other positions by other factors.

>>>>To take another example, how are you going to use endgame tables on the
>>>>PocketPC?
>>>>http://www.pocketgear.com/software_detail.asp?id=15142
>>>
>>>In 5 years the answer will be obvious.. :)
>>
>>Nevertheless there is currently a reason to test without endgame tables.
>>I guess I can rest my case here :)
>
>Care to guess how many pocket-users there are compared to normal users?

No clue, a million?

Of course most use cell phones, which are even smaller! :)

>>I can think of many interesting experiments, but this experiment I would have to
>>call crap.
>
>Ponder = on, one cpu?  Hardly crap at all.  Works like a charm.

...if you like mile-long error bars and systematic errors :)

>>Anything to win.
>>If I know people will be testing with ponder on, on a single-CPU machine, then
>>there is every reason to annoy my opponent by searching at high priority.
>
>I'll play you a match on my linux box.  Feel free to start a high priority
>thread, but you won't be running as root and might find it difficult...

Linux, what is that?
;)

>>
>>>>How do you measure progress without reproducibility?
>>>
>>>If trying to find out which program is better, A or B, reproducibility is _not_
>>>an issue.
>>
>>I said _progress_, not who is better.
>>
>>> Do you _really_ think that if you play me as a human, that I am going
>>>to play the same moves every time you do?  Yet even in spite of that lack of
>>>reproducibility, you can't tell whether you are better than I am?
>>>Reproducibility is great for debugging.  Not necessary for strength
>>>measurements.
>>
>>First of all I don't know why you keep comparing with humans; just because
>>reproducibility is impossible for humans doesn't mean it has to be for machines.
>
>hmmm...  that is _my_ goal, in fact...

Interesting in theory, annoying in practice. :)

>>>Then he can't learn anything at all as there is no reproducibility in Crafty if
>>>the book is used.
>>
>>Right right right, now you are getting it. :)
>>
>>Disable the book!
>
>Then he _still_ can't learn anything because now we reach positions where Crafty
>would not normally reach.

Yes great, let's explore how Crafty does here.

>  Or if you start from move 1 with no book, you just
>get the same game over and over which might not be so easy to understand...

That would be idiotic to do more than twice :)

>>Now say, as it happens, you don't need an engine for playing full matches but
>>only for analysis; then you'll be wanting the engine which has the fewest weak
>>points, agreed?
>
>If you could find such a thing, yes.  I just don't believe the above
>circumstance exists in an easy to find way...

I'm not saying it is easy at all; it's always hard to measure things when the
only way is through statistically significant data.

Exactly how it should be done is not clear either, but for every experiment a
little more knowledge is added to the tree.
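
For instance, one concrete number you can grind out of the raw counts is the
likelihood that A is genuinely stronger than B. A sketch (Python; draws carry
no information in this particular formula):

  import math

  def likelihood_of_superiority(wins, losses):
      # P(A is stronger than B), estimated from decisive games only.
      return 0.5 * (1.0 + math.erf((wins - losses)
                                   / math.sqrt(2.0 * (wins + losses))))

  print("LOS = %.1f%%" % (100 * likelihood_of_superiority(wins=55, losses=45)))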

-S.


