Computer Chess Club Archives


Subject: Why Position Tests are no good *Indicators* for Strength in Games!

Author: Rolf Tueschen

Date: 06:35:15 06/15/04



On June 14, 2004 at 22:12:30, Uri Blass wrote:

>The problem is that chess is a game and not a list of single positions.
>
>I can mention 2 points:
>1) An evaluation can be dependent on the history of the game or the history of
>search of the programs, and test suites do not reveal this information.
>
>2) Imagine a program that in 95% of the cases plays the best move but in 5% of
>the cases plays stupid blunders because of some bug.
>
>This program may do well in test suites that are not too easy but badly in
>games, because the 5% of stupid blunders will usually decide the game against it.
>
>Uri



We can only hope that many believers in position tests (like this CSS WM-Test)
now have an approximate feeling for why so many programmers refuse to praise
such tests, although the test founders presented nothing less than the "best"
positions from World Champion practice (= games).


M. Gurevich
============

We have a serious problem here because there seems to be no hope that the main
author of the "CSS WM Test" could give in, because he has connected the success
of his test with his own life's worth, so to speak. This is, carefully said, what
I could see in dozens of messages in the German CSS forum. The author is
convinced that IF only all the critics would adopt his test and do some testing,
then they would all become true believers in the qualities of his [!] positions.
As I said, the author himself is a very experienced expert chess player, below
IM level of course, coming from the former URS, approaching 60 years, an academic
doctor, who has put a lot of his best years and heart's blood into the analysis
of these positions from Wch history. He is almost the only member in that German
CSS forum who gives current and "profound" analyses of each and every computer
chess event. Perhaps the readers from the USA can only imagine from a distance
how time-consuming this is and, which is even more important, how much experience
you must have to offer your chess analyses to a daily readership. In short,
Dr. Mikhail Gurevich is a real expert in chess; more than that, he shares his
knowledge with all the interested "little" experts and patzers like you and me.
You won't normally find an IM or GM who spends his precious time on such things
to please the spectators. So Mikhail is IMO highly welcome and praised for his
engagement. And this is also the way we should treat our experts, even if they
are not famous GMs. I hope I have made totally clear with what dedication I see
the overall value of such an expert.

I think that now, after this praise, I am also allowed to spell out the exact
faults which are presented in the defense of the "CSS WM-Test".


The Faults in the Defense of WM-Test
====================================

Academic education or not, we all have experience in playing games in debates.
One of the most famous tricks is hiding the differences in the definition of a
topic. After a long and heated debate you can always start again right from the
beginning by revealing what was *really* meant. Basically all our communication,
outside science, is a variation of such misunderstandings, because we all have
our very personal definitions of important things.

Now, Germans in particular normally have a vice, different from Americans: a
vice of wanting to know what something *really* means in all details. This often
leads to ridiculous categorizing, and science is by no means excluded from this.
No, our science in psychiatry, for example, was once state of the art for the
whole world. And still, because we knew substantially so little at that time, all
the definitions were empty claims. The definitions were, so to speak, theoretical
claims for states whose existence possibly could never be proven. Now suppose you
become a victim of such defining (diagnosing!). Of course everything you say is
already part of the expected expression of the diagnosed state itself. Do you see
the devilish circle of no return?

Therefore it is always useful to question the substance of abstract definitions
and claims. If that advice were always followed, we would have much less
confusion in our debates.

Of course a test should allow one to express something as a result. Dr. Gurevich
of course knows all the details of the field, and he quotes from Steinwender &
Friedel. Almost all of the following comes from
http://f23.parsimony.net/forum50826/messages/101106.htm!

The first quote is this:

"Stellungstests sind eine schnelle und grobe Spielstärke-Einschätzung." This
means: *Position test are a fast and approximate estimation of playing
strength*. Mikhail adds: Kein Zitat! - Meaning *Not a quote!* He wants to say
that this is a quote by heart. Perhaps not exact.

If Mikhail G. could only fully understand what the authors, both still
responsible for computer chess in Germany, meant there, he would run down into
his cellar and ROTFL for an hour or longer. Because the little snippet contains
the whole nonsense of the CSS WM-Test. Guess what!! If you want to run this test
you must have 33 hours free! Just to get a *quick* and *approximate* estimate of
the playing strength. Yes, Oblomov is not by chance the symbol for the stoicism
in the Russian soul. All that, if the world-famous author Goncharov is right.
But there is no question that he's right. Please read the little booklet for
yourself. :)

We must now ask how long it would take if we wanted an almost _exact_ estimate
of the playing strength of a chess engine. I leave the answer to the readers
themselves.
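
For readers who want a rough number, here is a back-of-envelope sketch (my own
illustration, not part of the CSS WM-Test or of any CSS method). Assuming
independent games and ignoring draws (draws only shrink the variance), the error
of an Elo estimate falls with the square root of the number of games, so pinning
an engine down to roughly +/-10 Elo takes on the order of several thousand games
at tournament time controls - far more effort than the 33 hours of the position
test, which fits the earlier quote that position tests are only a "fast and
approximate" estimation.

# Back-of-envelope sketch (illustrative only): how precisely is Elo known
# after n independent games?  Assumes a true score near 50% and no draws
# (draws reduce the variance, so real margins are somewhat smaller).
import math

def elo_margin(n_games, score=0.5, z=1.96):
    """Approximate 95% confidence margin (in Elo) for a score over n games."""
    se_score = math.sqrt(score * (1.0 - score) / n_games)   # std. error of the score
    # Elo difference d(s) = 400*log10(s/(1-s)); slope of d at the given score:
    slope = 400.0 / math.log(10) / (score * (1.0 - score))
    return z * slope * se_score

for n in (100, 1000, 5000, 20000):
    print(f"{n:6d} games -> Elo known to about +/- {elo_margin(n):5.1f}")
# Prints roughly +/-68, +/-21.5, +/-9.6 and +/-4.8 Elo respectively.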

So, MG gives the famous definition of Steinwender [BTW a chief member of the
current CSS team for the German CSS forum] & Friedel and still has no problem
with the Elo formula of his own test. Why? Let's see.

He declares: "Die Aussage „die ermittelte CSS-Elo-Zahl eines Programms darf man
als Indikator seiner Spielstärke betrachten ..." gehört mir. Was meinte ich
damit und was hat Lars nicht zitiert und nicht verstanden?"

Meaning: *The statement 'the calculated CSS-Elo-number of a program may be
taken as an indicator of its playing strength...' belongs [sic! Rolf] to me.
What did I mean, and what did Lars [Lars is the one who gave the quote from the
CSS journal where Gurevich had declared everything his test could serve for;
see the introduction at the top of my message - Rolf] not quote and not
understand?*

Let's read further in German before I translate again:

"a) ich behaupte gar nicht, dass CSS-Elo-Zahl GLEICH Engine-Spielstärke ist. Ich
meine aber, dass die Spielstärke eng mit der reinen Engine-Stärke (ihre
Algorithmen, Suchfunktion, Bewertungsfunktion etc) verbunden ist. Es gibt kaum
noch Fachleute, die sagen, dass Manfred Meilers Rangliste „ZUFÄLLIG“ mit den
Spielstärke-Ranglisten übereinstimmt. Und die Erklärung ist einfach: Die
genannten Eigenschaften in Klammern sind ein Kern der Spielstärke, während
Bibliothek, Zeiteinteilung etc. m.E. doch sekundär sind."

In English:

*a) I don't claim that the CSS-Elo-number is identical with the playing
strength of the engine. But I do mean that playing strength is closely connected
with the pure engine strength (its algorithms, search function, eval function
etc). There are hardly any experts left who say that the ranking list of Manfred
Meiler coincides with the ranking lists for playing strength just by _chance_.
And the explanation is easy: the mentioned qualities in brackets are the core of
the playing strength, while opening books, time management etc. are secondary.*



My comment: Here you have the ultimate proof of the fault in Mikhail Gurevich's
thinking process. He thinks that he can avoid the big judgement about
playing-strength Elo from position tests because he argues that his position
test does NOT result in a playing-strength Elo but in an analytical-ability Elo
for engine strength. But in the same paragraph Mikhail claims that the two are
in principle closely related. How Mikhail could _then_ avoid the mentioned
judgement is a true mystery. :)

But we already have a second fault. MG does not see that with such a connecting
bond he cannot avoid the criticism of the calculated Elo details. Because the
criticism of the playing-strength Elo is now also directed against his engine
analytical-strength Elo! This is all very easy to understand.




"b) Lars Bremer hat m.E. sehr geschickt und wohl nicht ehrlich nur ein TEIL des
Textes In der CSS 5/01 zitiert. „Nicht gemerkt“ hat er den Text in den Klammern:
„Die ermittelte CSS-Elo-Zahl eines Programms darf man als Indikator seiner
Spielstärke betrachten, [denn sie charakterisiert das schachliche KÖNNEN…
Gleichzeitig erlaubt diese Elo-Zahl, die Programme nach ihrer ANALYSEFÄHIGKEIT
einzuordnen. … Gute AF eines Programms ist wohl eine VORAUSSETZUNG dafür, dass
es sein Können in Turnieren realisieren kann“]"


In English again:

*b) LB has IMO very cleverly and hardly honestly quoted only a PART of the
text in CSS 5/01. "Not noticed" was the text in brackets:
"The resulting CSS-Elo-number of a program may be taken as an indicator of its
playing strength, {because it characterises the chess abilities... At the same
time this Elo-number allows one to rank the programs by their analytical
ability. ... Good AF = analytical ability of a program is surely a precondition
for it being able to realize its abilities in tournaments"}*


We have now reached a patch of foggy air. By chance, the ranking list of Manfred
Meiler for the analytical-ability Elo numbers of the different programs is
similar to the ranking lists of playing strength. So this is fine for the
authors of the "CSS position test", but the criticism of position tests as such
remains, for example the principal difficulty Uri Blass mentions here in his
message about best-move practice: if only 5% of the moves are not the best, then
the program will simply lose its games; this is part of chess. And 5% non-best
moves does not mean the program only loses 5% of its games; no, it theoretically
loses all its games if it makes 5% bad moves against other computer programs
with a better percentage!!! We conclude that the whole theory is faulty. It is
simply nonsense, and that is also the reason why no programmer can successfully
use the "CSS WM-Test", despite all the support the test gets from the CSS
journal and people.
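
To put a rough number on this 5% point, here is a minimal sketch of the
arithmetic (my own illustration, not taken from Uri's post or from the CSS
discussion), under the simplifying assumptions of a fixed per-move blunder
probability and an opponent that does not blunder, so that a single serious
blunder decides the game:

# Minimal sketch (illustrative assumptions only): per-move blunder rate vs.
# the chance of getting through a whole game without a losing blunder.
blunder_rate = 0.05    # plays a losing move in 5% of positions
moves_per_game = 40    # own moves in a typical game

p_clean_game = (1.0 - blunder_rate) ** moves_per_game
print(f"P(no blunder in a whole game)  = {p_clean_game:.2f}")        # ~0.13
print(f"P(at least one losing blunder) = {1.0 - p_clean_game:.2f}")  # ~0.87
# So an engine that finds the "best" move 95% of the time still loses the
# large majority of its games against an opponent that does not blunder,
# while scoring respectably on a suite of single positions.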



Further in German:

"c) Lars Bremer hat erstaunlicherweise auch nicht gemerkt, dass der genannte
Artikel ein Fazit enthält (S. 14), wo kein mal etwas über Engine-Spielstärke
nach den WM-Test Resultaten gesagt wurde. Dagegen gibt es dort (wieder) die
Aussage über die AF. Vielleicht zitiert sie Lars selbständig im CSS-Forum?"

I don't know if my translation can give you the correct impression of the
somewhat clumsy style of the author in German:

*c) LB has astonishingly _not_ noticed that the mentioned article contains a
summary (p. 14), where not a single word is said about engine playing strength
based on the WM-Test results. On the contrary, there is (again) the statement
about the AF = analytical ability. Perhaps Lars will quote it in the CSS forum
on his own?*


In German:

"d) Lars Bremer hat kaum aufmerksam CSS-Hefte gelesen (?!). Denn dort findet man
u.a. die Äußerungen, wie die WM-Tests Resultate (AF) mit den
Spielstärke-Ranglisten übereinstimmen, [...]."


In English:

*d) LB has hardly read the CSS issues with attention (?!). Because there one
can find, among other things, statements about how the WM-Test results (AF =
analytical ability) coincide with the playing-strength ranking lists.*


This is now redundant. MG thinks that by introducing a new term and then stating
the similarity of the results (his term compared with playing strength), he is
now safe against the criticism of playing-strength Elos. Yesterday I gave some
quotes where Gurevich showed a total lack of understanding of exactly what Blass
has mentioned today. A game of chess is NOT just a sequence of distinct
positions; the positions must be understood in their connection and context. And
just here lies the difficulty for the programmers, as Uri Blass just explained.
It is not enough to know in how many positions of Gurevich's 33-hour-long test a
program was correct. More important is what Stefan Meyer-Kahlen said, namely to
find special positions where the program fails to find the correct solution. In
other words, with easy math: you can have a rate of 98% correctly solved
positions, but a programmer is interested in the 2% incorrectly solved
positions!!! Or, as test theoreticians say: such positions cannot differentiate
properly between the best programs, or between the programs of a certain range.
So? Such a test is meaningless. We can call it nonsense from that perspective.
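
The "differentiation" point can be made concrete with a toy Monte Carlo sketch
(my own illustration with invented numbers, not the actual CSS WM-Test data):
two hypothetical engines face a 100-position suite in which 95 positions are
easy for both and only 5 are genuinely hard, and each position is treated as an
independent trial as a crude stand-in for run-to-run variation (time limits,
hardware, hash effects).

# Toy sketch of why a suite of mostly easy positions cannot separate two
# strong engines (hypothetical numbers, illustrative only).
import random

EASY, HARD = 95, 5
P_EASY = 0.99                  # both engines solve the easy positions almost always
P_HARD_A, P_HARD_B = 0.6, 0.4  # engine A is genuinely better on the hard ones

def run_suite(p_easy, p_hard):
    """Number of positions solved in one pass over the suite."""
    solved = sum(random.random() < p_easy for _ in range(EASY))
    solved += sum(random.random() < p_hard for _ in range(HARD))
    return solved

random.seed(1)
trials = 20000
b_not_worse = sum(run_suite(P_EASY, P_HARD_B) >= run_suite(P_EASY, P_HARD_A)
                  for _ in range(trials))
print(f"B ties or beats A on the suite in {100.0 * b_not_worse / trials:.0f}% of runs")
# Expect a figure around 40%: only the 5 hard positions carry any information,
# so the suite cannot reliably rank the two engines - exactly the
# "differentiation" problem the test theoreticians mean.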

A totally different topic is the fun we have when we analyse these positions
with our engines. But a programmer doesn't want to have fun before the
tournament; he wants to have fun when he has won a tournament!


Excuse the lengthy message.


