Computer Chess Club Archives


Subject: Re: Less Mud, More Light (see also comp-comp)

Author: Stephen A. Boak

Date: 16:04:29 01/09/00



On January 09, 2000 at 08:14:28, Graham Laight wrote:

>I fear that this thread is becoming uninteresting to most people, but let's give
>it one more shot...

Sometimes the hardest thing to do is read other people's minds, let alone 'most
of them'.  :)

I'm sure you are right--if we go in circles, avoiding collisions, repeating the
same things in the same way.  Just like some boring discussions about the
relative merits of two strong programs, people would rather see the programs
fight in a real match, going toe to toe, pitting the strong points of each
against the weak points of the other, and thereby test their relative merits.

Hmmm, doesn't that sound like the Rebel challenge matches?

>
>On January 09, 2000 at 05:48:57, Stephen A. Boak wrote:
>
>>>On January 08, 2000, Graham Laight wrote:
>>
>><much snipped from all discussions below>
>>
>>>I can think of one similarity - they're both a group of players trying to win a
>>>game of chess under the same rules.
>>
>>Hey, one SIMILARITY--pretty good!  Did it take long to think of it?  :)
>>
>>Let me help you with another similarity that you obviously overlooked--both
>>groups of opponents tend to favor Queens over Pawns for material purposes!
>>Another astounding similarity, if you just think about it!  I didn't really
>>think about it--I just blurted it out because the thought hit me.  Someone else
>>can work out the profundity in the observation.
>>
>>I have additional, inane similarities if you need to draw on more ammunition.
>>For a lost cause, however, I will not waste my time.
>>
>>On the other hand, I can think of several (meaningful) DISSIMILARITIES--
>>
>>1) One group of opponent players (i.e. opponents of the comps) is ALL COMPUTERS,
>>but the other group of opponents is ALL HUMANS.
>
>In terms of playing chess, you might as well have divided the pools into brown
>eyed people and blue eyed people. They are still entities of one type or another
>who are trying to win a game of chess.
>
>>2) NONE of the opponents in one group are opponents in the OTHER group.
>>
>>3) The ratings of the ALL COMPUTER opponents in the comp-comp pool are all
>>derived from relatively recent 100% comp-comp prior play (referring to SSDF
>>ratings), not from recent comp-human prior play.  On the other hand, the ratings
>>of the ALL HUMAN opponents in the comp-human pool would be all derived from
>>relatively recent human-human prior play (FIDE ratings).
>>
>>4) The SSDF ratings gained by the ALL COMPUTER opponents are not FIDE ratings.
>>The FIDE ratings gained by the ALL HUMAN opponents are not SSDF ratings.
>
>Points 2-4 could apply equally to brown/blue eye classification of players.
>
>>5)  The historical connection between SSDF and FIDE ratings is absolutely zero
>>(SSDF ratings were seeded many, many years ago with Swedish ratings, not FIDE
>>ratings, of relatively low strength players) or so remote as to be of no weight
>>today after many years of only comp-comp play.
>
>The FIDE web site and FIDE members have disputed this. If you wish to use this as
>a working assumption, it is YOU who must provide evidence to support it.

I want evidence and logical reasoning for *your* postings, not broad references
to unspecified persons and arguments not here disclosed or specifically
identified.  Otherwise you are merely indicating 'others say' or 'they said'.
Who are those others?  What did they say that *you* agree with?

Where are the FIDE website links you call upon to defend your postings, and
which words in those links do you rely on?

>
>>6)  A) In statistical sampling, the results of sampling are meaningful (useful
>>to draw conclusions with high degree of confidence) only within the normal
>>(central) range of the sampled items.  The results (conclusions) of sampling
>>become less meaningful as applied to items closer and closer to the boundaries
>>or range limits of the sampled items.  It is mathematically improper (illogical,
>>without foundation) to draw conclusions about characteristics of items far
>>outside the sampled range.  Also, when something is measured against a scale,
>>the accuracy of the measurements is meaningful only within the normal (central)
>>range of the scale.
>
>Then an Elo rating of, say, 1500 can never be compared with an Elo rating of
>2800, because they are not in the same range.

When you use the reductio ad absurdum technique, be sure your premises are
carefully considered and that your logic is working, otherwise the technique
fails to convince the critics.  I never said that, nor do I agree that that is
entirely a logical conclusion stemming from what I've said, as worded.

I in fact believe that many ratings far apart may be properly compared.  I do
grant that 2800 is hard to compare with a high degree of confidence with *any*
substantially lower rating, since it is at the utmost extreme of typical ELO
scales such as the FIDE one, in which only one person, Kasparov, has attained
such a high rating.  But my point is broader than one using that specific high
rating example.  See the following.

In a large ELO system (Pool), there are many, many players at most points in the
huge rating range that encompasses all the players.  Of course an exception
exists for the very cream of the crop players (perhaps over 2700 rating, for a
rough example).  The Pool players mix many, many times with each other,
continuously over the years, at all adjacent 'band' areas in the range, causing
significant overlap between players in one rating 'band' (class) and another.
There is no large disjointedness between class 'bands', since they are free to
play back and forth in open events.  Even middling players can occasionally play
very strong Masters in the first round of large Open tournaments, so mixing
across greatly separated rating 'bands' also helps preserve relative uniformity
of the rating scale across all 'bands'.   The entire Pool (excepting the very
extremes) provides one huge rating scale that is calibrated smoothly from low to
high.  That scale may not be perfect (no scale ever is), however it is highly
consistent and mathematically useful (with a high confidence level) for
comparing playing strengths across virtually the entire range.  That scale is
generally uniform and highly useful for comparing STRENGTHS WITHIN THE SPECIFIC
POOL.
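
To put a rough number on why the extremes of a scale are less trustworthy than the middle, here is a minimal Python sketch. It assumes the common logistic form of the Elo expectation (rating bodies differ in the exact curve they use), and the function names and the 20-game match length are hypothetical choices made only for illustration. It inverts the curve to get the rating gap implied by a score, then asks how far that gap moves if a single half point changes hands.

```python
import math


def implied_gap(score_fraction):
    """Rating gap implied by a score fraction under the logistic Elo curve.

    Assumption: expected score E = 1 / (1 + 10 ** (-gap / 400)).
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("gap is undefined for 0% or 100% scores")
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


def half_point_swing(points, games=20):
    """Change in implied gap if one more half point had been scored."""
    return implied_gap((points + 0.5) / games) - implied_gap(points / games)


even_swing = half_point_swing(10.0)      # match score near 50%
lopsided_swing = half_point_swing(19.0)  # match score near 95%
print(f"half-point swing near 50%: {even_swing:.0f} Elo")
print(f"half-point swing near 95%: {lopsided_swing:.0f} Elo")
```

Under these assumptions one half point is worth roughly 17 Elo between closely matched players but roughly 125 Elo when the score is already lopsided, so games between widely separated players pin down the gap far less tightly than games between neighbours in adjacent 'bands'.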

I grant that rating scales are valid, as above described, for strength
comparisons *WITHIN* each specific ELO system, whether SSDF ELO COMPUTER POOL,
or HUMAN ELO POOLS such as USCF, FIDE, etc.

NOTE, however, that the SSDF mixing across 'bands' is not as extensive as in, for
example, the USCF or FIDE systems, since every few years SSDF largely quits
testing latest generation programs against hardware/software generations that
are several years old.  Within USCF and FIDE, that mixing is continuous, year
after year.  There is therefore 'some' mixing among bands in SSDF, adequate
(subject to varying *confidence levels* for rating comparisons) within that
system to produce reasonable uniformity across their entire rating scale.

Cross-pool comparisons, however, require cross-pool calibration, and the
meaningfulness of such (i.e. confidence levels) is determined solely by the
degree and kind of cross-pool calibration.
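
The need for cross-pool calibration can be sketched in a few lines of Python. The sketch assumes the logistic Elo expectation, and the three program names and their ratings are invented purely for illustration. It shows that sliding every rating in a closed pool by the same amount leaves every within-pool prediction unchanged, so results from inside one pool cannot, by themselves, fix where that pool sits relative to another.

```python
def expected_score(rating_a, rating_b):
    """Logistic Elo expectation for A against B (an assumed model)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def pairwise_expectations(ratings):
    """Expected score for every ordered pair of distinct players."""
    names = sorted(ratings)
    return {
        (a, b): expected_score(ratings[a], ratings[b])
        for a in names
        for b in names
        if a != b
    }


# Hypothetical ratings for a closed pool of three programs.
pool = {"prog_x": 2500.0, "prog_y": 2450.0, "prog_z": 2380.0}

# The same pool with every rating slid down by 150 points.
shifted = {name: rating - 150.0 for name, rating in pool.items()}

original = pairwise_expectations(pool)
moved = pairwise_expectations(shifted)
max_difference = max(abs(original[pair] - moved[pair]) for pair in original)
print(f"largest change in any within-pool expectation: {max_difference:.2e}")
```

Because only rating differences enter the formula, the printed change is zero: the absolute level of a pool is set by its seeding and by whatever games cross the pool boundary, not by play inside it.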

>
>>    B) Even if one assumes the initial Swedish seeding ratings were highly
>>comparable to FIDE ratings, there is a major problem, due to point 6A, above.
>>If relatively weaker Swedish players (i.e. not FIDE IM and GM strength, in
>>general) were used as the 'scale' against which SSDF seeded or 'initially
>>measured' comps to establish their relative strength versus humans, the seeding
>>not only was so long ago as to be of no weight in today's SSDF comp-comp
>
>Then the FIDE ratings of today are not valid compared to the FIDE ratings of 15
>years ago, because it has been so long since they were "seeded".
>
>>ratings, but was also established against a 'scale' of limited range and
>>significantly lower human rating average than the alleged ratings of modern
>>programs which that seeding allegedly somehow helps validate today.  The problem
>>is that it is a violation of statistical logic to use a low level rating 'scale'
>>or range (i.e. the relatively low strength Swedish players) to today assert a
>>FIDE-equivalent comp (vs human) high level rating, which would lie virtually
>>(probably 100%) OUTSIDE THE LIMITS of the initial 'scale'.
>>
>>>>previously, Steve Boak wrote:
>>>>Analogy: Two human runners, ranked in track sports--one (A) very good at long
>>>>distance events but very poor at sprint events; the other (B) of medium ability
>>>>in either type of event.  If they both enter a long distance event, A is likely
>>>>to do better than B.  If they both enter sprint events, B is likely to do better
>>>>than A.  Now A has not changed, nor has B changed--same runners in both event
>>>>Pools, each ranked correctly in relative ability in both types of events.  Yet
>>>>their rankings switch places in the different events.  There is no failure of
>>>>the ranking system.
>>>>
>>>>Why?  The competitors entered two events whose compositions are vastly different
>>>>in general--sprint event contains mostly sprint specialists; distance event
>>>>contains mostly distance specialists.
>>
>>>But in the case of chess, they're both a group of players trying to win a game
>>>of chess under the same rules. Your analogy doesn't do much for me, I'm afraid.
>>
>>Do all blonde-haired opponent chess players have the same rating?  Do all
>>black-haired opponent chess players have the same rating?  If we held two large
>
>Relevance?

You know the relevance of this analogy, as indicated by your remarks below.
Substitute comp-comp Pool and comp-human Pool for blonde-haired player Pool and
black-haired player Pool.  Apply the same observations you have made below.

>
>>comp-human Pool events, using the same comp players in each Pool but only human
>>opponents restricted 100% to a single Pool by hair color, would the comps all be
>>rated and ranked the same after each Pool concluded their many games?  Yet by
>
>Depends on whether the human pools' ratings matched each other. This in turn
>depends on what steps had been taken to match the ranges.

Precisely.  You *do* get the point.

>
>>your remark, since the opponents of the comps are all a group of players who
>>want to play and win at chess, you would expect same relative ratings and
>>rankings for all comp entrants in the two Pool events?
>>
>>Hmmm.  By reductio ad absurdum (carrying/reducing this theme to its extreme),
>>this leads to the following--If two distinct opponent groups (Pools) are chess
>>players, and if all players want to win (regardless of group), then comp chess
>>results (mean individual ratings and relative comp ratings) for comp players in
>>both groups will be the same between both groups (Pools).  As ugly a syllogism
>>as I'd ever want to stand behind!
>
>It would be true if sufficient steps had been taken to match the rating scales
>of the 2 groups.

YOU ACKNOWLEDGE ONE BASIC POINT, THE KEY ONE IN THE OVERALL DEBATE, THANK YOU!
Now a proper discussion can explore the true points of contention.

>
>>>Given that the SSDF web site states it, and that SSDF members support the notion
>>>in this forum, the burden of proof should be on your side of the debate, not
>>>mine.
>>
>>Ok, you have opinions, but decline to produce convincing or original evidence of
>>their truth.  It is your right to avoid substantiating your claims.  Don't
>>appear shocked or dismayed if others don't adopt your opinions to the exclusion
>>of their own without further discussion on the merits of your *and their* cases.
>> Don't appear shocked or dismayed if others are not goaded by character
>>assassinations made in public accompanied by wild claims of bias without substance
>>(evidence, not mere speculation).
>
>You don't need 100% proof of your case to win an intellectual debate - you
>merely have to demonstrate that the weight of evidence

But if you are talking about Kangaroos (comp-human relative FIDE ratings), you
can't furnish evidence of Turtles (comp-comp SSDF ratings) without showing some
logical connection.

Let's see the evidence.

Let's examine the weight of the evidence, as you have asked.

What else have we been trying to do for some time?

>is heavier on your side
>of the scale than it is on theirs. This is the limit of what I've been saying.
>
>So - Tiger beats Century 12-8. OK - it's at fast time speeds. OK - this isn't a
>massive statistical sample. I'm sure there are other flaws as well.
>
>However, the implication is

My next few comments are not about Tiger or Century, but about reasoning, logic
and evidence only.

Implication is opinion, not fact.  There is perhaps some reason to support it
but it is not yet proven.

>that Tiger is a stronger program than Century. It
>may be that if you ran another 20 games, Century would win. However, on the
>basis of the evidence we have, relatively weak though it is,

In our debate, my position has not been that your conclusion is wrong, but that
it is wrongly derived or inadequately justified.  I have emphasized over and
over that it is improper to boost it as 'proven' or 'justified logically'.  I
have emphasized that it is drawn from illogical implications without reason, or
on occasion from logical implications without adequate evidence to draw a
conclusion firmer than a mere opinion.
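
For a rough sense of how much weight a 12-8 score can carry, here is a minimal Python sketch. It rests on loud simplifying assumptions that are not in the match report: the 12-8 result is treated as 12 wins and 8 losses with no draws, the games are treated as independent, and the function name is hypothetical. It computes how often a program of exactly equal strength would score 12 or more out of 20.

```python
from math import comb


def tail_probability(wins, games, p_win=0.5):
    """P(at least `wins` wins in `games` independent games)."""
    return sum(
        comb(games, k) * p_win ** k * (1.0 - p_win) ** (games - k)
        for k in range(wins, games + 1)
    )


# Assumption: 12-8 means 12 decisive wins and 8 losses, no draws.
p_value = tail_probability(12, 20)
print(f"P(12 or more wins out of 20 | equal strength) = {p_value:.3f}")
```

Under those assumptions the figure is about 0.25, i.e. an equally strong program would do at least this well in roughly one match in four. That is consistent with calling the result a hint rather than a demonstration.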

NOTE--I also do not have all the evidence I want for enough programs, but I do
point to the Rebel matches for hard evidence regarding that program.  This
evidence I happily disclose, weigh, discuss, and rely on to draw my own
conclusion about comp-human relative FIDE ratings at tournament time controls
(40/2).  Do you have any evidence, on topic (same time controls, etc), that
approaches that in weight?  If so, I haven't seen it.  All you have to do is
show it.  Leave the weighing to the readers.

>it is just as
>likely that in another 20 game match, Century would do even worse than 12-8 than
>it is that it would do better (it could, of course, achieve the same score).
>
>So it is OK to say that the result implies that Tiger is stronger than Century.
>
>From this, one could also speculate that, in a tournament against humans, Tiger
>would better represent the programs than Century would.
>
>That's it. I might have worded my original post in a way that makes more impact
>on the reader than a simple statement of my position would have. As a keen
>participant in debating competitions, it's a good habit to have.
>
>>Ok, others have opinions and have published their reasoning (however valid)
>>regarding their opinions, and you wish to ride on their bandwagon and rest your
>>case on their 'proof'.  Fair enough.  You have a right to agree with anybody's
>>ideas.
>>
>>My above points fulfill my burden of proof.  I don't know what else to say.  To
>
>I don't agree. As far as I can see, you have made a case that there are
>potential difficulties in matching the rating scales between 2 different pools
>of players.

I did make the case, didn't I?  :)

Maybe even a bit stronger than 'potential difficulties', wouldn't you say?

>In my book, this does not constitute "fulfilling the burden of
>proof" that the top computers are over rated.

My first claim ( <===emphasis ) is that *you* have NOT proven that *your* claims
stated with such gusto (bluster?) respecting 'fact and finality' are indeed
factual or true.  I have demonstrated the basis for my opinion sufficiently for
my own purposes (and those who may agree with me).

I have pointed out how you, when criticized, grasp at straws to maintain your
skimpy position.  I have examined the methods employed by you in your rhetoric
which do not address the merits of your claims or the claims of your detractors
(i.e. you evade the points, you change the topic to evade the points, you use ad
hominem arguments in lieu of discussing logic and evidence).

Even now, recently, you have shifted your position from weak attempts to justify
your conclusions--to calling for your opponent to carry the burden of proof,
rather than to assume the intellectually honest duty of defending your own
declarations.  I am not saying you are in any way personally dishonest, but that
you should accept the moral responsibility for your actions, for your postings,
even when criticized--defend them, recant them, or say that you really don't
know; all are proper ways to accept responsibility for what you have said.

To some degree, in some postings, you acknowledge that some of your points are
mere opinion, or that some of my points or your detractors' points are valid.
Thank you, that is a good beginning at arriving at a better joint understanding
of the subject and its truth by meaningful discussion (agreeing and disagreeing,
and indicating why) and trading of ideas.

>
>>my knowledge I have addressed every major point both you and those whose
>>opinions you agree with have raised in an attempt to support your position that
>>comp-comp ratings (SSDF) have a bearing on comp-human (FIDE) ratings for comp
>>programs.
>>
>>If you wish to debate or discuss further, it would be my pleasure.  Please,
>>however, leave out the personal when waxing philosophical, as will I, as much as
>>possible.  I promise to tone it down.  And please address *others' points*, not
>>just repeat your own or those of others you agree with--we have heard them all
>>many times before, ad nauseam.
>>
>>I don't expect to win you over.  I don't expect to 'win' any debate.  That is
>>not the real enjoyment of this forum--winning an argument.  I do believe it
>>would be loads of fun to discuss new points, analyze new evidence (example the
>>Rebel match results), and explore the wonderful world of computer chess!
>>Yes--that includes the rating controversy!  :)
>>
>>Take care, and once again, here's to a year of interesting postings--cheers!
>>
>>--Steve
>>
>>
>>>>>>You are very good at taunting.  Ever think of being an attorney?  You would be
>>>>>>ecstatic at cross-examining a hostile witness when the judge gives you a rather
>>>>>>free hand.
>>>>>
>>>>>May I return the favour
>>>>
>>>>Yes, by all means (well maybe *not* by ALL means!).
>>>>
>>>>and offer you some career advice as well, Stephen? You'd
>>>>>make a good comedy script writer.
>>>>
>>>>I should think so.  I get enough practice reading and writing them here.  :)
>>>
>>>This is another notion that requires evidence and proof. If you wrote comedy
>>>scripts, would the listeners tune in week after week, or would they switch off
>>>after the 1st episode?
>>
>>Point 1--true.
>
>Not to worry - all great comedians have "died a death" early in their careers!
>One just has to persist, and keep improving (like the chess programs! :)  ).
>
>-g
>
>>Question 1--now you are getting the point!
>>
>>[I myself am bored to tears with banal and repetitive postings that add little
>>to the body already published.  NOTE--This is not resentment directed at new
>>posters, nor at those relatively new who haven't been through the same
>>discussions of yore.  Nor is it a lack of respectful tolerance for continual
>>polishing of old topics and ideas, as we learn to communicate ideas in better
>>fashion, or interpret them anew in light of new thoughts or evidence.]
>>
>>--Steve Boak
>>
>>>-g
>>
>>>>--Steve
>>>>>>--Steve
>>>>>-g



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.