Computer Chess Club Archives


Subject: Re: Found Part of the Problem

Author: David H. McClain

Date: 08:44:54 03/05/05


On March 05, 2005 at 01:43:29, Dann Corbit wrote:

>On March 04, 2005 at 22:58:40, David H. McClain wrote:
>
>>On March 04, 2005 at 15:51:55, Dann Corbit wrote:
>>
>>>Here is what I do:
>>>1.  I run the cleaner in SCID, which removes lots of duplicates.
>>>2.  I run pgn-extract by Barnes with -dnul to get rid of more duplicates.  It
>>>almost always finds some.
>>>3.  I run ChessAssistant's duplicate finder against the data and keep
>>>"Essential" while removing "Discardable" from the set.
>>>
>>>I run one more cycle of the three steps above.
>>>
>>>After that, I do not find more duplicates.
>>>
>>>I expect that ChessBase will be similar to ChessAssistant.
>>>
>>>Finding duplicates is a very difficult thing to do, when you think about it.
>>>
>>>Two different players could play exactly the same moves -- especially in a short
>>>game.
>>>
>>>Bobby Fischer's name might be written:
>>>Bobby Fischer
>>>Robert Fischer
>>>Robert J. Fischer
>>>B. Fischer
>>>Fischer
>>>R. J. Fischer
>>>etc.
>>>
>>>To complicate things, chess programmers sometimes name their creations after
>>>famous chess players.
>>>
>>>It is also inevitable that some of the duplicates thrown away will not really
>>>be duplicates.
>>
>>Dann,
>>
>>Thank you.  I guess I'm trying to split hairs with this and catch or save every
>>last game.  I was referring only to machine games, so I guess the possibility of
>>incorrect names is greatly reduced.  I should have mentioned that.  For
>>creating and fine-tuning an opening book, I suppose a few lost games or
>>a few duplicates won't make much difference, as long as the games keep their
>>integrity in a database that is small (~100,000 games) by today's standards.
>>Since a much more experienced person than myself reports similar difficulties,
>>perhaps I should relax a bit!  DHM
>
>You will find lots and lots of name errors in machine-generated games, even by
>the most careful contest operators.

Dann, William,

This may be old news to the more experienced, but while working with some of
your suggestions, I ran a couple more tests.

It appears some games do not show as duplicates because they were saved and
posted in the old CBA format while the same games were also saved in the newer
CBH format.  Many of these games ARE duplicates, but the game length differs by
1 between the two formats (the CBA copy is one shorter).  Consequently, if you
search for twin games and require "same length," you won't find many of these
duplicates.  Once all the games are saved into CBH format, the duplicates become
clearer.  DHM
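
For anyone who would rather check this outside ChessBase, here is a minimal
Python sketch of a twin search that tolerates a length difference of 1.  It is
only an illustration, not what ChessBase does internally.  The game layout
(white, black, list of moves), the function names is_twin and find_twins, and
the assumption that the shorter copy is simply the longer one with its last
move missing are all mine.

```python
from typing import List, Sequence, Tuple

# Hypothetical in-memory layout for a game: (white, black, moves).
Game = Tuple[str, str, Sequence[str]]


def is_twin(a: Game, b: Game, max_len_diff: int = 1) -> bool:
    """Return True if two games look like the same game.

    Assumption: player names match exactly, and the shorter move list is a
    prefix of the longer one, differing by at most max_len_diff moves.
    """
    if (a[0], a[1]) != (b[0], b[1]):
        return False
    moves_a, moves_b = list(a[2]), list(b[2])
    if abs(len(moves_a) - len(moves_b)) > max_len_diff:
        return False
    shorter = min(len(moves_a), len(moves_b))
    return moves_a[:shorter] == moves_b[:shorter]


def find_twins(games: List[Game], max_len_diff: int = 1) -> List[Tuple[int, int]]:
    """Compare every pair of games and return the index pairs that are twins.

    This is quadratic in the number of games, so for a large database you
    would want to group games by player names first.
    """
    pairs = []
    for i in range(len(games)):
        for j in range(i + 1, len(games)):
            if is_twin(games[i], games[j], max_len_diff):
                pairs.append((i, j))
    return pairs


# Made-up sample data: the second game is the first with its last move missing.
sample_games = [
    ("Engine A", "Engine B", ["e4", "e5", "Nf3", "Nc6", "Bb5"]),
    ("Engine A", "Engine B", ["e4", "e5", "Nf3", "Nc6"]),
    ("Engine A", "Engine B", ["d4", "d5", "c4"]),
]

print(find_twins(sample_games))                  # tolerant search
print(find_twins(sample_games, max_len_diff=0))  # strict "same length" search
```

With max_len_diff=0 the search behaves like the "same length" option and misses
the first pair; with the default of 1 it reports them as twins.  Exact name
matching is a simplification, given the name errors mentioned earlier in the
thread.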





Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.