Computer Chess Club Archives



Subject: Re: Parsing enormous.pgn

Author: Steven Edwards

Date: 06:16:29 04/08/05



On April 08, 2005 at 00:23:15, Tor Alexander Lattimore wrote:

>First, is it alright to use enormous.pgn for a book?

Sure.  But I would be careful with the MinPlayCount parameter as there are a lot
of duplications and a lot of questionable games.

>Secondly, i've been trying
>to parse it recently and my program seems to be doing fine until about 300,000
>games where it just returns EOF. I've tried opening and reading from other large
>files and get the same problem. Initially I tried using C++'s <iostream>
>library, but when that didn't work I tried standard C fopen() and fgetc() with
>no more success. The file is 900 MB, so shouldn't be a problem where windows
>does strange stuff with 2GB or > files.

When I first tried parsing a copy of enormous.pgn a year or so ago, I
encountered a number of difficulties.  Some of them I remember were:

1. There were more than a few characters outside the range
[0x0a,0x0d,0x20..0x7e] in the data, and not all of these were inside character
literals.  Some appeared between games.

2. There were some PGN tags I had never heard of, and so I had to adjust my
parser/compiler to handle these.

3. Be careful: some tag names, like "EventDate", could trigger a false "Event"
match if the parser compares only the start of the name.

4. I seem to recall that some of the recursive annotative variations were bogus
(i.e., syntactically incorrect).
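
Items 1 and 3 above can be checked mechanically before the real parse. What
follows is only a rough sketch in Python, not the code I used: the names
find_out_of_range, scan_file, parse_tag and is_tag are made up for
illustration, and the tag-line pattern assumes one tag pair per line, which a
messy file may not honor. The file is opened in binary mode so that every byte
reaches the scanner exactly as stored.

```python
import re

# Bytes the list above treats as in range: LF, CR, and printable ASCII 0x20..0x7e.
ALLOWED = frozenset([0x0A, 0x0D]) | frozenset(range(0x20, 0x7F))


def find_out_of_range(data: bytes):
    """Return (offset, byte value) pairs for every byte outside ALLOWED."""
    return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]


def scan_file(path, chunk_size=1 << 20):
    """Yield (absolute offset, byte value) for out-of-range bytes in a file."""
    base = 0
    # Binary mode: no newline or end-of-file translation is applied.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for i, b in find_out_of_range(chunk):
                yield base + i, b
            base += len(chunk)


# Assumed shape of a tag pair line: [Name "Value"]
TAG_LINE = re.compile(rb'^\[([A-Za-z0-9_]+)\s+"(.*)"\]\s*$')


def parse_tag(line: bytes):
    """Return (name, value) for a tag pair line, or None if it is not one."""
    m = TAG_LINE.match(line)
    if m is None:
        return None
    # Values may hold non-ASCII bytes; latin-1 decodes any byte without error.
    return m.group(1).decode("ascii"), m.group(2).decode("latin-1")


def is_tag(line: bytes, wanted: str) -> bool:
    """True only when the whole tag name equals `wanted`, not just a prefix."""
    parsed = parse_tag(line)
    return parsed is not None and parsed[0] == wanted
```

Comparing the complete tag name is what keeps "EventDate" from being taken for
"Event". Something like `for offset, value in scan_file("enormous.pgn"):
print(offset, hex(value))` would list the offending bytes and their positions.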

After cleaning up and de-duping my copy of enormous.pgn, I have the file e.pgn
with 1400895 games:

[cynthia:~/Arena/Symbolic/PGN] sje% wc e.pgn
 22724727 187736744 802942905 e.pgn

I could upload this to an ftp site if one is available.




Last modified: Thu, 15 Apr 21 08:11:13 -0700
