Computer Chess Club Archives


Subject: Re: To anyone who has ever automatically created an opening book

Author: John Merlino

Date: 17:25:52 11/09/01


On November 09, 2001 at 15:44:07, Scott Gasch wrote:

>On November 09, 2001 at 11:54:42, John Merlino wrote:
>
>Hi John.  I don't have hard numbers available without re-creating the book,
>which I will do tonight when I am home from work... and report more details if
>you want them.  But here is my recollection...
>
>>1) In general, how long does it take to create a book (i.e. size to time ratio).
>
>I create Monsoon's book from a PGN file with over 500,000 games in it.  (I got
>this file from The Week In Chess (TWIC) archives, pitt.edu, and chessclub.com's
>GM library, and other Internet sources.)  I have built the book on two machines
>-- a 450MHz/576MB FreeBSD machine and a 1.2GHz/1GB Win2k machine.  It takes
>about 10 minutes on the FreeBSD box.  It takes about 3 minutes on the NT box.
>The final book binary is about 90MB in size, but this is "processed"; see below.
>
>My book building process is to allocate a huge buffer in memory and lock it in
>so that it does not swap, then use this buffer as a hash table.  The PGN
>parser reads a game and "plays" each move _up to move 25_ (ply 50).  Each
>position signature + next move is used as a hash key to create an entry in this
>huge memory buffer (or, if the entry is already there and we've seen this
>position+move before, to update that entry's win/draw/loss ratio).  I use linear
>probing in the event of a collision in the hash.
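>
>To make that concrete, here is a rough C sketch of the idea.  This is
>not my actual code -- the names, types and sizes are made up for
>illustration:
>
>    #include <stdlib.h>
>
>    /* One slot in the in-memory book buffer.  The "signature" stands
>       in for a 64-bit Zobrist-style hash of the position. */
>    typedef struct {
>        unsigned long long sig;  /* position signature, 0 == empty  */
>        unsigned short move;     /* the move played from here       */
>        unsigned int wins, draws, losses;
>    } BookSlot;
>
>    static BookSlot *table;     /* the huge locked buffer           */
>    static size_t table_size;   /* slot count, sized to fill RAM    */
>    static size_t table_used;   /* occupied slots, for the 90% test */
>
>    void book_init(size_t nslots)
>    {
>        table_size = nslots;
>        table = calloc(table_size, sizeof(BookSlot));
>    }
>
>    /* Add (or update) a position+move pair; result is +1/0/-1 for
>       a win/draw/loss from the mover's point of view. */
>    void book_add(unsigned long long sig, unsigned short move,
>                  int result)
>    {
>        size_t i = (size_t)(sig % table_size);
>
>        while (table[i].sig != 0 &&
>               !(table[i].sig == sig && table[i].move == move))
>            i = (i + 1) % table_size;      /* linear probing */
>
>        if (table[i].sig == 0) {           /* first time seen */
>            table[i].sig  = sig;
>            table[i].move = move;
>            table_used++;
>        }
>        if (result > 0)      table[i].wins++;
>        else if (result < 0) table[i].losses++;
>        else                 table[i].draws++;
>    }
>
>(A real signature could legitimately be 0, which collides with the
>empty-slot sentinel here; in practice you reserve one value or keep a
>separate occupancy bit.)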
>
>When adding a new entry to the buffer I keep track of the buffer's fullness.
>If, during the creation process, the buffer ever gets over 90% full, I run a
>"strainer" on the buffer.  This strainer goes through the memory and drops
>unpopular entries (starting at "seen this position+move only one time") until
>the percent-full drops below 90% again.  The 90% is arbitrary but I figured the
>linear probing would be getting pretty slow at 90% full.
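>
>Roughly like this, continuing the sketch above.  One wrinkle: with
>linear probing you can't just zero out slots, because that breaks the
>probe chains, so this version collects the survivors and re-inserts
>them:
>
>    #include <string.h>   /* memset */
>
>    /* Drop unpopular entries until we are back under the high-water
>       mark, starting with entries seen only once. */
>    void book_strain(double max_fill)
>    {
>        unsigned int threshold = 1;
>
>        while ((double)table_used / table_size > max_fill) {
>            size_t i, nkeep = 0;
>            BookSlot *keep = malloc(table_used * sizeof(BookSlot));
>
>            for (i = 0; i < table_size; i++)
>                if (table[i].sig != 0 &&
>                    table[i].wins + table[i].draws + table[i].losses
>                        > threshold)
>                    keep[nkeep++] = table[i];
>
>            memset(table, 0, table_size * sizeof(BookSlot));
>            table_used = nkeep;
>
>            for (i = 0; i < nkeep; i++) {  /* re-hash survivors */
>                size_t j = (size_t)(keep[i].sig % table_size);
>                while (table[j].sig != 0)
>                    j = (j + 1) % table_size;
>                table[j] = keep[i];
>            }
>            free(keep);
>            threshold++;  /* still too full?  drop 2-timers next */
>        }
>    }
>
>So the intermediate strain during book building is book_strain(0.90).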
>
>On the NT machine with 1GB of memory, my buffer is large enough that the
>intermediate straining never has to happen.  On the FreeBSD machine it happens
>once, around game 300,000 in the PGN file.
>
>When the PGN file is exhausted I do a final strain on the book (to eliminate
>positions I have seen only once in all 500,000 games).  I do this to keep the
>book binary on disk a reasonable size.  Then I sort the buffer on position
>signature with a quicksort.  Once it's sorted I dump it out to disk.  That's my
>book.
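>
>And the final sort-and-dump, in the same sketch:
>
>    #include <stdio.h>
>
>    /* Order slots by position signature so the engine can binary
>       search the book file at runtime. */
>    static int slot_cmp(const void *a, const void *b)
>    {
>        const BookSlot *x = a, *y = b;
>        if (x->sig < y->sig) return -1;
>        if (x->sig > y->sig) return  1;
>        return 0;
>    }
>
>    void book_dump(const char *filename)
>    {
>        size_t i, n = 0;
>        FILE *fp = fopen(filename, "wb");
>
>        if (fp == NULL)
>            return;
>        for (i = 0; i < table_size; i++)  /* drop empty slots */
>            if (table[i].sig != 0)
>                table[n++] = table[i];
>
>        qsort(table, n, sizeof(BookSlot), slot_cmp);
>        fwrite(table, sizeof(BookSlot), n, fp);
>        fclose(fp);
>    }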
>
>I originally was building the book in 128MB of memory which, despite my trying
>to keep the size of the buffer sane, caused swapping and took all night to
>complete.  This really sucked as it was back when I was still debugging the book
>code and I'd go to sleep expecting to wake up to a nice new book only to find an
>assertion fired or a crash in the PGN parser code.  As I'm sure you know with a
>500,000 game PGN file you can't verify it by hand... and there are all sorts of
>crazy text constructs in there that cause the PGN reader trouble (e.g. foreign
>language piece letters, lines to separate PGN games, tournament standings, email
>headers, etc.).  I have taken to simply skipping any PGN game with "errors" in
>it and writing out "error in game N at line M." as the book is being built.  There
>are a ton of these, maybe 1% of the 500,000.
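>
>The driver loop ends up looking something like this in outline
>(PgnGame, pgn_read_game() and book_add_game() are stand-ins for
>whatever your parser actually provides):
>
>    int games = 0;
>    PgnGame g;
>
>    while (pgn_read_game(pgn_file, &g) != PGN_EOF) {
>        games++;
>        if (g.had_error) {     /* parser choked -- skip the game */
>            printf("error in game %d at line %d.\n",
>                   games, g.error_line);
>            continue;
>        }
>        book_add_game(&g);     /* play moves to ply 50, hash them */
>    }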
>
>But with memory so cheap nowadays I decided it was worth it to go out and buy a
>little more...  Also if you are on windows and the process is running with admin
>privileges, look up VirtualLock.  This helps.  There's a FreeBSD equivalent
>called mlock(2).
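>
>Something along these lines; the lock_buffer() wrapper is mine, but
>VirtualLock() and mlock() are the real calls (error handling omitted,
>and VirtualLock also wants the process working-set size raised for a
>buffer this big):
>
>    #ifdef _WIN32
>    #include <windows.h>
>    #else
>    #include <sys/mman.h>
>    #endif
>
>    /* Pin the book buffer so the OS can't page it out while we hash
>       half a million games through it. */
>    int lock_buffer(void *buf, size_t len)
>    {
>    #ifdef _WIN32
>        return VirtualLock(buf, len) ? 0 : -1;
>    #else
>        return mlock(buf, len);  /* may need root / raised rlimit */
>    #endif
>    }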
>
>>2) How much memory does the process take (size to RAM ratio).
>
>Unlike you commercial guys, I can assume that I can use the whole memory.  So I
>do... my buffer for creating the book is enormous so as to minimize the
>intermediate calls to the "strain" routine (which causes loss of information in
>the final book in order to speed up book generation).  So the ratio of buffer
>size to total RAM is approximately 9:10.
>
>Hope this helps,
>Scott

To give a quick answer:

Your RAM usage appears to be about the same as mine, but I find it very hard to
believe that you can fully process a half-million game file and create a 90MB
book file in 3 minutes. How is this possible? Just COPYING a 90MB file takes
about 30 seconds on my PIII-600. A half-million game PGN file is approximately
350MB, right?
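
(To put numbers on it: 500,000 games in 180 seconds is nearly 2,800 games per
second, parsed, hashed, and counted. Even at the 2,000 games/second that others
have reported for PGN reading alone -- see below -- just reading the file would
take over four minutes.)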

But you're right about general memory availability. We really can't COUNT ON
more than 30-50MB free at any one time (although a more typical machine today
would probably have well over 100MB free). So, this limits us quite a bit.

My test was with a book built from just under 150,000 games. It took about 250MB
of RAM (which ended up requiring about 100MB of swapping on my machine), and a
little less than 4 hours to process at a depth of 40 plies. The result (after
pruning zero-weight branches, which is, in effect, the same as your "straining"
process) was a file that was about 550K. If I had not pruned the zero-weight
branches, the file would have been about 6.1MB. Admittedly, though, this timing
was taken with a debug build, and I have not actually tried running it with a
release build.

However, I think our PGN reader code is one of the main bottlenecks. It appears
to be able to import only about 100 games/second, while nobody else has reported
anything slower than 2,000 games/second. That's probably something I should look
at.

Thanks for your input,

jm


