Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: To anyone who has ever automatically created an opening book

Author: Scott Gasch

Date: 12:44:07 11/09/01

Go up one level in this thread


On November 09, 2001 at 11:54:42, John Merlino wrote:

Hi John.  I don't have hard numbers available without re-creating the book.
Which I will do tonight when I am home from work... and report more details if
you want them.  But here is my recollection...

>1) In general, how long does it take to create a book (i.e. size to time ratio).

I create monsoon's book from a PGN file with over 500,000 games in it.  (I got
this file from the week in chess (WIC) archives, pitt.edu, and chessclub.com's
GM library, and other Internet sources).  I have built the book on two machines
-- a 450Mhz/576Mb FreeBSD machine and a 1.2Ghz/1Gb Win2k machine.  It takes
about 10 minutes on the FreeBSD box.  It takes about 3 minutes on the NT box.
The final book binary is about 90Mb in size but this is "processed", see below.

My book building process is to allocate a huge buffer in memory and lock it in,
so that it does not swap.  Then to use this buffer as a hash table.  The PGN
parser reads a game and "plays" each move _up to move 25_ (ply 50).  Each
position signature + next move is used as a hash key to create an entry in this
huge memory buffer (or, if the entry is already there and we've seen this
position+move before, to update that entry's win/draw/loss ratio).  I use linear
probing in the event of a collision in the hash.

When adding a new entry to the buffer I keep track of the buffer's fullness.
If, during the creation process, the buffer ever gets over 90% full I run a
"strainer" on the buffer.  This strainer goes through the memory and drops
unpopular entries (starting at "seen this position+move only one time") until
the percent-full drops below 90% again.  The 90% is arbitrary but I figured the
linear probing would be getting pretty slow at 90% full.

On the NT machine with 1Gb of memory, my buffer is large enough that the
intermediate straining never has to happen.  On the FreeBSD machine it happens
once, aroudn game 300,000 in the PGN file.

When the PGN file is exhausted I do a final strain on the book (to eliminate
positions I have seen only once in all 500,000 games).  I do this to keep the
book binary on disk a reasonable size.  Then I sort the buffer on position
signature with a quicksort.  Once its sorted I dump it out to disk.  That's my
book.

I originally was building the book in 128Mb of memory which, despite me trying
to keep the size of the buffer sane, caused swapping and took all night to
complete.  This really sucked as it was back when I was still debugging the book
code and I'd go to sleep expecting to wake up to a nice new book only to find an
assertion fired or a crash in the PGN parser code.  As I'm sure you know with a
500,000 game PGN file you can't verify it by hand... and there are all sorts of
crazy text constucts in there that cause the PGN reader trouble.  (e.g. foriegn
language piece letters, lines to seperate PGN games, tournament standings, email
headers, etc.)  I have taken to simply skipping any PGN game with "errors" in it
and writing out "error in game N at line M." as the book it being built.  There
are a ton of these, maybe 1% of the 500,000.

But with memory so cheap nowadays I decided it was worth it to go out and buy a
little more...  Also if you are on windows and the process is running with admin
privileges, look up VirtualLock.  This helps.  There's a FreeBSD equivalent
called vmlock or something like that(?).

>2) How much memory does the process take (size to RAM ratio).

Unlike you commercial guys, I can assume that I can use the whole memory.  So I
do... my buffer for creating the book is enormous so as to minimize the
intermediate calls to the "strain" routine (which causes loss of information in
the final book in order to speed up book generation).  So ratio of buffer size
to total RAM is approx 9:10.

Hope this helps,
Scott



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.