Computer Chess Club Archives


Subject: Re: To anyone who has ever automatically created an opening book

Author: Dann Corbit

Date: 18:20:28 11/09/01

On November 09, 2001 at 20:25:52, John Merlino wrote:

>On November 09, 2001 at 15:44:07, Scott Gasch wrote:
>
>>On November 09, 2001 at 11:54:42, John Merlino wrote:
>>
>>Hi John.  I don't have hard numbers available without re-creating the book.
>>Which I will do tonight when I am home from work... and report more details if
>>you want them.  But here is my recollection...
>>
>>>1) In general, how long does it take to create a book (i.e. size to time ratio).
>>
>>I create Monsoon's book from a PGN file with over 500,000 games in it.  (I got
>>this file from The Week in Chess (WIC) archives, pitt.edu, chessclub.com's
>>GM library, and other Internet sources.)  I have built the book on two machines
>>-- a 450 MHz/576 MB FreeBSD machine and a 1.2 GHz/1 GB Win2k machine.  It takes
>>about 10 minutes on the FreeBSD box and about 3 minutes on the Win2k box.
>>The final book binary is about 90 MB in size, but this is "processed"; see below.
>>
>>My book-building process is to allocate a huge buffer in memory and lock it in
>>so that it does not swap, then to use this buffer as a hash table.  The PGN
>>parser reads a game and "plays" each move _up to move 25_ (ply 50).  Each
>>position signature + next move is used as a hash key to create an entry in this
>>huge memory buffer (or, if the entry is already there and we've seen this
>>position+move before, to update that entry's win/draw/loss ratio).  I use linear
>>probing in the event of a collision in the hash.
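
For illustration only (a sketch of the scheme described above, not Monsoon's
actual code; the names and sizes are made up): assuming a 64-bit Zobrist-style
position signature and a 16-bit move encoding, the entry/update step could look
roughly like this in C:

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t sig;                 /* position signature; 0 marks an empty slot */
    uint16_t move;                /* the move played from this position        */
    uint32_t wins, draws, losses; /* running W/D/L counts for this entry       */
} BookEntry;

static BookEntry *table;          /* the big locked in-memory buffer           */
static size_t     table_size;     /* number of slots, a power of two           */

static void book_init(size_t slots)       /* slots must be a power of two      */
{
    table_size = slots;
    table = calloc(slots, sizeof(BookEntry));
}

/* Insert or update a (position, move) pair, resolving collisions with
   linear probing.  result is +1/0/-1 for a win/draw/loss. */
static void book_update(uint64_t sig, uint16_t move, int result)
{
    size_t i = (size_t)(sig ^ move) & (table_size - 1);

    for (;;) {
        BookEntry *e = &table[i];
        if (e->sig == 0) {                      /* empty slot: claim it        */
            e->sig  = sig;
            e->move = move;
        }
        if (e->sig == sig && e->move == move) { /* found (or just created)     */
            if (result > 0)      e->wins++;
            else if (result < 0) e->losses++;
            else                 e->draws++;
            return;
        }
        i = (i + 1) & (table_size - 1);         /* linear probe to next slot   */
    }
}
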
>>
>>When adding a new entry to the buffer I keep track of the buffer's fullness.
>>If, during the creation process, the buffer ever gets over 90% full I run a
>>"strainer" on the buffer.  This strainer goes through the memory and drops
>>unpopular entries (starting at "seen this position+move only one time") until
>>the percent-full drops below 90% again.  The 90% is arbitrary but I figured the
>>linear probing would be getting pretty slow at 90% full.
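
Continuing the same sketch: deleting entries in place would break the
linear-probe chains, so one simple (if memory-hungry) way to strain is to
re-insert the survivors into a fresh table, raising the popularity threshold
until occupancy falls back under the target:

/* Count occupied slots (in practice a running counter would be kept). */
static size_t book_used(void)
{
    size_t i, n = 0;
    for (i = 0; i < table_size; i++)
        if (table[i].sig != 0)
            n++;
    return n;
}

/* Re-insert a surviving entry, counts and all, by linear probing. */
static void book_reinsert(const BookEntry *e)
{
    size_t i = (size_t)(e->sig ^ e->move) & (table_size - 1);
    while (table[i].sig != 0)
        i = (i + 1) & (table_size - 1);
    table[i] = *e;
}

/* Drop unpopular entries, starting with "seen only once", until the
   table is below the target occupancy again. */
static void book_strain(double target_fill)
{
    uint32_t threshold = 1;

    while ((double)book_used() / (double)table_size > target_fill) {
        BookEntry *old = table;
        size_t i;

        table = calloc(table_size, sizeof(BookEntry));
        for (i = 0; i < table_size; i++) {
            uint32_t seen = old[i].wins + old[i].draws + old[i].losses;
            if (old[i].sig != 0 && seen > threshold)
                book_reinsert(&old[i]);
        }
        free(old);
        threshold++;    /* still too full?  drop the next tier as well */
    }
}

Here the caller would invoke book_strain(0.90) from book_update whenever the
fill check trips.  In practice you would keep a running fill counter instead of
rescanning, and compact in place rather than allocate a second table of the
same size.
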
>>
>>On the NT machine with 1 GB of memory, my buffer is large enough that the
>>intermediate straining never has to happen.  On the FreeBSD machine it happens
>>once, around game 300,000 in the PGN file.
>>
>>When the PGN file is exhausted I do a final strain on the book (to eliminate
>>positions I have seen only once in all 500,000 games).  I do this to keep the
>>book binary on disk a reasonable size.  Then I sort the buffer on position
>>signature with a quicksort.  Once it's sorted I dump it out to disk.  That's my
>>book.
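
And the final strain, sort, and dump could be little more than this (still the
illustrative layout from above; on-disk byte order and struct packing are
glossed over):

#include <stdio.h>

/* Order entries by position signature (then move) so the engine can
   binary search the book file at probe time. */
static int cmp_entry(const void *a, const void *b)
{
    const BookEntry *x = a, *y = b;
    if (x->sig != y->sig)
        return (x->sig < y->sig) ? -1 : 1;
    return (x->move < y->move) ? -1 : (x->move > y->move);
}

/* Compact the survivors to the front of the buffer, dropping anything
   seen only once, quicksort on signature, and write the result raw. */
static int book_write(const char *path)
{
    size_t i, n = 0;
    FILE *out;

    for (i = 0; i < table_size; i++) {
        uint32_t seen = table[i].wins + table[i].draws + table[i].losses;
        if (table[i].sig != 0 && seen >= 2)
            table[n++] = table[i];
    }
    qsort(table, n, sizeof(BookEntry), cmp_entry);

    out = fopen(path, "wb");
    if (out == NULL)
        return -1;
    fwrite(table, sizeof(BookEntry), n, out);
    return fclose(out) == 0 ? 0 : -1;
}
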
>>
>>I originally was building the book in 128 MB of memory which, despite my trying
>>to keep the size of the buffer sane, caused swapping and took all night to
>>complete.  This really sucked, as it was back when I was still debugging the book
>>code and I'd go to sleep expecting to wake up to a nice new book only to find an
>>assertion fired or a crash in the PGN parser code.  As I'm sure you know, with a
>>500,000 game PGN file you can't verify it by hand... and there are all sorts of
>>crazy text constructs in there that cause the PGN reader trouble (e.g. foreign
>>language piece letters, lines to separate PGN games, tournament standings, email
>>headers, etc.).  I have taken to simply skipping any PGN game with "errors" in it
>>and writing out "error in game N at line M." as the book is being built.  There
>>are a ton of these, maybe 1% of the 500,000.
>>
>>But with memory so cheap nowadays I decided it was worth it to go out and buy a
>>little more...  Also, if you are on Windows and the process is running with admin
>>privileges, look up VirtualLock.  This helps.  There's a FreeBSD equivalent
>>called vmlock or something like that(?).
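
The FreeBSD equivalent he has in mind is almost certainly mlock(2).  A rough
sketch of allocating and pinning the buffer on either platform (VirtualLock
needs the "lock pages in memory" privilege and, for a buffer this large,
probably an enlarged working set via SetProcessWorkingSetSize; mlock needs root
or a big enough RLIMIT_MEMLOCK):

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif
#include <stdlib.h>

/* Allocate the book-building buffer and try to pin it in physical RAM
   so the OS cannot page it out while the book is being built. */
static void *alloc_locked(size_t bytes)
{
    void *buf = malloc(bytes);
    if (buf == NULL)
        return NULL;
#ifdef _WIN32
    if (!VirtualLock(buf, bytes)) {
        /* Locking failed (working set too small?); the buffer is still
           usable, it is just pageable again. */
    }
#else
    if (mlock(buf, bytes) != 0) {
        /* Locking failed (RLIMIT_MEMLOCK too low?); same story. */
    }
#endif
    return buf;
}
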
>>
>>>2) How much memory does the process take (size to RAM ratio).
>>
>>Unlike you commercial guys, I can assume that I can use the whole memory.  So I
>>do... my buffer for creating the book is enormous so as to minimize the
>>intermediate calls to the "strain" routine (which causes loss of information in
>>the final book in order to speed up book generation).  So the ratio of buffer
>>size to total RAM is approximately 9:10.
>>
>>Hope this helps,
>>Scott
>
>To give a quick answer:
>
>Your RAM usage appears to be about the same as mine, but I find it very hard to
>believe that you can fully process a half-million game file and create a 90MB
>book file in 3 minutes. How is this possible? Just COPYING a 90MB file takes
>about 30 seconds on my PIII-600. A half-million game PGN file is approximately
>350MB, right?
>
>But you're right about general memory availability. We really can't COUNT ON
>more than 30-50MB free at any one time (although a more typical machine today
>would probably have well over 100MB free). So, this limits us quite a bit.
>
>My test was with a book with just under 150,000 games in it. It took about 250MB
>of RAM (which ended up requiring about 100MB of swapping on my machine), and a
>little less than 4 hours to process at a depth of 40 plies. The result (after
>pruning zero-weight branches, which is, in effect, the same as your "straining"
>process) was a file that was about 550K. If I had not pruned the zero-weight
>branches, the file would have been about 6.1MB. Admittedly, though, this timing
>is during a debugging process, and I have not actually tried running it with a
>release build.
>
>However, I think our PGN reader code is one of the main bottlenecks. It appears
>to only be able to import about 100 games/second, and nobody else has reported
>anything less than 2,000 games/second. That's probably something I should look
>at.

Why not store the data compressed (unless PGN format is wanted by the end user)?

The Pigbase format for Amiga computers uses a wonderful scheme.
It is the most compressed format I have ever seen, and you can look at the
source to see what he did.  Basically, moves are stored from the root position,
and each new unique move requires only one character, because they just encode
the move number (the nth move generated by the move generator).  Redundant
moves need to be stored only once.
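
The idea can be sketched without Pigbase's actual source: if the writer and the
reader share one deterministic move generator, each move can be stored as its
index in the generated move list, and since no chess position has more than a
couple of hundred legal moves that index always fits in a single byte.  In the
sketch below, Position, Move, generate_moves() and make_move() are stand-ins
for whatever the host engine provides:

#include <stdint.h>

typedef uint32_t Move;                /* engine-specific move encoding   */
typedef struct Position Position;     /* engine-specific board state     */

/* Assumed engine services: these must generate moves in the SAME order
   on the encoding and decoding side, or the indices mean nothing. */
extern int  generate_moves(const Position *pos, Move list[256]);
extern void make_move(Position *pos, Move m);

/* Encode one game as a string of move indices, one byte per ply. */
static int encode_game(Position *pos, const Move *game, int len, uint8_t *out)
{
    Move list[256];
    int ply, i, n;

    for (ply = 0; ply < len; ply++) {
        n = generate_moves(pos, list);
        for (i = 0; i < n; i++)
            if (list[i] == game[ply])
                break;
        if (i == n)
            return -1;                /* move not found: corrupt game    */
        out[ply] = (uint8_t)i;        /* store the index, not the move   */
        make_move(pos, game[ply]);
    }
    return len;
}

Decoding is symmetric: regenerate the list at each ply and play the out[ply]-th
move.  Storing the encoded games in a prefix tree then lets any shared opening
sequence be stored just once.
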

It's also possible that the games could be read directly from a SCID database,
which is very well compressed compared to PGN.  This would enable very rapid
reading, since 1/2 million games would fit in a small space.  Under SCID, 2.6
million games take only about 1/2 gigabyte, including all index data, etc.:

10/22/2001  06:59p         332,675,508 junkbase.sg
10/22/2001  06:59p         108,195,930 junkbase.si
10/22/2001  06:26p           7,556,461 junkbase.sn

SCID is also an open source project.  Obviously, a commercial venture cannot use
the SCID code for commercial purposes without permission (without "getting
GPL'ed").  But you could look it over for clever ideas.


