Computer Chess Club Archives



Subject: Re: To anyone who has ever automatically created an opening book

Author: John Merlino

Date: 11:35:35 11/10/01


On November 09, 2001 at 21:20:28, Dann Corbit wrote:

>On November 09, 2001 at 20:25:52, John Merlino wrote:
>
>>On November 09, 2001 at 15:44:07, Scott Gasch wrote:
>>
>>>On November 09, 2001 at 11:54:42, John Merlino wrote:
>>>
>>>Hi John.  I don't have hard numbers available without re-creating the book.
>>>Which I will do tonight when I am home from work... and report more details if
>>>you want them.  But here is my recollection...
>>>
>>>>1) In general, how long does it take to create a book (i.e. size to time ratio).
>>>
>>>I create monsoon's book from a PGN file with over 500,000 games in it.  (I got
>>>this file from the week in chess (WIC) archives, pitt.edu, and chessclub.com's
>>>GM library, and other Internet sources).  I have built the book on two machines
>>>-- a 450MHz/576MB FreeBSD machine and a 1.2GHz/1GB Win2k machine.  It takes
>>>about 10 minutes on the FreeBSD box.  It takes about 3 minutes on the NT box.
>>>The final book binary is about 90Mb in size but this is "processed", see below.
>>>
>>>My book building process is to allocate a huge buffer in memory and lock it in,
>>>so that it does not swap.  Then to use this buffer as a hash table.  The PGN
>>>parser reads a game and "plays" each move _up to move 25_ (ply 50).  Each
>>>position signature + next move is used as a hash key to create an entry in this
>>>huge memory buffer (or, if the entry is already there and we've seen this
>>>position+move before, to update that entry's win/draw/loss ratio).  I use linear
>>>probing in the event of a collision in the hash.
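In outline, the hashing step described above might look like the sketch below. The entry layout, the zero-key-means-empty sentinel, and the way the move is folded into the signature are illustrative assumptions, not Monsoon's actual code:

```c
#include <stdint.h>

/* One book entry: position signature combined with the move played,
 * plus the win/draw/loss tally for that position+move pair. */
typedef struct {
    uint64_t key;                 /* 0 = empty slot (assumed sentinel) */
    uint32_t wins, draws, losses;
} BookEntry;

#define TABLE_SIZE 65536u         /* toy size; the real buffer is far larger */
static BookEntry book_table[TABLE_SIZE];

/* Insert or update one (position, move) pair.
 * result: 2 = win, 1 = draw, 0 = loss (from the game's perspective). */
static BookEntry *book_update(uint64_t pos_sig, uint16_t move, int result)
{
    uint64_t key = pos_sig ^ ((uint64_t)move << 48); /* fold move into key */
    uint32_t i = (uint32_t)(key % TABLE_SIZE);

    /* Linear probing: walk forward until we find this key or an empty slot. */
    while (book_table[i].key != 0 && book_table[i].key != key)
        i = (i + 1) % TABLE_SIZE;

    book_table[i].key = key;
    if (result == 2)      book_table[i].wins++;
    else if (result == 1) book_table[i].draws++;
    else                  book_table[i].losses++;
    return &book_table[i];
}
```

Each parsed game would call `book_update` once per half-move up to the ply cutoff.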
>>>
>>>When adding a new entry to the buffer I keep track of the buffer's fullness.
>>>If, during the creation process, the buffer ever gets over 90% full I run a
>>>"strainer" on the buffer.  This strainer goes through the memory and drops
>>>unpopular entries (starting at "seen this position+move only one time") until
>>>the percent-full drops below 90% again.  The 90% is arbitrary but I figured the
>>>linear probing would be getting pretty slow at 90% full.
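A self-contained sketch of such a "strainer" pass, starting at a seen-once cutoff and raising it until occupancy drops below the target (entry layout assumed; note that a real linear-probing table would also have to re-insert survivors, since deleting entries breaks probe chains):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;    /* 0 = empty slot */
    uint32_t count;  /* times this position+move was seen */
} Entry;

/* Drop unpopular entries until the table is under max_fill_pct percent full.
 * Begins by dropping count==1 entries and raises the cutoff as needed.
 * Caveat: survivors should be rehashed afterward in a real probing table. */
static void strain(Entry *tab, size_t size, int max_fill_pct)
{
    uint32_t cutoff = 1;
    for (;;) {
        size_t used = 0;
        for (size_t i = 0; i < size; i++) {
            if (tab[i].key != 0 && tab[i].count <= cutoff)
                tab[i].key = 0;              /* drop unpopular entry */
            if (tab[i].key != 0)
                used++;
        }
        if (used * 100 < (size_t)max_fill_pct * size)
            return;                          /* occupancy target reached */
        cutoff++;                            /* still too full: raise the bar */
    }
}
```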
>>>
>>>On the NT machine with 1Gb of memory, my buffer is large enough that the
>>>intermediate straining never has to happen.  On the FreeBSD machine it happens
>>>once, around game 300,000 in the PGN file.
>>>
>>>When the PGN file is exhausted I do a final strain on the book (to eliminate
>>>positions I have seen only once in all 500,000 games).  I do this to keep the
>>>book binary on disk a reasonable size.  Then I sort the buffer on position
>>>signature with a quicksort.  Once it's sorted I dump it out to disk.  That's my
>>>book.
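The final sort-and-dump step could be done with the C library's quicksort, along these lines (entry layout and file name are illustrative assumptions):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    uint64_t key;                 /* position signature */
    uint32_t wins, draws, losses;
} Entry;

/* Comparator on position signature; the subtraction-free form avoids
 * overflow and truncation when comparing 64-bit keys in an int return. */
static int cmp_key(const void *a, const void *b)
{
    uint64_t ka = ((const Entry *)a)->key, kb = ((const Entry *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort the surviving entries and write them to disk in one pass.
 * Returns 0 on success, -1 on any I/O failure. */
static int dump_book(Entry *tab, size_t n, const char *path)
{
    qsort(tab, n, sizeof(Entry), cmp_key);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(tab, sizeof(Entry), n, f);
    fclose(f);
    return written == n ? 0 : -1;
}
```

Sorting by signature is what makes the on-disk book binary-searchable at probe time.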
>>>
>>>I originally was building the book in 128Mb of memory which, despite me trying
>>>to keep the size of the buffer sane, caused swapping and took all night to
>>>complete.  This really sucked as it was back when I was still debugging the book
>>>code and I'd go to sleep expecting to wake up to a nice new book only to find an
>>>assertion fired or a crash in the PGN parser code.  As I'm sure you know with a
>>>500,000 game PGN file you can't verify it by hand... and there are all sorts of
>>>crazy text constructs in there that cause the PGN reader trouble.  (e.g. foreign
>>>language piece letters, lines to separate PGN games, tournament standings, email
>>>headers, etc.)  I have taken to simply skipping any PGN game with "errors" in it
>>>and writing out "error in game N at line M." as the book is being built.  There
>>>are a ton of these, maybe 1% of the 500,000.
>>>
>>>But with memory so cheap nowadays I decided it was worth it to go out and buy a
>>>little more...  Also if you are on windows and the process is running with admin
>>>privileges, look up VirtualLock.  This helps.  There's a FreeBSD equivalent
>>>called vmlock or something like that(?).
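On FreeBSD and Linux the POSIX call for pinning pages is mlock(2); VirtualLock is the Windows counterpart. A minimal sketch of allocating the book buffer and attempting to lock it (locking needs elevated privileges or a raised memory-lock limit, so failure is treated as non-fatal here):

```c
#include <stdlib.h>
#include <sys/mman.h>

/* Allocate the book-building buffer and try to pin it in physical memory
 * so it never swaps.  *locked reports whether the lock succeeded; an
 * unprivileged process will typically get the buffer but not the lock. */
static void *alloc_locked_buffer(size_t bytes, int *locked)
{
    void *buf = malloc(bytes);
    if (!buf) return NULL;
    *locked = (mlock(buf, bytes) == 0);  /* best effort */
    return buf;
}
```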
>>>
>>>>2) How much memory does the process take (size to RAM ratio).
>>>
>>>Unlike you commercial guys, I can assume that I can use the whole memory.  So I
>>>do... my buffer for creating the book is enormous so as to minimize the
>>>intermediate calls to the "strain" routine (which causes loss of information in
>>>the final book in order to speed up book generation).  So the ratio of buffer
>>>to total RAM is approx 9:10.
>>>
>>>Hope this helps,
>>>Scott
>>
>>To give a quick answer:
>>
>>Your RAM usage appears to be about the same as mine, but I find it very hard to
>>believe that you can fully process a half-million game file and create a 90MB
>>book file in 3 minutes. How is this possible? Just COPYING a 90MB file takes
>>about 30 seconds on my PIII-600. A half-million game PGN file is approximately
>>350MB, right?
>>
>>But you're right about general memory availability. We really can't COUNT ON
>>more than 30-50MB free at any one time (although a more typical machine today
>>would probably have well over 100MB free). So, this limits us quite a bit.
>>
>>My test was with a book with just under 150,000 games in it. It took about 250MB
>>of RAM (which ended up requiring about 100MB of swapping on my machine), and a
>>little less than 4 hours to process at a depth of 40 plies. The result (after
>>pruning zero-weight branches, which is, in effect, the same as your "straining"
>>process) was a file that was about 550K. If I had not pruned the zero-weight
>>branches, the file would have been about 6.1MB. Admittedly, though, this timing
>>is during a debugging process, and I have not actually tried running it with a
>>release build.
>>
>>However, I think our PGN reader code is one of the main bottlenecks. It appears
>>to only be able to import about 100 games/second, and nobody else has reported
>>anything less than 2,000 games/second. That's probably something I should look
>>at.
>
>Why not store the data compressed (unless PGN format is wanted by the end user)?

The data is compressed. Each move takes 2 bytes. So the 150,000 PGN games were
reduced to about 550K.
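Two bytes per move fits a common packed encoding: 6 bits for the from-square, 6 for the to-square, and 4 left over for a promotion piece or flags. This is an illustrative encoding, not necessarily the one used in the product:

```c
#include <stdint.h>

/* Pack a move into 16 bits: squares are 0..63, promo is 0..15. */
static uint16_t pack_move(int from, int to, int promo)
{
    return (uint16_t)((from & 63) | ((to & 63) << 6) | ((promo & 15) << 12));
}

/* Recover the three fields from a packed move. */
static void unpack_move(uint16_t m, int *from, int *to, int *promo)
{
    *from  = m & 63;
    *to    = (m >> 6) & 63;
    *promo = (m >> 12) & 15;
}
```

At 2 bytes per move, a game truncated to 40 plies costs at most 80 bytes before any position merging, which is how 150,000 games can shrink to a few hundred kilobytes once shared openings collapse into common entries.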

The main problem is that we have to keep our default opening books pretty small,
as, once again, we have to think of our "minimum spec customer". Yet another
case of how we have to deal with things differently than everybody else here.

jm



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.