Author: Dann Corbit
Date: 18:20:28 11/09/01
On November 09, 2001 at 20:25:52, John Merlino wrote:

>On November 09, 2001 at 15:44:07, Scott Gasch wrote:
>
>>On November 09, 2001 at 11:54:42, John Merlino wrote:
>>
>>Hi John. I don't have hard numbers available without re-creating the book, which I will do tonight when I am home from work... and I will report more details if you want them. But here is my recollection...
>>
>>>1) In general, how long does it take to create a book (i.e. size to time ratio)?
>>
>>I create Monsoon's book from a PGN file with over 500,000 games in it. (I got this file from The Week In Chess (WIC) archives, pitt.edu, chessclub.com's GM library, and other Internet sources.) I have built the book on two machines -- a 450MHz/576MB FreeBSD machine and a 1.2GHz/1GB Win2k machine. It takes about 10 minutes on the FreeBSD box and about 3 minutes on the NT box. The final book binary is about 90MB in size, but this is "processed"; see below.
>>
>>My book-building process is to allocate a huge buffer in memory and lock it in so that it does not swap, then to use this buffer as a hash table. The PGN parser reads a game and "plays" each move _up to move 25_ (ply 50). Each position signature + next move is used as a hash key to create an entry in this huge memory buffer (or, if the entry is already there and we've seen this position+move before, to update that entry's win/draw/loss ratio). I use linear probing in the event of a collision in the hash.
>>
>>When adding a new entry to the buffer I keep track of the buffer's fullness. If, during the creation process, the buffer ever gets over 90% full, I run a "strainer" on the buffer. This strainer goes through the memory and drops unpopular entries (starting with "seen this position+move only one time") until the percent-full drops below 90% again. The 90% is arbitrary, but I figured the linear probing would be getting pretty slow at 90% full.
>>
>>On the NT machine with 1GB of memory, my buffer is large enough that the intermediate straining never has to happen. On the FreeBSD machine it happens once, around game 300,000 in the PGN file.
>>
>>When the PGN file is exhausted I do a final strain on the book (to eliminate positions I have seen only once in all 500,000 games). I do this to keep the book binary on disk a reasonable size. Then I sort the buffer on position signature with a quicksort. Once it's sorted, I dump it out to disk. That's my book.
>>
>>I originally was building the book in 128MB of memory which, despite my trying to keep the size of the buffer sane, caused swapping and took all night to complete. This really sucked, as it was back when I was still debugging the book code, and I'd go to sleep expecting to wake up to a nice new book only to find an assertion had fired or the PGN parser code had crashed. As I'm sure you know, with a 500,000-game PGN file you can't verify it by hand... and there are all sorts of crazy text constructs in there that cause the PGN reader trouble (e.g. foreign-language piece letters, lines to separate PGN games, tournament standings, email headers, etc.). I have taken to simply skipping any PGN game with "errors" in it and writing out "error in game N at line M." as the book is being built. There are a ton of these, maybe 1% of the 500,000.
>>
>>But with memory so cheap nowadays, I decided it was worth it to go out and buy a little more... Also, if you are on Windows and the process is running with admin privileges, look up VirtualLock. This helps.
>>There's a FreeBSD equivalent called vmlock or something like that(?).
>>
>>>2) How much memory does the process take (size to RAM ratio)?
>>
>>Unlike you commercial guys, I can assume that I can use the whole memory. So I do... my buffer for creating the book is enormous so as to minimize the intermediate calls to the "strain" routine (which loses information from the final book in order to speed up book generation). So the ratio of buffer size to total RAM is approximately 9:10.
>>
>>Hope this helps,
>>Scott
>
>To give a quick answer:
>
>Your RAM usage appears to be about the same as mine, but I find it very hard to believe that you can fully process a half-million-game file and create a 90MB book file in 3 minutes. How is this possible? Just COPYING a 90MB file takes about 30 seconds on my PIII-600. A half-million-game PGN file is approximately 350MB, right?
>
>But you're right about general memory availability. We really can't COUNT ON more than 30-50MB free at any one time (although a more typical machine today would probably have well over 100MB free). So, this limits us quite a bit.
>
>My test was with a book with just under 150,000 games in it. It took about 250MB of RAM (which ended up requiring about 100MB of swapping on my machine), and a little less than 4 hours to process at a depth of 40 plies. The result (after pruning zero-weight branches, which is, in effect, the same as your "straining" process) was a file that was about 550K. If I had not pruned the zero-weight branches, the file would have been about 6.1MB. Admittedly, though, this timing is from a debugging session, and I have not actually tried running it with a release build.
>
>However, I think our PGN reader code is one of the main bottlenecks. It appears to only be able to import about 100 games/second, and nobody else has reported anything less than 2,000 games/second. That's probably something I should look at.

Why not store the data compressed (unless PGN format is wanted by the end user)? The Pigbase format for Amiga computers uses a wonderful scheme. It is the most compressed format I have ever seen, and you can look at the source to see what the author did. Basically, moves are stored from the root position, and each new unique move requires only one character, because they just encode the move number (the nth move generated by the move generator). Redundant moves need to be stored only once. (A rough sketch of the idea is below.)

It's also possible that the games could be read directly from a SCID database, which compresses very well from PGN. This would enable very rapid reading, since half a million games would fit in a small space. 2.6 million games take only half a gigabyte, including all index data, etc., under SCID:

10/22/2001  06:59p       332,675,508 junkbase.sg
10/22/2001  06:59p       108,195,930 junkbase.si
10/22/2001  06:26p         7,556,461 junkbase.sn

SCID is also an open source project. Obviously, a commercial venture cannot use the SCID code for commercial purposes without permission (without "getting GPL'ed"). But you could look it over for clever ideas.
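
For the curious, here is a minimal sketch in C of the kind of in-memory book builder Scott describes above: a big hash table keyed on position signature plus move, linear probing on collisions, a "strainer" that throws out unpopular entries, and a final sort-by-signature before dumping to disk. This is not Monsoon's actual code; the entry layout, table size, and function names are just illustrative.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint64_t sig;                    /* signature (e.g. Zobrist hash) of the position     */
    uint16_t move;                   /* the move played from this position                */
    uint32_t wins, draws, losses;    /* results of games that reached sig and played move */
} BookEntry;

#define TABLE_SIZE (1u << 24)        /* number of slots; sized to fit in physical RAM */
static BookEntry *table;             /* sig == 0 marks an empty slot */
static size_t used;

static int book_init(void)
{
    table = calloc(TABLE_SIZE, sizeof *table);
    /* On Unix you could mlock() this buffer (VirtualLock() on Windows) so it never swaps. */
    return table != NULL;
}

/* Record one (position, move, result) observation; linear probe on collisions. */
static void book_add(uint64_t sig, uint16_t move, int result)  /* 1 = win, 0 = draw, -1 = loss */
{
    size_t i = (size_t)(sig ^ move) % TABLE_SIZE;
    while (table[i].sig != 0 && !(table[i].sig == sig && table[i].move == move))
        i = (i + 1) % TABLE_SIZE;
    if (table[i].sig == 0) {         /* first time we've seen this position+move */
        table[i].sig  = sig;
        table[i].move = move;
        used++;
    }
    if      (result > 0) table[i].wins++;
    else if (result < 0) table[i].losses++;
    else                 table[i].draws++;
    /* A real builder would call strain() whenever used exceeds ~90% of TABLE_SIZE. */
}

/* "Strainer": drop entries seen min_count times or fewer.  (A real version would
   rehash the survivors, since simply clearing slots breaks linear-probe chains.) */
static void strain(uint32_t min_count)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].sig && table[i].wins + table[i].draws + table[i].losses <= min_count) {
            memset(&table[i], 0, sizeof table[i]);
            used--;
        }
}

static int by_signature(const void *a, const void *b)
{
    uint64_t x = ((const BookEntry *)a)->sig, y = ((const BookEntry *)b)->sig;
    return (x > y) - (x < y);
}

/* Final pass: strain the singletons, sort by signature, and dump the occupied
   slots to disk.  The sorted file can then be binary-searched at game time. */
static void book_finish(const char *path)
{
    strain(1);
    qsort(table, TABLE_SIZE, sizeof table[0], by_signature);  /* empty slots sort first */
    FILE *f = fopen(path, "wb");
    if (!f)
        return;
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].sig)
            fwrite(&table[i], sizeof table[i], 1, f);
    fclose(f);
}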
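
And here is a rough sketch of the Pigbase-style move encoding mentioned above, assuming you already have a deterministic legal move generator. gen_moves(), make_move(), Position, and Move are placeholders here, not any particular program's API. Each move is stored as one byte: its index in the move list generated from the current position, so the writer and reader only need to agree on the generator and its move ordering.

#include <stdint.h>

typedef struct Position Position;            /* placeholder for an engine's board state */
typedef uint16_t Move;

/* Assumed to exist: a deterministic legal move generator and a move maker. */
int  gen_moves(const Position *pos, Move out[256]);   /* returns number of legal moves */
void make_move(Position *pos, Move m);

/* Encode a game as one byte per move: each byte is the move's index in the
   list the generator produces for the current position.  Returns the number
   of bytes written, or -1 if a move is not found (illegal move or a
   writer/reader generator mismatch). */
int encode_game(Position *pos, const Move *game, int nmoves, uint8_t *out)
{
    Move list[256];
    for (int ply = 0; ply < nmoves; ply++) {
        int n = gen_moves(pos, list);
        int idx = -1;
        for (int i = 0; i < n; i++)
            if (list[i] == game[ply]) { idx = i; break; }
        if (idx < 0)
            return -1;
        out[ply] = (uint8_t)idx;             /* "the nth move generated by the move generator" */
        make_move(pos, game[ply]);
    }
    return nmoves;
}

/* Decoding reverses the process: regenerate the moves and pick the stored index. */
int decode_game(Position *pos, const uint8_t *bytes, int nmoves, Move *out)
{
    Move list[256];
    for (int ply = 0; ply < nmoves; ply++) {
        int n = gen_moves(pos, list);
        if ((int)bytes[ply] >= n)
            return -1;                       /* corrupt data or generator mismatch */
        out[ply] = list[bytes[ply]];
        make_move(pos, out[ply]);
    }
    return nmoves;
}

Presumably the "redundant moves stored only once" part comes from storing the games as a tree, so shared opening lines share their encoded prefix; the byte-per-move trick above is just the per-move encoding.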