Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: SAN problem with Crafty's wall.pgn book

Author: Robert Hyatt

Date: 14:09:29 06/19/98

Go up one level in this thread


On June 19, 1998 at 11:04:52, Johan Melin wrote:

>On June 19, 1998 at 00:30:13, Robert Hyatt wrote:
>
>>On June 18, 1998 at 19:58:03, Edward Screven wrote:
>>
>>>On June 18, 1998 at 17:32:53, Robert Hyatt wrote:
>>>
>>>>On June 18, 1998 at 16:50:19, Edward Screven wrote:
>>>>
>>>>>my understanding of the crafty book building procedure is that
>>>>>you scan a pgn input file, streaming <position,win,loss,draw>
>>>>>records through an aggregating sort, and it's the disk sort
>>>>>runs that require lots of space.
>>>>>
>>>>>if this is correct, then a simple way to reduce your temporary
>>>>>space requirements by 1/N, at a cost of making N passes over
>>>>>the pgn input, is to partition the position keys into N equal
>>>>>sized ranges.  make N passes over the input pgn file.  on the
>>>>>i-th pass, discard all positions which are not in the i-th range.
>>>>>the independently sorted results of each pass can be appended to
>>>>>your final output file.
>>>>>
>>>>>  - edward
>>>>
>>>>
>>>>here's what happens.  I first parse the pgn and output (now) a 9
>>>>byte record for each move, 8 byte hash signature, 1 byte with result
>>>>of the game (3 bits) and the !/?/etc flags (5 bits).  This is streamed
>>>>out to a file.
>>>>
>>>>I then read this back in a huge chunk at a time, into a memory buffer,
>>>>call qsort() to sort (not a disk sort) and then save each of these
>>>>chunks in a separate file.  Just as I finish this, I have two copies
>>>>of everything (note I now write 9 byte records, I was writing 20 byte
>>>>records [linux] or 24 [windows]).
>>>>
>>>>I now delete the original unsorted input, then do a simple N-input
>>>>merge and write the book out with some indices on the front to give me
>>>>quick access.
>>>
>>>got it.  the next step you might consider is eliminating the first
>>>intermediate file.  instead of streaming the unordered records to
>>>a file, write them directly into the memory sort buffer.  each time
>>>the buffer fills, sort it and spill it to disk as a run.  merge and
>>>aggregate as before. that should cut the disk space required in half.
>>>
>>>if you want to try the key range partioning trick, you would apply
>>>it at the point you write the records into the memory buffer -- just
>>>discard the records that aren't part of the current pass.
>>>
>>>  - edward
>>
>>
>>it actually doesn't change a thing, because I first have to sort/write,
>>which takes O(N) space, then I have to merge, and during the merge, I
>>need O(2N) space.
>>
>>So the space doesn't change, although I have been working today on doing
>>the sort at the parsing step as you just suggested, because I wanted to
>>dump the extra I/O going on...
>
>In the intermidate files you got several identical positions occuring over and
>over .. but the identical ones will only be one entry in the book. And those
>that occur less than 3 times doesn't make it to the book at all.
>
>If I understand Edward correctly, he is suggesting that when you parse the PGN
>you only output the positions with a key that starts with, say 00001..., put
>those into memory, sort, eliminate doubles and those that only occure ones or
>twice. Then you append that to the bookfile. And that stuff wount need furter
>proceccing.
>
>Then you go on with a next pass taking only those that start with 00010,
>appending the output to the book ...
>
>Divide it into suffisiently many passes, and the intermidiate stuff wouldn't
>have to be stored on disk at all ...
>
>/Johan

This turns an O(N) operation (cpu complexity) and O(2N)  operation (Space)
into an O(N^2) (complexity) to get to O(N) space.  I don't like the trade
off.  I can build a 350,000 game book with 25,000,000 moves, in under 40
minutes on my P6, and under 10 minutes on a good alpha.  The factor of two
in space isn't worth the factor of N in speed, because N is *big*.



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.