Author: Peter Klausler
Date: 16:34:43 01/23/98
Notes on CDB's implementation (1/23/98) - Peter Klausler This file describes some points of interest in the implementation of my free positional chess database program CDB for Win95/NT, which is available from http://reality.sgi.com/pmk_craypark. MEMORY MANAGEMENT CDB implements its database as a contiguous range of virtual memory. New unsaved databases are created by simply reserving a block of virtual memory; databases on disk are mapped into the process's address space. Since Windows 95's implementation of copy-on-write semantics is lame, CDB maps databases from disk into memory in read/write mode, and any changes to a database become instantly permanent. If I ran NT, I would consider adding copy-on-write to CDB and thereby be somewhat more protective of database integrity. Expansion of a database can be problematic for new unsaved databases, since the region of virtual memory immediately following the range in use may not be available. Expansion of saved databases sometimes requires that the file be unmapped, expanded, and remapped into a new and larger range. Within a database, all pointers are byte offsets from the beginning of the database, and are implemented with M$'s __based() feature. Conversions between based pointers and absolute pointers can be a fruitful source of bugs, as can comparisons with NULL. Since databases are simply regions of virtual memory, it is possible for CDB to map more than one at a time. This capability is presently used only when creating subset databases from gamesets, though a "merge" command also seems useful and may be added in a future version. CDB attempts to optimize for spatial and temporal locality by allocating memory in the database in pools of like objects. ORGANIZATION CDB comprises some basic chess routines (including PGN parsing and formatting), its database technology, and its user interface. The lines between the three areas are pretty clear, so that the basic chess stuff and the database technology could be placed in DLLs someday. CDB comprises about 20K lines of C++ code and header files. HASHING CDB uses two representations for chess positions: a fairly obvious scheme in which each square gets a byte to hold a code for the side and man kind, and a compressed form. The first scheme is used by routines that generate legal moves, parse SAN notation, and other chess things. The compressed position representation requires only 24 bytes but is basically useless for chess computation. The compressed chess position representation is interesting. It contains a 64-bit bit map of the occupied squares (organized by file rather than rank for better hash distribution), followed by a sequence of 4-bit codes. Each 4-bit code is a 1-bit color indicator and a 3-bit man kind. Castling rights are represented by using two different codes for rooks that can castle and rooks that can't. Pawns that can be captured on the next move via en-passant also have a distinguished code value. The side on the move is encoded in a very sneaky fashion. If the side not on the move has a pawn susceptible to en-passant capture, that alone suffices. If it doesn't, then an illegal en-passant pawn will appear (either on the wrong rank, or multiple en-passant pawns on the right rank). And if the side on move has no pawns, then a capture must have taken place in the game, and there is room in the compressed position representation for an additional 4-bit code that simply identifies the side on the move. Total space is therefore 8 bytes for the bit map and 32 x 4 bits for the piece codes, or 24 bytes total. Compressed positions are used in the database itself when an actual position needs to be in the database, butusually only the standard starting position has to be present. They are also used to compute hash codes. CDB has a hash function that generates two hash codes for a position: a primary hash code of 16 bits and a secondary hash code of 8 bits. The primary hash codes are used to index the 64K-entry hash table of the database. Each bucket in this table comprises a linked list of positions. Each position holds its secondary 8-bit hash code and a link to the next position with the same primary 16-bit hash code. To search the database for a position, CDB first compresses it and computes its hash codes. It then goes to the hash table and gets the list of positions corresponding to the primary hash code. CDB then traverses the list, looking for positions with matching secondary hash codes. Positions that match are then extracted from the database (see below) and compared with the original. There are cases in which CDB is searching the database for a position that is known in advance to be present. These searches can be much faster, for if only one position with a matching secondary hash is found in the list, the final comparison step can be omitted. POSITION/MOVE GRAPH CDB represents chess positions and the moves that link them. It has four different representations, which complicates the code enormously, but which allow large databases to be kept in physical memory without thrashing. The vast majority of positions are represented with the "Tiny" form. A Tiny position is one that has exactly one known incoming move and one known outgoing move. A Tiny position occupies only 12 bytes. It contains only a link to the preceding position, a 16-bit chess move code, some flags, and the hash table overhead. Reconstructing an actual chess position from a Tiny position is a recursive process: first reconstruct the single preceding position, then apply the incoming move. Positions that aren't Tiny are "Full" positions. There are three kinds of Full positions (Full, Big, and Huge) depending on the amount of extra information associated with the position. Each Full position is linked to two linked lists of Move records: one list of incoming moves, and one list of outgoing moves. A Move record represents a transition between any two positions that aren't both Tiny. It contains a 16-bit chess move code, pointers to the positions before and after the move, and two links (since each Move record is a member of one position's linked list of incoming moves and another (usually) position's linked list of outgoing moves). Move records also contain "crossover" bits that are mentioned below. Full positions contain counts of the number of games containing the position, the number of those games that White won, and the number of those games that Black won. These counters are 8 bits; if they overflow, the Full position is bloated into a Big position, which has 32-bit counters. A Big position is a subclass of a Full position that can contain a textual annotation of the position and a link to a Variation record, if it is the final position of one or more variations. And as noted above, a Big position has 32-bit game/White/Black counters. A Huge position is a subclass of a Big position that also contains a compressed chessboard position. Huge positions are required in the database for positions that aren't reconstructable from moves. The standard starting position is a Huge position. Huge positions occupy 64 bytes. For non-Huge positions, reconstructing the corresponding state of the chessboard requires that CDB traverse the graph back to a Huge position, copy the chessboard arrangement, and then apply all of the intervening moves. There are lots of opportunities for optimizing this expensive process, and CDB tries to exploit them. For example, when traversing forward in the database, CDB will maintain a copy of the current chessboard position, apply the moves it traverses, and hash it directly. The "File->Database Statistics" command displays an interesting summary of the contents of the database, which should be easily understood now that you've read up on the types of positional representations. CDB maintains a table of pointers to positions that have been identified as "flashcards". DICTIONARIES CDB maintains separate string dictionaries for players' names, event names, and other strings. The player name dictionary has a reasonably intelligent normalization algorithm and alias recognizer. "Robert Fischer", "Fischer, R.", and "Fischer, Robert J." are all recognized as the same player; however, "Bobby Fischer" is thought to be another person, and "Cray Blitz" is unfortunately renamed into "Blitz, Cray". Name normalization is a problem that a person could spend a whole lot of time trying to solve; CDB's capabilities catch most of the common cases. The user can now control the name normalization and aliasing capabilities manually. GAMES AND VARIATIONS CDB represents chess games by adding their positions and moves into the position/move graph, then noting the final position of the main variation. The moves of a line can be reconstructed by walking backward from the final position for a specific number of plies. This reverse traversal is guided at points where there is a choice of multiple incoming moves by special Disambiguation records that are stored with the Variation record. A Game record contains pointers to entries in the player, event, and general dictionaries, and a pointer to the main variation. A Variation record contains a count of the number of plies in the line, a link to the Game record, links to parent, child, and sibling variations, a list of Disambiguation records that permit reconstruction of the moves, and a link to the next variation record with the same final position. CDB maintains a directory that maps game serial numbers to the Game records themselves. Game serial numbers are used to identify games to the outside world and also in gamesets. CDB uses two schemes for collecting the set of games that contain a given position. The first, which has a large but fixed cost, is the straightforward approach of checking every game via reverse traversal. The second is a forward scan from the position in question in search of Big positions that link to main line Variation records. CDB will initially attempt the forward scan, but will abandon it if the game counts get to be five times larger than the game count of the starting position (due to transposition). Forward scans are somewhat optimized by the use of "crossover bits" in the Move records. Each Move record contains a little bit set that identifies, for the position "before" the move, the incoming moves that led to it. This prevents searches that arrive at positions with incoming transpositions from descending into fruitless outgoing moves. COMPRESSED VARIATIONS To save space and allow more games to be represented in a limited amount of physical memory, CDB now supports a hybrid representation. Under user control, positions that arise after a specified ply that constitute a linear sequence of moves without transpositions, alternatives, or annotations can be removed from the database and replaced with a very compact (1 byte/move) representation that is associated with the variation. Positions that have been removed from the database in this way are no longer accessible via the hash table. CDB may uncompress some moves when a new variation is entered into the database. CDB maintains an invariant rule that no compressed variation begins with a move that is redundantly represented in the usual ways (in a Tiny position or a full Move record) and will adjust compressed variations accordingly when new variations appear. The upshot of all this is that the database behaves normally from the point of view of the user (apart from zero values for compressed positions in the game counts) and occupies much less space (400B/game vs. 1000); however, transpositions into positions that have been compressed may not be detected, thereby forfeiting one of the great capabilities of a positional database program. For this reason, CDB does not compress positions by default before the 100'th ply. GAMESETS CDB provides gamesets as its primary means of searching the database and manipulating collections of games. Gamesets are identified by name. They link to the gamesets from which they were created, and also contain their criteria. The games constituting a gameset are stored as a linked list of contiguous integer ranges. When a new game is added to the database, CDB automatically adds its serial number to any gameset to which it should belong by checking the criteria of the gamesets against the tags of the game. CDB permits the tags of games to be edited; the implementation uses the crude approach of extracting the game from the database, deleting it, updating the tags, and saving it anew. USER INTERFACE CDB's visible interfaces are built upon the M$ Foundation Classes and the document/view architecture. Most operations that require significant time to execute take place in threads, so that the main window remains responsive. This code is all necessarily pretty ugly, but much of it was generated automatically. CRAFTY CDB can interface with Bob Hyatt's excellent and free chessplaying program, Crafty. CDB executes Crafty as a child process after establishing anonymous pipes for Crafty's standard input, standard output, and standard error files. Background threads in CDB monitor what comes out of Crafty's output and error pipes, looking for "kibitz" messages to parse for the extraction of scores and PVs. It turns out to be astonishingly difficult to abnormally terminate Crafty when it's in the middle of a long analysis. This isn't Bob Hyatt's fault; it's Bill Gates'. It turns out to be a generally bad thing in Win32 for one process to terminate another via the ::TerminateProcess () routine, since nearly everything in Win95 takes place in DLL's, and DLL's need to know when a process ends. Unfortunately, in Win32, it's a process' own responsibility to terminate cleanly (via ::ExitProcess ()). And about the only way to force another process to call ::ExitProcess () is interprocess communication, which Crafty would have to be modified to do, or using ::CreateRemoteThread () to start a new thread in Crafty that would just call ::ExitProcess (), which is appalling, or to send Crafty a signal through its console. And here's where I became convinced that Windows is evil! A process cannot send a signal to another process' console unless they share that console. And CDB, being a nice GUI application, doesn't have a console. So I now create a console for CDB when it initiates its first child process. But there's a really ugly problem with ::AllocConsole (); there's no way to minimize the console window that it creates! And I didn't want to require the poor user to minimize the black blank console window manually every time she ran CDB. Sure, if I just let Crafty create its own console, I can specify via ::CreateProcess () that it be minimized, but then CDB wouldn't share the console, which was the point of the whole exercise. I wasted an entire day off of work trying to work around this mess, trying things like sending a SC_MINIMIZE event to the console input buffer, and so on. A console is clearly a window, but the API gives the programmer no access to it; the documentation goes out of its way to tell you that you can't get to it! My eventual solution is horrible, but may be of value to other programmers. After creating the console with ::AllocConsole (), I set its title to an obscure string not likely to occur in nature. I then use the ::EnumWindows () routine to traverse all of the top-level windows on the display, looking for this string. Once I find it, I then use ::ShowWindow (SW_MINIMIZE) upon it. If you know of a better approach, I'd sure like to hear about it!
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.