Computer Chess Club Archives


Search

Terms

Messages

Subject: CDB implementation notes FYI

Author: Peter Klausler

Date: 16:34:43 01/23/98


Notes on CDB's implementation (1/23/98) - Peter Klausler

This file describes some points of interest in the implementation
of my free positional chess database program CDB for Win95/NT,
which is available from http://reality.sgi.com/pmk_craypark.


MEMORY MANAGEMENT

CDB implements its database as a contiguous range of virtual
memory.  New unsaved databases are created by simply reserving
a block of virtual memory; databases on disk are mapped into
the process's address space.  Since Windows 95's implementation
of copy-on-write semantics is lame, CDB maps databases from
disk into memory in read/write mode, and any changes to
a database become instantly permanent.  If I ran NT, I would
consider adding copy-on-write to CDB and thereby be somewhat
more protective of database integrity.

Expansion of a database can be problematic for new unsaved
databases, since the region of virtual memory immediately
following the range in use may not be available.  Expansion
of saved databases sometimes requires that the file be
unmapped, expanded, and remapped into a new and larger range.

Within a database, all pointers are byte offsets from the
beginning of the database, and are implemented with M$'s
__based() feature.  Conversions between based pointers and
absolute pointers can be a fruitful source of bugs, as can
comparisons with NULL.

Since databases are simply regions of virtual memory,
it is possible for CDB to map more than one at a time.
This capability is presently used only when creating
subset databases from gamesets, though a "merge" command
also seems useful and may be added in a future version.

CDB attempts to optimize for spatial and temporal locality
by allocating memory in the database in pools of like
objects.


ORGANIZATION

CDB comprises some basic chess routines (including PGN
parsing and formatting), its database technology, and
its user interface.  The lines between the three areas
are pretty clear, so that the basic chess stuff and the
database technology could be placed in DLLs someday.
CDB comprises about 20K lines of C++ code and header
files.


HASHING

CDB uses two representations for chess positions: a fairly
obvious scheme in which each square gets a byte to hold a
code for the side and man kind, and a compressed form.
The first scheme is used by routines that generate legal
moves, parse SAN notation, and other chess things.
The compressed position representation requires only 24 bytes
but is basically useless for chess computation.

The compressed chess position representation is interesting.
It contains a 64-bit bit map of the occupied squares (organized
by file rather than rank for better hash distribution), followed
by a sequence of 4-bit codes.  Each 4-bit code is a 1-bit
color indicator and a 3-bit man kind.  Castling rights are
represented by using two different codes for rooks that can
castle and rooks that can't.  Pawns that can be captured on
the next move via en-passant also have a distinguished code
value.  The side on the move is encoded in a very sneaky
fashion.  If the side not on the move has a pawn susceptible
to en-passant capture, that alone suffices.  If it doesn't,
then an illegal en-passant pawn will appear (either on the
wrong rank, or multiple en-passant pawns on the right rank).
And if the side on move has no pawns, then a capture must have
taken place in the game, and there is room in the compressed
position representation for an additional 4-bit code that
simply identifies the side on the move.  Total space is
therefore 8 bytes for the bit map and 32 x 4 bits for the
piece codes, or 24 bytes total.

Compressed positions are used in the database itself when
an actual position needs to be in the database, butusually only
the standard starting position has to be present.  They
are also used to compute hash codes.  CDB has a hash function
that generates two hash codes for a position: a primary hash
code of 16 bits and a secondary hash code of 8 bits.

The primary hash codes are used to index the 64K-entry hash
table of the database.  Each bucket in this table comprises
a linked list of positions.  Each position holds its
secondary 8-bit hash code and a link to the next position
with the same primary 16-bit hash code.

To search the database for a position, CDB first compresses
it and computes its hash codes.  It then goes to the hash
table and gets the list of positions corresponding to the
primary hash code.  CDB then traverses the list, looking for
positions with matching secondary hash codes.  Positions
that match are then extracted from the database (see below)
and compared with the original.

There are cases in which CDB is searching the database for
a position that is known in advance to be present.  These
searches can be much faster, for if only one position with
a matching secondary hash is found in the list, the final
comparison step can be omitted.


POSITION/MOVE GRAPH

CDB represents chess positions and the moves that link them.
It has four different representations, which complicates the
code enormously, but which allow large databases to be
kept in physical memory without thrashing.

The vast majority of positions are represented with the
"Tiny" form.  A Tiny position is one that has exactly
one known incoming move and one known outgoing move.
A Tiny position occupies only 12 bytes.  It contains
only a link to the preceding position, a 16-bit chess
move code, some flags, and the hash table overhead.
Reconstructing an actual chess position from a Tiny
position is a recursive process: first reconstruct
the single preceding position, then apply the incoming
move.

Positions that aren't Tiny are "Full" positions.  There
are three kinds of Full positions (Full, Big, and Huge)
depending on the amount of extra information associated
with the position.  Each Full position is linked to
two linked lists of Move records: one list of incoming
moves, and one list of outgoing moves.

A Move record represents a transition between any two
positions that aren't both Tiny.  It contains a 16-bit chess
move code, pointers to the positions before and after
the move, and two links (since each Move record is a
member of one position's linked list of incoming moves
and another (usually) position's linked list of outgoing
moves).  Move records also contain "crossover" bits
that are mentioned below.

Full positions contain counts of the number of games
containing the position, the number of those games
that White won, and the number of those games that
Black won.  These counters are 8 bits; if they
overflow, the Full position is bloated into a Big
position, which has 32-bit counters.

A Big position is a subclass of a Full position that
can contain a textual annotation of the position and
a link to a Variation record, if it is the final position
of one or more variations.  And as noted above,
a Big position has 32-bit game/White/Black counters.

A Huge position is a subclass of a Big position that
also contains a compressed chessboard position.  Huge
positions are required in the database for positions
that aren't reconstructable from moves.  The standard
starting position is a Huge position.  Huge positions
occupy 64 bytes.

For non-Huge positions, reconstructing the corresponding
state of the chessboard requires that CDB traverse the graph
back to a Huge position, copy the chessboard arrangement,
and then apply all of the intervening moves.  There are
lots of opportunities for optimizing this expensive
process, and CDB tries to exploit them.  For example,
when traversing forward in the database, CDB will
maintain a copy of the current chessboard position, apply
the moves it traverses, and hash it directly.

The "File->Database Statistics" command displays an
interesting summary of the contents of the database,
which should be easily understood now that you've
read up on the types of positional representations.

CDB maintains a table of pointers to positions that have
been identified as "flashcards".


DICTIONARIES

CDB maintains separate string dictionaries for players' names,
event names, and other strings.  The player name dictionary
has a reasonably intelligent normalization algorithm and
alias recognizer.  "Robert Fischer", "Fischer, R.", and
"Fischer, Robert J." are all recognized as the same player;
however, "Bobby Fischer" is thought to be another person,
and "Cray Blitz" is unfortunately renamed into "Blitz, Cray".
Name normalization is a problem that a person could spend
a whole lot of time trying to solve; CDB's capabilities
catch most of the common cases.  The user can now control
the name normalization and aliasing capabilities manually.


GAMES AND VARIATIONS

CDB represents chess games by adding their positions and
moves into the position/move graph, then noting the
final position of the main variation.  The moves of a line
can be reconstructed by walking backward from the final
position for a specific number of plies.  This reverse
traversal is guided at points where there is a choice
of multiple incoming moves by special Disambiguation
records that are stored with the Variation record.

A Game record contains pointers to entries in the player,
event, and general dictionaries, and a pointer to the main
variation.  A Variation record contains a count of the number of
plies in the line, a link to the Game record, links to parent,
child, and sibling variations, a list of Disambiguation records
that permit reconstruction of the moves, and a link to the next
variation record with the same final position.

CDB maintains a directory that maps game serial numbers to
the Game records themselves.  Game serial numbers are used
to identify games to the outside world and also in gamesets.

CDB uses two schemes for collecting the set of games that
contain a given position.  The first, which has a large but
fixed cost, is the straightforward approach of checking
every game via reverse traversal.  The second is a forward
scan from the position in question in search of Big positions
that link to main line Variation records.  CDB will initially attempt
the forward scan, but will abandon it if the game counts
get to be five times larger than the game count of the
starting position (due to transposition).

Forward scans are somewhat optimized by the use of "crossover
bits" in the Move records.  Each Move record contains a little
bit set that identifies, for the position "before" the move,
the incoming moves that led to it.  This prevents searches
that arrive at positions with incoming transpositions from
descending into fruitless outgoing moves.


COMPRESSED VARIATIONS

To save space and allow more games to be represented in a
limited amount of physical memory, CDB now supports a
hybrid representation.  Under user control, positions
that arise after a specified ply that constitute a
linear sequence of moves without transpositions,
alternatives, or annotations can be removed from the
database and replaced with a very compact (1 byte/move)
representation that is associated with the variation.

Positions that have been removed from the database in this
way are no longer accessible via the hash table.  CDB
may uncompress some moves when a new variation is entered
into the database.  CDB maintains an invariant rule that
no compressed variation begins with a move that is
redundantly represented in the usual ways (in a Tiny position
or a full Move record) and will adjust compressed variations
accordingly when new variations appear.  The upshot
of all this is that the database behaves normally from
the point of view of the user (apart from zero values for
compressed positions in the game counts) and occupies much
less space (400B/game vs. 1000); however, transpositions
into positions that have been compressed may not be detected,
thereby forfeiting one of the great capabilities of a
positional database program.  For this reason, CDB does
not compress positions by default before the 100'th ply.


GAMESETS

CDB provides gamesets as its primary means of searching the
database and manipulating collections of games.  Gamesets
are identified by name.  They link to the gamesets from which
they were created, and also contain their criteria.  The
games constituting a gameset are stored as a linked list of
contiguous integer ranges.  When a new game is added to the
database, CDB automatically adds its serial number to any
gameset to which it should belong by checking the criteria
of the gamesets against the tags of the game.

CDB permits the tags of games to be edited; the implementation
uses the crude approach of extracting the game from the
database, deleting it, updating the tags, and saving it anew.


USER INTERFACE

CDB's visible interfaces are built upon the M$ Foundation Classes
and the document/view architecture.  Most operations that
require significant time to execute take place in threads,
so that the main window remains responsive.  This code
is all necessarily pretty ugly, but much of it was generated
automatically.


CRAFTY

CDB can interface with Bob Hyatt's excellent and free chessplaying
program, Crafty.  CDB executes Crafty as a child process after
establishing anonymous pipes for Crafty's standard input, standard
output, and standard error files.  Background threads in CDB
monitor what comes out of Crafty's output and error pipes,
looking for "kibitz" messages to parse for the extraction
of scores and PVs.

It turns out to be astonishingly difficult to abnormally
terminate Crafty when it's in the middle of a long analysis.
This isn't Bob Hyatt's fault; it's Bill Gates'.  It turns
out to be a generally bad thing in Win32 for one process to
terminate another via the ::TerminateProcess () routine,
since nearly everything in Win95 takes place in DLL's,
and DLL's need to know when a process ends.  Unfortunately,
in Win32, it's a process' own responsibility to terminate
cleanly (via ::ExitProcess ()).  And about the only way
to force another process to call ::ExitProcess () is
interprocess communication, which Crafty would have to be
modified to do, or using ::CreateRemoteThread () to
start a new thread in Crafty that would just call ::ExitProcess (),
which is appalling, or to send Crafty a signal through its
console.

And here's where I became convinced that Windows is evil!
A process cannot send a signal to another process' console
unless they share that console.  And CDB, being a nice GUI
application, doesn't have a console.  So I now create
a console for CDB when it initiates its first child
process.  But there's a really ugly problem with ::AllocConsole ();
there's no way to minimize the console window that it creates!
And I didn't want to require the poor user to minimize the
black blank console window manually every time she ran CDB.
Sure, if I just let Crafty create its own console, I can
specify via ::CreateProcess () that it be minimized, but
then CDB wouldn't share the console, which was the point
of the whole exercise.  I wasted an entire day off of work
trying to work around this mess, trying things like
sending a SC_MINIMIZE event to the console input buffer,
and so on.  A console is clearly a window, but the API gives
the programmer no access to it; the documentation goes out of
its way to tell you that you can't get to it!

My eventual solution is horrible, but may be
of value to other programmers.  After creating the console
with ::AllocConsole (), I set its title to an obscure
string not likely to occur in nature.  I then use the
::EnumWindows () routine to traverse all of the top-level
windows on the display, looking for this string.  Once I
find it, I then use ::ShowWindow (SW_MINIMIZE) upon it.
If you know of a better approach, I'd sure like to hear
about it!



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.