# Computer Chess Club Archives

## Messages

### Subject: Algorithm for Encoding Any Position in 155 bits (part 2/2)

Author: Ratko V Tomic

Date: 00:39:38 11/15/99

Go up one level in this thread

```--- message part 2/2 ---

3. HUFFMAN CODING OF MC26 INDEX
--------------------------------

So far we have obtained a variable length code for all chess positions
consisting of the fixed part MC26 code, specifying the material
content, and a variable length part MP(MC) (given as placement index I
in eq.(9) of the section 2.2), specifying the placement of the material
given by the MC. Since the worst case MP code has length of 142 whole
bits and MC26 code will take 26 whole bits, the worst case encoding so
far will use 168 bits.

On the other hand, since the total number of positions (accepted by
this algorithms) is N=2.46*10^46=154.1 bits, we can use Huffman coding
for the MC26 codes, which will produce uniform full code length of 155
whole bits. For a given MC26 "symbol" and its NP(MC) placements, we
assign a Huffman weight (or probability) w(MC)=NP(MC)/N. The sum of all
these weights over all 58,084,310 MC states is 1, i.e. they're a proper
Huffman weights. Therefore the Huffman code length of each of these
codes, in bits, is:

HL(MC) = -log2(w(MC)) = log2(N)-log2(NP(MC)) = 154.1 - log2(NP)  (11)

where the final result must be rounded to the next whole bit. Since,
for the given MC, the length of the MP part of the code is L(MP) =
log2(NP), the combined length of the full code HL(MC) + L(MP) = 154.1,
i.e. the full code is now a fixed length code, fitting in 155 whole
bits. The Huffman code lengths of the MC26 "symbols" are merely the
completion of the variable MP code length to the fixed 155 bits total.
Therefore our earlier worst case MP using 142 whole bits will have its
H(MC26) code using 154.1097-141.1382 = 12.9715 bits, or 13 whole bits.
The longest Huffman code for an MC26 symbol will occur for the shortest
MP case, which is the case of 2 lone kings, which will have (accounting
for checks) NP=4*60+24*58+36*55=3612 = 11.8186 bits, having thus HL(MC)
= 154.1097-11.8186 = 142.3 bits, i.e. 143 whole bits are used to
specify that there are only 2 kings on the board and the remaining 12
whole bits (the MP part of the code) specify where the kings are.

3.1 CONSTRUCTING HUFFMAN CODES
------------------------------

A compact and quick way to construct Huffman codes, knowing their
lengths in bits,  is to sort Huffman codes by their length and assign
sequential Huffman code numbers within each length group. In our case
we have N symbols to convert into Huffman codes (N=58,084,310 MC26
codes) and for each symbol MC1, MC2, ..., MCN we have a length L1,
L2,.. ,LN (obtained via (11), i.e. by completing the MP code length to
154.1 bits, then rounding the result up). The shortest code length was
13 whole bits, and the longest 143 whole bits, i.e. we have NL=131
fixed length groups: G1, G2, ..., G131, containing C1, C2,...,C131
codes respectively, and C1+C2+..+C131=N. The groups sizes, C1,C2,...
and the codes separation by lengths are obtained in a single pass
through the codes MC1, MC2,..MCN, assigning them each to its own bin
(out of 131 bins) by its length. We end up with 131 arrays, with
element counts C1, C2,.. C131, and containing MC26 symbols as their
elements.

(Note: all these steps are not part of encoding/decoding process but
belong to creation of the program's constants tables, hence their space
& time requirements are not part of the coder/decoder run-time space &
time requirements.)

We can now construct Huffman codes by starting with G1, which has HL=13
and the count of the 13 bit codes is C1. We simply assign the Huffman
codes to the MC26 symbols (stored in the bin G1), as sequential 13 bit
numbers from H0=0 to C1-1 (the Huffman codes satisfy Kraft inequality
for variable length codes, thus it will have C1<8192 i.e. C1-1 will not
be the largest 13 bit number, 0x1FFF). The we move to bin G2, with
length 14 bits (or more if no codes happen to be 14 bits long). We
increment the last code from bin 1, C1-1 to C1 and pad it with 0 bits
to the length of the Huffman codes in G2, creating thus starting
Huffman code H0 for the bin G2. The C2 Huffman codes are then
sequential C2 numbers of given length starting with H0. Then we
construct the starting code for the next group by using as its prefix
H0+C2 in the G2 bit length, then pad it with 0 bits to the bit length
for the bin G3, obtaining thus starting code H0 for the G3. We proceed
that way through entire list through G131.

For each group G1..G131 we don't need to keep Huffman codes for all
58,084,310 MC26 symbols, but only the starting Huffman code for that
group Gk, H0(k) and the count of codes Ck. In each group we also keep
the MC26 array of codes belonging to that length bin. So the entire
Huffman array will contain (in unoptimized form) around 58 million
entries. This can be easily cut in about half, to 29,045,304, by noting
that MC=(MC1,MC2) will belong in the same bin as the reversed color
state, MC'=(MC2,MC1), hence we only need to keep single MC26 code
wheneves MC1 is different than MC2 (which is true for all but 6298
cases).

3.2 ENCODING/DECODING HUFFMAN CODES
-----------------------------------

In addition to this space, we need a small decoder binary tree, which
has 131 leaf nodes, allowing decoder to find the prefix bin for
decoding given full length Huffman code HC. Once the bin is found (in
at most 8 compare branches), decoder constructs HCI=HC-H0 (arithmetic
precision in the bit length defined by that that bin), which is the
index in the array of the MC26 codes belonging to that length bin, and
retrieves the MC26 code via a single array lookup. The remaining part
of the full code (after stripping the Huffman part, whose length was
determined once its bin was found) is the MP index, which decodes as
described in the previous section.

The Huffman encoder doesn't need additional data, beyond the table
above. It receives MC26 code and computes its length (by calculating
number of placements, the NP value, and subtracting it from 154.1, then
rounding up) and within the located length bin it performs a binary
search on the stored MC26 codes (i.e. they have to be sorted within the
bin), finding thus the offset HCI which needs to be added to the base
Huffman code, H0,  for this bin to get the full Huffman code HC for
this MC26 code (i.e. HC=HC0+HCI).

3.3 REDUCING THE SIZE OF HUFFMAN TABLE
--------------------------------------

Our Huffman table with its 29 millions 26-bit numbers, although well
within capabilities of todays PC's, is still largely redundant. Namely,
we have already noted how exchanging black & white MC state doesn't
change the Huffman bin and that fact alone had cut in half the table
size. In fact the Huffman bin assignement of some MC26 code depended
only on the length of the Huffman code, which in turn depended only on
the number of placements, the NP number. But this number is insensitive
to great many other swaps. Consider an MC26 state which has 2 W.Rooks
and 3 W.Knights. That state has same NP as otherwise identical MC26
state, except that it has 3 W.Rooks and 2 W.Knights. In other words,
the formula for NP depends only on count values, not on which specific
piece has those counts (it was our arbitrary convention how to order
placement of the piece types). Therefore we can swap those counts among
themselves, producing the new MC26 states whenever we exchange 2
different count values, and still remain in the same Huffman bin.

Roughly speaking, if an MC record has counts {C1,C2,...,C12} for the
12 piece types, we need to store in the Huffman bin only the single
representative of all these permuted MC cases and use the permutation
index (which is mathematically equivalent to our placement index
construction in section 2) to compute Huffman codes for all the
remaining records obtained through the permutations of the
representative record. Now, due to the legality constraints (for
promotions vs captures), as well as the special treatment of pawns in
computing the NP, we have to observe some constraints on these
permutation among the piece type counts:

C1. Pawn counts cannot be exchanged with nonpawn counts

C2. To preserve excess & defect values for each side, as they
appear in the constraints (1w) and (1b), we cannot exchange
any king or any Queen with another piece (since K doesn't have
excess or defect, and Queen has different excess/defect from R,
N or B with the same count).

Consequently, we will only swap R, N, B among themselves and r,n,b
among themselves, and black & white as a whole. More complicated checks
on the legality could utilize greater degree of symmetry i.e. they
could generate many more swaps among the counts which preserve material
(il)legality status and which keep the NP unchanged. But even these few
simple symmetry cases reduce the 58 million entry table down to
1,839,500 representative entries for the MC26 codes (this count was
produced by the program which counted the positions in the initial post
of this thread). The representative entries would always have RNB and
rnb counts ordered in non-decreasing order. Neither the permuted MC
codes, nor the number of such permutations need to be stored in the
table (since both are quickly computable from the representative state,
or can be looked up even quicker in small precomputed tables), and the
representative case can thus act to the decoder & encoder exactly as
the fully expanded permutations (at some loss of speed).

Note 1: In section 1 an approximate method was proposed to avoid using
the 58 million entry table of the MC26 states (at the expense of 0.096
bits). The symmetry method with representative cases, as described
here, is an alternative, which at the expense of 1.8 million entry
table (which ends up being used by the Huffman encoder/decoder anyway)
can be made to work without any approximation (but a bit slower).

Note 2: The ep and castle status, which on average have a very low
bit overhead (in fractions of a bit), due to the uniform final code
length achieved via Huffman coding, are reducable to their average
overhead in all cases (as opposed to generating no overhead in most
cases and generating several bits of overhead in the "worst" cases,
which appeared in various ad hoc encoding methods discussed earlier).

For example, within this method, the king-rooks factor KR in the MP
index separates in the KR=KR0+KRC, where KR0 are the total placements
without the castle rights (regardless of the placement of kings &
rooks) and KRC are the 15 castle right cases (KQkq) for a "tiny" subset
of the placements which have one or 2 kings and one or more rooks fixed
in the right places (I said "tiny" since fixing at least 2 pieces to
make a position a castle candidate,  reduces the so constrained
placements by at least a quadratic factor in number of empty squares
NS, i.e. roughly by NS^2, where NS is at least 32, so KR/KR0 is of the
order 1.001, adding log2(1.001) = 0.001 bits to the code). Similar size
estimates apply to the ep cases, where the pawn factor in MP is
PF=PF0+PFE, and the PFE containing the EP cases, is a tiny fraction of
the uncosntrained pawn placements, PF0 (although these are somewhat more
cumbersome to classify than the castle cases). Since our full code
length was 154.1, and the 155 bit was a whole bit code size, it seems
quite likely that the ep+castle corrections wouldn't tip it over the
155 whole bit boundary (and of course, the full code would still
retain its uniform length property over all positions).

```