Computer Chess Club Archives


Subject: Re: Hash Collisions

Author: Rémi Coulom

Date: 02:59:24 02/23/99



On February 23, 1999 at 04:12:42, Dan Newman wrote:

>On February 22, 1999 at 14:12:58, Robert Hyatt wrote:
>
>>On February 22, 1999 at 11:14:14, Robert Hyatt wrote:
>>
[...]
>>
>>I ran some quick tests... to produce a minimum distance of 16, producing
>>768 random numbers (64 bits), takes 769 numbers from my Random64() function
>>in crafty.  To go to 24 jumps this to a really big number (it took my code
>>200 million random numbers to find 768 with a hamming distance >= 24 between
>>_all_ of them).
>>
>>One minor trick is that in general, if N is a good number, ~N is also good.
>>
>
>I wonder if this is the case?  Say you have a bunch of numbers A, B, C, ...
>and also ~A, ~B, ~C, ...,  then A ^ ~A == B ^ ~B == C ^ ~C == ~0.  It seems
>like a less than desirable result to get the same bit pattern from pairwise
>combinations of these numbers: you are more likely to get the same hashcode
>for different positions.
>
>What you really want to do is select a bunch of numbers that have as large
>a degree of "linear" independence as you can manage.  You get maximal
>independence when each number has a different bit turned on:
>
>    0000000000000000000000000000000000000000000000000000000000000001
>    0000000000000000000000000000000000000000000000000000000000000010
>    0000000000000000000000000000000000000000000000000000000000000100
>                              ...
>    1000000000000000000000000000000000000000000000000000000000000000
>
>but of course that leaves you with too few numbers.  So you must have
>more bits turned on.  That means that it will be possible to express some
>members of this set of numbers as a "linear" combination of other
>members:
>
>        X = A ^ C ^ F ^ G ^ ...
>
>You really want to make that as difficult as possible.  If members of the
>set of numbers have too many bits in common (close hamming distance), then
>you will have trouble -- it becomes "easier" to get the same hashcode from
>different combinations of numbers.  But the same is true if the set contains
>members that have too few bits in common (as when you include complements).
>My intuition tells me that trying for a hamming distance of 32 is ideal --
>but I don't really know...  (I use a sieve that keeps the hamming distance
>between 16 and 48 for each pairing.  It gets very slow near the end...)
[...]

I totally agree with Dan here. I also read about maximizing the minimum Hamming
distance between hash keys, but it does not sound like a good idea to me. I did
not read the ICCA Journal article mentioned earlier, but JC Weill talks about
the theory of linear error-correcting codes in his PhD thesis, and about using
Bose-Chaudhuri-Hocquenghem codes for hash keys. I spent a very long time trying
to understand what they are. I have not succeeded yet, but I have good hopes of
understanding them one day.

In the theory of error control codes, maximizing the minimum Hamming distance
between hash keys is a way to maximize the error-correction capability of a
linear code, as far as I understand it. Motivated readers can take a look at
this course, which explains the theory and practice of error control codes in
detail (especially the "Linear Block Codes" chapter):
http://schof.colorado.edu/~ecen5682/notes.htm
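
To make concrete what the "maximizers" are doing, here is a minimal sketch of
the kind of sieve Bob and Dan describe above. This is not Crafty's actual code;
rand64() is only a crude stand-in for a proper 64-bit generator such as
Random64().

    #include <stdio.h>
    #include <stdlib.h>

    #define NKEYS    768
    #define MIN_DIST 16

    typedef unsigned long long u64;

    /* crude 64-bit value built from four 15-bit rand() calls; use a
       real generator (e.g. Random64()) in practice */
    static u64 rand64(void)
    {
        return ((u64)rand() << 48) ^ ((u64)rand() << 32) ^
               ((u64)rand() << 16) ^ (u64)rand();
    }

    /* number of set bits, i.e. the Hamming distance of x from zero */
    static int popcount64(u64 x)
    {
        int n = 0;
        while (x) { x &= x - 1; n++; }   /* clear lowest set bit */
        return n;
    }

    int main(void)
    {
        u64  key[NKEYS];
        u64  cand;
        int  i, ok, n = 0;
        long tried = 0;

        while (n < NKEYS) {
            cand = rand64();
            ok = 1;
            tried++;
            /* accept the candidate only if it is at least MIN_DIST
               bits away from every key kept so far */
            for (i = 0; i < n; i++)
                if (popcount64(cand ^ key[i]) < MIN_DIST) { ok = 0; break; }
            if (ok)
                key[n++] = cand;
        }
        printf("kept %d keys out of %ld candidates\n", n, tried);
        return 0;
    }

With MIN_DIST at 16 nearly every candidate passes, as Bob observed; raising it
to 24 makes the rejection rate explode.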

But I think this is not what we really want for hash keys. For instance, these 3
hash keys have a very high minimum Hamming distance between them, but would not
be good because K2 = K0 ^ K1:
K0 : 0101010101010101
K1 : 1010101010101010
K2 : 1111111111111111

The idea of taking the binary complement of some codes sounds really bad to me.
It means that if you have K0 = ~K1 and K2 = ~K3, then K0 ^ K1 = K2 ^ K3, as Dan
wrote.
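
A throwaway snippet (in C, using the 16-bit values of the example above) shows
both problems at once: the pairwise Hamming distances are large, yet the three
keys XOR to zero, and any two complementary pairs give the same all-ones
difference.

    #include <stdio.h>

    static int popcount(unsigned x)
    {
        int n = 0;
        while (x) { x &= x - 1; n++; }
        return n;
    }

    int main(void)
    {
        unsigned k0 = 0x5555;          /* 0101010101010101 */
        unsigned k1 = 0xAAAA;          /* 1010101010101010 */
        unsigned k2 = 0xFFFF;          /* 1111111111111111 */
        unsigned a  = 0x1234, b = 0x9876;

        /* pairwise Hamming distances: 16, 8 and 8 (all large) */
        printf("d(k0,k1)=%d d(k0,k2)=%d d(k1,k2)=%d\n",
               popcount(k0 ^ k1), popcount(k0 ^ k2), popcount(k1 ^ k2));

        /* ...yet the set is linearly dependent: k0 ^ k1 ^ k2 == 0 */
        printf("k0 ^ k1 ^ k2 = %04x\n", k0 ^ k1 ^ k2);

        /* complements: a ^ ~a == b ^ ~b == all ones for any a and b,
           so two complementary pairs always produce the same value */
        printf("a ^ ~a = %04x,  b ^ ~b = %04x\n",
               (a ^ ~a) & 0xFFFF, (b ^ ~b) & 0xFFFF);
        return 0;
    }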

BCH codes look better since, according to what is written in JC Weill's PhD
thesis, they protect more bits than random codes. This means that you have to
XOR a larger set of them to get 0. I think this is equivalent to Dan's idea of
making it as difficult as possible to express some members of the set as linear
combinations of others.

Another point which I think is important to take into consideration when
generating a good set of hash keys is that positions that are close to each
other (in "chess distance") should have different hash codes. This means that
getting 0 by XORing a set of hash codes that contains 'White King on e1' and
'White King on g1' is a bigger problem than XORing 'White King on h1' with
'White King on a8', and an even bigger problem than XORing 'White King on X'
with 'White King on Y' and 'White King on Z', since this last case never occurs
in the difference between any two legal chess positions (there is only one
White King on the board!).

An idea I had for generating a good set of hash codes is to generate two
independent sets of 32-bit hash codes, each of which would be optimized by
minimizing the number of measured hash collisions in searches on representative
positions. The two best 32-bit sets would then be merged to create the final
64-bit keys. I fear that this idea is not practical, since it would take a lot
of computing power.
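
The merging step itself is cheap; the expensive part would be the collision
measurement used to select the two 32-bit sets, which is not shown in this
sketch (lowkey[] and highkey[] are hypothetical names for the two best sets
found that way):

    typedef unsigned int       u32;
    typedef unsigned long long u64;

    #define NKEYS 768

    u32 lowkey[NKEYS];    /* best 32-bit set, chosen by collision counting */
    u32 highkey[NKEYS];   /* second best 32-bit set, chosen the same way   */
    u64 zobrist[NKEYS];   /* final 64-bit keys                             */

    void merge_keys(void)
    {
        int i;
        for (i = 0; i < NKEYS; i++)
            zobrist[i] = ((u64)highkey[i] << 32) | lowkey[i];
    }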

I'd like to hear the Hamming-distance maximizers' opinion on these points.

Remi


