Author: Eugene Nalimov
Date: 13:07:54 01/21/99
On January 21, 1999 at 03:38:03, Amir Ban wrote:
>On January 20, 1999 at 20:17:04, Eugene Nalimov wrote:
>
>>On January 20, 1999 at 16:17:53, Bruce Moreland wrote:
>>
>>>
>>>On January 19, 1999 at 14:16:54, Eugene Nalimov wrote:
>>>
>>>>Linear rehash results in large clusters of non-empty elements.
>>>>The element that conflicts with *any* element in the cluster
>>>>will be added to that cluster, so big cluster will be bigger
>>>>and bigger over the time. With non-linear rehash conflicting
>>>>elements will be added to other clusters. You can find
>>>>discussion, analysis, and experimental results in
>>>>
>>>> D. Knuth, The Art of Computer Programming, Vol.3,
>>>> 2nd edition, pp. 526-531.
>>>>
>>>>(the data is there in the 1st edition, too, but page numbers
>>>>are different).
>>>>
>>>>At least that will happen with classic hash table, without limit
>>>>on the # of retries. When there is such a limit I don't know what
>>>>will happen, but suspect that when that limit is large enough
>>>>(e.g. 8), non-linear rehash will be better (in the terms of hash
>>>>table collisions), too - almost all assumptions are the same as in
>>>>the classic case.
>>>
>>>I read this after I responded to Bob. I will assume Knuth knows what he is
>>>talking about, although I still don't understand intuitively why the degradation
>>>on *average* won't be the same in both cases, since rather than rarely hitting a
>>>real bad case, you are more commonly hitting less bad cases.
>>>
>>>Unfortunately this won't help me either way since I don't rehash.
>>>
>>>bruce
>>
>>Ok. Here is simplified example. Let's assume that we have
>>large enough hash table, and there are 20 non-empty entries.
>>Now assume that we search each of the keys equal number of
>>times.
>>
>>Case 1. There are 10 clusters, each of length 2. Then we
>>have to probe 1.5 entries on average during the search (in
>>half of the probes we'll find the entry on the first try,
>>in half - on the second try).
>>
>>Case 2. There is 1 cluster that contains all 20 entries.
>>Here average number of probes is 10.5.
>>
>>Eugene
>>
>
>This only explains why clustering is bad. It doesn't explain why doing complex
>rehashing gets less clustering than sequential rehashing (i.e. adding 1). If you
>do complex rehashing, you don't care about clusters of adjacent cells, but you
>do care about clusters that are adjacent according to your complicated formula,
>and there's no obvious reason why you would do better.
>
>Another way of looking at this is that adopting any deterministic rehashing
>function is equivalent to doing sequential rehashing after some mapping of the
>entries. E.g. if you rehash in increments of 3's, you may as well have
>rearranged your entries 3,6,9,... mod tablesize and done sequential rehashing.
>
>Obviously, if your hash function is good enough to start with, applying any
>further rehashing on it by some fixed mapping won't improve it, just like
>multiplying a random number by 17 and adding 24 doesn't make it more random.
>
>Amir
Yes, adding 3 (or 5, or any predefined constant) instead of 1 will
not help to avoid clustering. But adding a value that is unique for
each hash key will.
So again: we evaluated the hash key and used its low M bits as the
hash table index. It turned out that this hash entry is already
occupied by another value. What do we do now?
If we just search sequentially until we find our value or an empty
slot, then we will *always* add the new value to that particular
cluster. A large cluster will grow faster than a smaller one because
(1) more values conflict with the large cluster, and
(2) an element that conflicts with *any* value in the cluster will be
    added to that cluster.
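In code it looks something like this (just a rough sketch in C; the
table layout and the names are mine, purely for illustration):

#include <stdint.h>

#define TABLE_BITS 20                    /* M = 20 in this sketch          */
#define TABLE_SIZE (1u << TABLE_BITS)
#define TABLE_MASK (TABLE_SIZE - 1)      /* keeps the low M bits           */

typedef struct {
    uint64_t key;     /* full hash key; 0 means "empty" in this sketch */
    int      depth;   /* payload, not important here                   */
} Entry;

static Entry table[TABLE_SIZE];

/* Linear rehash: a key that collides with *any* slot of an occupied run
   walks to the end of that run and extends it, so long runs keep growing. */
void store_linear(uint64_t key, int depth)
{
    uint32_t i = (uint32_t)(key & TABLE_MASK);   /* low M bits = index */
    while (table[i].key != 0 && table[i].key != key)
        i = (i + 1) & TABLE_MASK;                /* step is always +1  */
    table[i].key   = key;
    table[i].depth = depth;
}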
When you use the high N bits of the hash key as the rehash constant,
there is a large enough chance that the new value will be added to
some other cluster. Of course there are still secondary clusters, but
the overall performance - in terms of hash table conflicts - is
better. For details look into Knuth, vol. 3 (I gave the exact
reference in this thread some time ago).
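A sketch of the same store, but with the rehash step taken from the
high N bits of the key (it reuses the table layout from the sketch
above; the "| 1", which keeps the step odd and therefore coprime with
the power-of-two table size, is my own illustration, not the only way
to do it):

/* Two keys that collide on the initial slot will usually have different
   steps, so they walk the table along different probe sequences.        */
void store_keyed(uint64_t key, int depth)
{
    uint32_t i    = (uint32_t)(key & TABLE_MASK);             /* low M bits  */
    uint32_t step = (uint32_t)(key >> (64 - TABLE_BITS)) | 1; /* high N bits */
    while (table[i].key != 0 && table[i].key != key)
        i = (i + step) & TABLE_MASK;             /* key-dependent step */
    table[i].key   = key;
    table[i].depth = depth;
}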
Of course that applies to "classic" hash tables. Hash tables that are
used in chess programs differ in many ways, e.g.
(1) You can more or less painlessly overwrite the old data.
(2) There is some limit on the # of searched entries - that limits
    cluster growth (see the sketch after this list).
(3) Instead of just 2 values - "hash entry is unused and useless" and
    "hash entry is valuable" - there is a wider scale: "unused, depth
    0, depth 1, ..., depth MAXPLY".
(4) During games with longer time controls the hash table is 100% full.
(5) The hash table is huge (a typical chess program fits into the L2
    processor cache, but the hash table occupies *all* machine memory),
    so any non-sequential probes are very expensive in terms of CPU
    clocks.
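To make (2) and (3) concrete, here is a sketch of a bounded probe with
depth-preferred overwriting, again reusing the table layout from the
first sketch; MAX_PROBES and the replacement rule are my assumptions,
not taken from any particular program:

#define MAX_PROBES 8    /* cap on the # of searched entries, point (2) */

void store_bounded(uint64_t key, int depth)
{
    uint32_t i      = (uint32_t)(key & TABLE_MASK);
    uint32_t victim = i;
    for (uint32_t n = 0; n < MAX_PROBES; n++) {
        uint32_t j = (i + n) & TABLE_MASK;
        if (table[j].key == 0 || table[j].key == key) {
            victim = j;                    /* empty slot or same position   */
            break;
        }
        if (table[j].depth < table[victim].depth)
            victim = j;                    /* remember the shallowest entry */
    }
    table[victim].key   = key;             /* overwrite it, point (1)       */
    table[victim].depth = depth;
}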
The question "are those differencies serious enough, so chess programs
can succesfully use rehashing algorithms that are inferior in classic
hash tables applications" is not trivial. I suspect that answer should
be "it depends on the program and/or machine architecture".
Eugene