Computer Chess Club Archives



Subject: Re: Data structures question for hash tables

Author: Robert Hyatt

Date: 05:55:39 01/21/99



On January 21, 1999 at 03:38:03, Amir Ban wrote:

>On January 20, 1999 at 20:17:04, Eugene Nalimov wrote:
>
>>On January 20, 1999 at 16:17:53, Bruce Moreland wrote:
>>
>>>
>>>On January 19, 1999 at 14:16:54, Eugene Nalimov wrote:
>>>
>>>>Linear rehash results in large clusters of non-empty elements.
>>>>The element that conflicts with *any* element in the cluster
>>>>will be added to that cluster, so a big cluster will get bigger
>>>>and bigger over time. With non-linear rehash, conflicting
>>>>elements will be added to other clusters. You can find
>>>>discussion, analysis, and experimental results in
>>>>
>>>>  D. Knuth, The Art of Computer Programming, Vol.3,
>>>>            2nd edition, pp. 526-531.
>>>>
>>>>(the data is there in the 1st edition, too, but page numbers
>>>>are different).
>>>>
>>>>At least that will happen with a classic hash table, without a limit
>>>>on the # of retries. When there is such a limit I don't know what
>>>>will happen, but I suspect that when that limit is large enough
>>>>(e.g. 8), non-linear rehash will be better (in terms of hash
>>>>table collisions), too - almost all assumptions are the same as in
>>>>the classic case.
>>>
>>>I read this after I responded to Bob.  I will assume Knuth knows what he is
>>>talking about, although I still don't understand intuitively why the degradation
>>>on *average* won't be the same in both cases, since rather than rarely hitting a
>>>really bad case, you are more commonly hitting less-bad cases.
>>>
>>>Unfortunately this won't help me either way since I don't rehash.
>>>
>>>bruce
>>
>>Ok. Here is a simplified example. Let's assume that we have a
>>large enough hash table, and there are 20 non-empty entries.
>>Now assume that we search for each of the keys an equal number
>>of times.
>>
>>Case 1. There are 10 clusters, each of length 2. Then we
>>have to probe 1.5 entries on average during a search (in
>>half of the searches we'll find the entry on the first try,
>>in the other half on the second try).
>>
>>Case 2. There is 1 cluster that contains all 20 entries.
>>Here the average number of probes is (1+2+...+20)/20 = 10.5.
>>
>>Eugene
>>
>
>This only explains why clustering is bad. It doesn't explain why doing complex
>rehashing gets less clustering than sequential rehashing (i.e. adding 1). If you
>do complex rehashing, you don't care about clusters of adjacent cells, but you
>do care about clusters that are adjacent according to your complicated formula,
>and there's no obvious reason why you would do better.



But there clearly is.  I.e., if you do simple consecutive probes, assuming you
do 8 probes to find the best entry, you end up with a bucket of 8 entries.  *Any*
probe that initially hits any of those 8 entries 'shares' all of the entries
that follow that point with the same bucket:

  X X X X X X X X
      Y Y Y Y Y Y Y Y

The first probe for Y hits the 3rd probe for X, so the two 'buckets' overlap
in 6 of their 8 entries.

If you use other bits from the hash signature to 'stride' the probes, you
would get something like this:

   X   X   X   X   X   X   X   X
           Y    Y    Y    Y    Y    Y    Y    Y

Notice how the two buckets have two entries in common and not 6.  That is the
main point.  Also notice that the two probe sequences spread things out better,
rather than clumping everything up at the same point in the table (P(hit bucket)
is higher with inc=1 than it is with inc=random).
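
To make that concrete, here is a tiny C sketch (the table size, starting
slots, and strides are made up for illustration, not taken from any real
engine) that counts how many slots two 8-probe sequences share under each
scheme:

  #include <stdio.h>

  #define TABLE_SIZE 4096
  #define PROBES     8

  /* count how many of the PROBES slots visited for key A are
     also visited for key B */
  static int overlap(unsigned startA, unsigned incA,
                     unsigned startB, unsigned incB) {
      int shared = 0;
      for (int i = 0; i < PROBES; i++) {
          unsigned slotA = (startA + i * incA) % TABLE_SIZE;
          for (int j = 0; j < PROBES; j++)
              if (slotA == (startB + j * incB) % TABLE_SIZE)
                  shared++;
      }
      return shared;
  }

  int main(void) {
      /* X starts at slot 100; Y's first probe lands on X's 3rd slot */
      printf("inc=1:       %d shared slots\n", overlap(100, 1, 102, 1));
      /* same kind of first collision, but strides 4 and 5 drawn
         from other bits of the two signatures */
      printf("stride 4/5:  %d shared slots\n", overlap(100, 4, 108, 5));
      return 0;
  }

It should report 6 shared slots for inc=1 and 2 for the distinct strides,
matching the two diagrams above.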



>
>Another way of looking at this is that adopting any deterministic rehashing
>function is equivalent to doing sequential rehashing after some mapping of the
>entries. E.g. if you rehash in increments of 3's, you may as well have
>rearranged your entries 3,6,9,... mod tablesize and done sequential rehashing.


Right, but don't do this.  Get the rehash increment from the hash signature
(a la Cray Blitz).  Then there is a good chance that two hash signatures that
match in the low-order N bits (the probe index) won't match in the rehash
increment.
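
In code, the idea looks something like this sketch (a minimal illustration,
not the actual Cray Blitz or Crafty source; the field widths and names are
assumptions):

  /* Double hashing: table index from the low bits of the signature,
     probe increment from the next few bits.  The increment is forced
     odd so it is coprime with the power-of-two table size and the
     probe sequence never revisits a slot early. */
  typedef unsigned long long u64;

  typedef struct {
      u64 key;      /* full signature, stored for verification */
      int score;
  } HashEntry;

  #define TABLE_BITS 20
  #define TABLE_SIZE (1 << TABLE_BITS)
  #define MAX_PROBES 8

  HashEntry *probe(HashEntry *table, u64 signature) {
      u64 index = signature & (TABLE_SIZE - 1);          /* low N bits */
      u64 inc   = ((signature >> TABLE_BITS) & 7) | 1;   /* odd stride 1..7 */
      for (int i = 0; i < MAX_PROBES; i++) {
          HashEntry *e = &table[(index + i * inc) & (TABLE_SIZE - 1)];
          if (e->key == signature)
              return e;                  /* found the position */
          if (e->key == 0)
              return 0;                  /* empty slot: definite miss */
      }
      return 0;                          /* probe limit reached */
  }

Two signatures that collide in the low TABLE_BITS bits will usually differ
in the increment bits, so their probe sequences separate after the first
slot instead of marching in lockstep the way inc=1 sequences do.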



>
>Obviously, if your hash function is good enough to start with, applying any
>further rehashing on it by some fixed mapping won't improve it, just like
>multiplying a random number by 17 and adding 24 doesn't make it more random.
>
>Amir

See above.


