Computer Chess Club Archives



Subject: Re: Rehashing

Author: Robert Hyatt

Date: 06:03:26 01/21/98



On January 21, 1998 at 05:10:53, Amir Ban wrote:

>On January 20, 1998 at 18:42:01, Stuart Cracraft wrote:
>
>>On January 20, 1998 at 17:55:34, Amir Ban wrote:
>>
>>>On January 20, 1998 at 16:00:31, Bruce Moreland wrote:
>>>
>>>>
>>>>On January 20, 1998 at 15:27:01, Robert Hyatt wrote:
>>>>
>>>>>yes.  it's old news.  And a data structures course would call this
>>>>>the "hash collision chain problem".  Rehashing eliminates chains, but
>>>>>produces more cache thrashing...  the latter is more important on a PC
>>>>>that already has poor memory bandwidth...
>>>>
>>>>Assuming hash keys are perfectly random, there should be no discernible
>>>>relation between distinct positions, no matter how similar they are.
>>>>
>>>>So if you hash to index I in the table, there should be no more
>>>>likelihood that there is an occupied entry at index I+1 than there is at
>>>>index I+K or I+rand(), right?  If there is complete entropy here,
>>>>chaining should be uniform, regardless of how you rehash.  So you should
>>>>be able to use I+1, since that is easiest, if you really want to rehash.
>>>>
>>>>The fact that the hash elements are next to each other in memory, and
>>>>can be accessed with an index, has nothing to do with anything, does it?
>>>> There is no more relation between I and I+1 than there is between I and
>>>>J.
>>>>
>>>>The above is my intuition.  Intuition also tells me that the world is
>>>>flat, though.
>>>>
>>>>bruce
>>>
>>>
>>>I agree with that.
>>>
>>>Intuitively, I also think that there is no scope for improvement on the
>>>linear method. Your argument makes the point. Maybe a stronger way to
>>>say it is this: Complex probing methods are equivalent to selecting a
>>>different hash function, or to rehashing the hashed value. If the
>>>hashing is good to start with, this doesn't achieve anything.
>>>
>>>A different analogy: When you encrypt a message with some encryption
>>>method, some people feel that the message would be REALLY secret if at
>>>the end you multiply everything by 7 and add 510. That's not true, of
>>>course.
>>>
>>>Amir
>>
>>I am not sure I can agree with Amir/Bruce yet.
>>
>>A single hash function that hashes to a given address, followed by
>>linear use of the next 7 slots, is going to "back up" a lot faster
>>when multiple board positions hash to the same address. With random
>>rehashing, they won't back up (as much), since the randomizer will
>>disperse them throughout the table rather than into a group of 7
>>slots after the main address.
>>
>>--Stuart
>
>That is correct, but any other way you do this you will get the same
>amount of competition for each slot, with the same random distribution.
>If you avoid overloading a slot through hash hits to the same place,
>you will find that statistically it gets equally overloaded through
>hash hits to different places.
>
>Amir


This is partially correct, but only partially.  Here is why.

Assume we use 8 probes into the hash table, and that we simply use
N, N+1, N+2, ..., N+7 as the probe addresses.  You'd like to think
that P(any entry) is 1/X, where X = number of entries, but this is
not so... and you can easily test it in your program to confirm it.

So what happens is that you probe to position P, and use the set of
8 entries there as candidates for replacement.  Any position that
hashes to P, or to P+1 through P+7, tends to interfere, because
hashing to P+2 means we use the set P+2 through P+9, which has 6
entries in common with the set for a position that hashed to P.
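The overlap is easy to quantify: with 8 linear probes, two candidate sets whose starting indices are d slots apart share 8-d entries.  A minimal sketch (the helper name is mine, not from any engine):

```c
#include <assert.h>

#define NPROBES 8

/* Number of table slots shared by the 8-slot candidate sets that start
   at index p and at index p+d, when probing linearly with stride 1.
   Illustrative helper, not from any actual engine. */
static int shared_slots(int d)
{
    return d < NPROBES ? NPROBES - d : 0;
}
```

So a position hashing to P+2 fights over 6 of the 8 slots used by a position hashing to P, exactly as described above.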

If you use random rehashing, based on extracting Y bits from the
original hash signature, you will likely get two different strides
for the positions that hashed to P and P+2.  So the position that
hashes to P would likely have the set P, P+a, P+2a, etc., while the
position that hashes to P+2 would have the set P+2, P+2+b, P+2+2b, ...
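That stride selection can be sketched as double hashing: take the probe stride from otherwise-unused high bits of the 64-bit signature, forcing it odd so that with a power-of-two table every slot stays reachable.  All names and sizes here are illustrative assumptions, not Crafty's code:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BITS 20
#define TABLE_SIZE (1u << TABLE_BITS)

/* Double hashing ("random rehashing"): the base index comes from the
   low bits of the signature, the stride from the next bits up.  The
   stride is forced odd so successive probes cycle through the whole
   power-of-two table.  Hypothetical sketch, not engine code. */
static uint32_t probe_index(uint64_t key, int attempt)
{
    uint32_t base   = (uint32_t)(key & (TABLE_SIZE - 1));
    uint32_t stride = ((uint32_t)(key >> TABLE_BITS) & (TABLE_SIZE - 1)) | 1;
    return (base + (uint32_t)attempt * stride) & (TABLE_SIZE - 1);
}
```

Two positions whose base indices differ by 2 will almost always draw different strides, so their probe sequences diverge immediately instead of overlapping in 6 of 8 slots.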

From hashing theory, this tends to eliminate "chaining", because
once you store at P and P+1, the probability of hitting that chain
is 2/N (the chain is of length 2), while the probability of hitting
any other position (assuming the table is empty except for P and
P+1) is still 1/N.  So once a chain gets established, the
probability of hitting it is higher than that of hitting any single
empty entry.

So simple linear hashing works well.  But in those cases where the
hash probes are not really uniformly distributed (and by simple
random-sampling theory, our hashing algorithm is *not* going to
distribute all hash keys uniformly), random rehashing actually does
help to distribute hash entries, even when the original hash indices
are not well distributed.
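One way to see the effect is a toy experiment, a self-contained sketch rather than code from any engine: insert the same pseudo-random keys into a small power-of-two table twice, once with stride 1 (linear probing) and once with a stride drawn from the key's high bits, and compare the total number of probes needed:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 4096u   /* power-of-two table; illustrative size */

/* Tiny deterministic PRNG (xorshift64) so the experiment repeats
   exactly; seeded identically before each run. */
static uint64_t rng_state;
static uint64_t rng(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Insert `count` random keys, returning the total probe count.
   linear=1: stride is always 1 (primary clustering builds chains).
   linear=0: stride comes from high key bits, forced odd. */
static unsigned long fill_table(int linear, unsigned count)
{
    unsigned char used[N];
    unsigned long probes = 0;
    memset(used, 0, sizeof used);
    rng_state = 0x9E3779B97F4A7C15ULL;   /* same keys for both runs */
    for (unsigned i = 0; i < count; i++) {
        uint64_t key = rng();
        uint32_t idx = (uint32_t)(key & (N - 1));
        uint32_t stride = linear ? 1 : (((uint32_t)(key >> 32) & (N - 1)) | 1);
        while (used[idx]) {
            probes++;
            idx = (idx + stride) & (N - 1);
        }
        used[idx] = 1;
        probes++;
    }
    return probes;
}
```

At high load (say 90% full) the stride-1 run should need noticeably more probes in total; that extra work is the cost of the chains the linear scheme builds.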

To test this, clear your hash table and set a counter to the number
of positions you have.  On each store, if you store over an empty
entry, increment one counter; if you store at a used position,
increment a different counter.  Compare the two against the number
of entries you have, and decide whether those keys are as uniform as
you originally thought.
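The counters might look like the sketch below.  The entry layout, field names, and `hash_store` are hypothetical stand-ins for illustration, not Crafty's actual hashing code:

```c
#include <assert.h>
#include <stdint.h>

/* Instrumented store: count stores into empty slots separately from
   stores that land on an already-used slot.  If the keys were truly
   uniform, clashes should stay near the birthday-problem prediction;
   a large clash count means the indices are not well distributed. */
typedef struct {
    uint64_t key;          /* full signature; 0 means "empty" */
    int      depth, score;
} HashEntry;

static unsigned long stores_to_empty, stores_to_used;

static void hash_store(HashEntry *table, uint32_t mask,
                       uint64_t key, int depth, int score)
{
    HashEntry *h = &table[key & mask];
    if (h->key == 0)
        stores_to_empty++;   /* fresh slot */
    else
        stores_to_used++;    /* clash: collision or overwrite */
    h->key = key;
    h->depth = depth;
    h->score = score;
}
```

After a search, comparing `stores_to_used` against the total number of stores gives the uniformity check described above.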

I ran this a couple of years ago when we had the big hashing
discussion in r.g.c.c.  The results are interesting...




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.