Author: Don Dailey
Date: 13:00:15 01/21/98
Stuart, I think the best way to resolve this is to try both ways. These issues can sometimes get very, very tricky and non-intuitive. For instance, consider:

1) With linear probing, there is a definite upper bound on the number of probes you will have to apply to find an empty location (assuming it exists).

2) With re-hashing, any number of probes less than INFINITE is possible (but not INFINITE itself!)

This simple reasoning tells us the two are not equivalent. Here at MIT I have access to lots of experts on this subject (including Ron Rivest and Charles Leiserson) and will try to pick their brains on it.

- Don

On January 21, 1998 at 09:03:26, Robert Hyatt wrote:

>On January 21, 1998 at 05:10:53, Amir Ban wrote:
>
>>On January 20, 1998 at 18:42:01, Stuart Cracraft wrote:
>>
>>>On January 20, 1998 at 17:55:34, Amir Ban wrote:
>>>
>>>>On January 20, 1998 at 16:00:31, Bruce Moreland wrote:
>>>>
>>>>>On January 20, 1998 at 15:27:01, Robert Hyatt wrote:
>>>>>
>>>>>>Yes, it's old news. A data structures course would call this
>>>>>>the "hash collision chain problem". Rehashing eliminates chains, but
>>>>>>produces more cache thrashing; the latter is more important on a PC
>>>>>>that already has poor memory bandwidth.
>>>>>
>>>>>Assuming hash keys are perfectly random, there should be no discernible
>>>>>relation between distinct positions, no matter how similar they are.
>>>>>
>>>>>So if you hash to index I in the table, there should be no more
>>>>>likelihood that there is an occupied entry at index I+1 than there is at
>>>>>index I+K or I+rand(), right? If there is complete entropy here,
>>>>>chaining should be uniform, regardless of how you rehash. So you should
>>>>>be able to use I+1, since that is easiest, if you really want to rehash.
>>>>>
>>>>>The fact that the hash elements are next to each other in memory, and
>>>>>can be accessed with an index, has nothing to do with anything, does it?
>>>>>There is no more relation between I and I+1 than there is between I
>>>>>and J.
>>>>>
>>>>>The above is my intuition. Intuition also tells me that the world is
>>>>>flat, though.
>>>>>
>>>>>bruce
>>>>
>>>>I agree with that.
>>>>
>>>>Intuitively I also think that there is no scope for improvement on the
>>>>linear method. Your argument makes the point. Maybe a stronger way to
>>>>say it is this: complex probing methods are equivalent to selecting a
>>>>different hash function, or to rehashing the hashed value. If the
>>>>hashing is good to start with, this doesn't achieve anything.
>>>>
>>>>A different analogy: when you encrypt a message with some encryption
>>>>method, some people feel that the message would be REALLY secret if at
>>>>the end you multiply everything by 7 and add 510. That's not true, of
>>>>course.
>>>>
>>>>Amir
>>>
>>>I am not sure I can agree with Amir/Bruce yet.
>>>
>>>A single hash function used once that hashes to a given
>>>address and then uses the next 7 slots linearly is going
>>>to "back up" a lot faster when multiple board positions hash
>>>to the same address. With random rehashing, they won't back
>>>up (as much), since the randomizer will disperse them throughout
>>>the table and not just in a group of 7 slots after the main address.
>>>
>>>--Stuart
>>
>>That is correct, but any other way you do this you will get the same
>>amount of competition for each slot with the same random distribution.
>>If you avoid the overloading of a slot through hash hits to the same
>>place, you will find that statistically it gets equally overloaded
>>through hash hits to different places.
>>
>>Amir
>
>This is partially correct, but only partially. Here is why.
>
>Assume we use 8 probes into the hash, and that we simply use N, N+1,
>N+2, ..., N+7 as the probe addresses. You'd like to think that
>P(any entry) is 1/X, where X = number of entries, but this is not so,
>and you can easily test it in your program to confirm it.
>So what happens is that you probe to position P, and use the set of 8
>entries there as candidates for replacement. Any position that hashes
>to P, or to P+1,2,3,4,5,6,7, tends to interfere, because a position
>hashing to P+2 uses the set P+2,3,4,5,6,7,8,9, which has 6 slots in
>common with the position that hashed to P.
>
>If you use random rehashing, based on extracting Y bits from the
>original hash signature, you will likely get two different strides for
>the positions that hashed to P and P+2. So the position that hashes to
>P would likely have the set P, P+a, P+2a, etc., while the position that
>hashes to P+2 would have the set P+2, P+2+b, P+2+2b, ...
>
>From hashing theory, this tends to eliminate "chaining", because once
>you have stored at P and P+1, the probability of hitting that "chain"
>is 2/N, because that chain is of length 2, while the probability of
>hitting any other position (assuming the table is empty except for P
>and P+1) is still 1/N. So once a chain gets established, the
>probability of hitting it is higher than that of hitting any single
>empty entry.
>
>So simple linear hashing works well, but in those cases where the hash
>probes are not really very uniformly distributed (and by simple
>deduction our hashing algorithm is *not* going to uniformly distribute
>all hash keys, simply based on random sampling theory), random
>rehashing actually does help to distribute hash entries, even when the
>original hash indices are not well distributed.
>
>To test: clear your hash, set a counter to the number of positions you
>have, and on each probe, if you store into an empty entry, increment
>one counter; if you probe at a used position, increment a different
>counter. Compare the two against the count of entries you have, and
>decide whether those keys are as uniform as you originally thought.
>
>I ran this a couple of years ago when we had the big hashing discussion
>in r.g.c.c. The results are interesting...
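[Editor's note: the experiment Bob describes is easy to run outside a chess engine. Below is a minimal sketch, not from the thread; the function name and parameters are mine. It fills a power-of-two open-addressing table with uniform random 64-bit keys (a stand-in for Zobrist signatures), once with linear probing (stride 1) and once with double hashing (an odd stride drawn from the high key bits, which is always coprime with a power-of-two table size), and compares the average number of probes per insertion.]

```python
import random

def avg_probes(table_bits, load, double_hash, seed=1):
    """Fill a 2**table_bits open-addressing table to the given load
    factor and return the average number of probes per insertion."""
    random.seed(seed)
    size = 1 << table_bits
    table = [None] * size
    total = 0
    n = int(load * size)
    for _ in range(n):
        key = random.getrandbits(64)      # stand-in for a Zobrist signature
        idx = key & (size - 1)            # low bits pick the home slot
        if double_hash:
            # odd stride from the high bits => coprime with a 2**k table,
            # so the probe sequence visits every slot
            stride = ((key >> 32) & (size - 1)) | 1
        else:
            stride = 1                    # plain linear probing
        probes = 1
        while table[idx] is not None:
            idx = (idx + stride) & (size - 1)
            probes += 1
        table[idx] = key
        total += probes
    return total / n

# At 90% load, linear probing averages noticeably more probes per
# insertion than double hashing, even though every key is uniform.
linear = avg_probes(14, 0.90, double_hash=False)
double = avg_probes(14, 0.90, double_hash=True)
```

With perfectly uniform keys, the linear average still comes out well above the double-hashing one at high load, which is the primary-clustering effect Bob describes: Bruce's "no relation between I and I+1" intuition is true slot by slot, but a stored chain at P, P+1 is twice as likely to be extended as any single empty slot is to be filled.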
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.