Author: Robert Hyatt
Date: 16:54:44 01/20/99
On January 20, 1999 at 16:33:24, Robert Hyatt wrote:
>On January 20, 1999 at 16:11:54, Bruce Moreland wrote:
>
>>
>>On January 19, 1999 at 15:43:41, Robert Hyatt wrote:
>>
>>>On January 19, 1999 at 13:06:33, Bruce Moreland wrote:
>>>
>>>>
>>>>On January 19, 1999 at 08:13:52, Robert Hyatt wrote:
>>>>
>>>>>On January 18, 1999 at 19:01:01, Dann Corbit wrote:
>>>>
>>>>>(2) use some form of hash 'buckets'. IE in cray blitz,
>>>>>I took the low order N bits of the hash signature as the probe address. I took
>>>>>the next 10 bits as a 'rehash offset'. (call this RO). I then checked the
>>>>>positions at offset, offset+RO, offset+2RO, ... offset+7RO. Either approach is
>>>>>ok. The latter tends to solve the 'chaining' problem because two different
>>>>>hash keys identical in the last N bits (probe address) still may be different in
>>>>>the next 10 bits, giving a different 'bucket'.
>>>>
>>>>I have never understood this logic, and would like to discuss it.
>>>>
>>>>The situation is that you get a collision at one point in the table, and the
>>>>question is where to rehash to.
>>>>
>>>>I would like to ask why not just set RO to some constant, for instance one.
>>>>You've got a collision in a table that will tend to be full, I don't understand
>>>>why the element at some random location in the table is more likely to be empty
>>>>or unimportant than the one at the next index in the table.
>>>>
>>>> for (;;) {
>>>> if (FOverwrote(index))
>>>> return;
>>>> index = (index + 1) % tabsize;
>>>> }
>>>>
>>>>As opposed to:
>>>>
>>>> ro = (key >> N) & 2047;
>>>> for (;;) {
>>>> if (FOverwrote(index))
>>>> return;
>>>> index = (index + ro) % tabsize;
>>>> }
>>>>
>>>>I don't see what you get for your extra instructions.
>>>>
>>>>bruce
>>>
>>>
>>>This does two things... the most important of which is to eliminate 'chaining'
>>>in the probing/storing. IE if you hash to entry N in the table, and you use
>>>a constant offset, you get a series of entries N, N+R, N+2R, ..., N+7R. If
>>>another key hashes to N+2R (different low order n bits), now you use the bucket
>>>N+2R, N+3R, ..., N+9R. Notice that those two have 6 entries in common.
>>>
>>>In Cray Blitz, or in the approach commonly called 'rehashing' the first bucket
>>>is still N, but now R is a variable extracted directly from the hash key. So
>>>we probe at N, N+R1, N+2R1 ... N+7R1, and for the second try, we go to N+2R1
>>>(I will stick with the case from above where the second entry maps to N+2R1), but we now
>>>probe N+2R1, N+2R1+R2, N+2R1+2R2, ..., N+2R1+7R2, which only has the first
>>>entry in common, unless R1=R2 which is unlikely since 0 <= R1 <= 1023 in our
>>>hashing scheme.
>>>
>>>The second benefit (mainly on a Cray) is that the probes above are done as
>>>vector operations. And now the two vector loads have different 'strides' which
>>>means we don't line two processors up on the same group of memory banks (the
>>>cray can use _huge_ levels of interleaving as in 256 banks and up) which means
>>>this is significantly faster due to fewer bank conflicts between the two
>>>processors. This turns out to be a really significant thing on a Cray, and was
>>>the _main_ reason we did it.
>>>
>>>Lastly it does a better job of spreading hash entries around in the table as the
>>>first point shows, which helps make sure all the table gets used.
>>>
>>>On a PC this is not so great. The best plan there is to test 8 consecutive
>>>entries because of how cache is burst-loaded 32 bytes at a time. The old CB
>>>approach results in 8 x 32-byte cache line fills (8 entries, even though each
>>>entry was only 12 bytes long), while the more 'PC-ish' approach would reference
>>>8 entries in 96 consecutive bytes, which is only 3 cache line fills. Almost 3x
>>>faster.
>>>
>>>Hope that helps...
>>
>>I still don't understand why this does anything. If you are standing in the
>>middle of a crowd and throw a rock in the air, you are going to hit someone in
>>the head. I don't see why there is any less chance of hitting someone if you
>>throw it so it lands ten feet from you or a hundred.
>>
>>I don't see why "next to" is a more interesting relationship in a hash table
>>than "N away from" is.
>>
>>If you want to talk to someone, and you don't care who, you can pick a phone
>>number at random out of the phone book and call. If they don't answer the
>>phone, I don't know why you'd be more or less likely to get someone to answer
>>next time if you pick the next entry or if you pick from a random spot in the
>>book.
>>
>>bruce
>
>You missed the most important point: If I access a group of entries with
>stride=N, and on another processor I use the same stride, I get lots of memory
>bank conflicts that slow the cray down a bunch. If I use stride N on one cpu
>and M (where M != N) then I don't hit the same memory bank very often and get
>fewer conflicts and higher performance.
>
>Also, you stand a better chance of filling the table completely if, when you
>probe to position N, some probe sequences are N+X, N+2X, and so forth, while
>others are N+Y, N+2Y, and so forth. Not much of an issue with 2 entries, or
>even with 8, but for classic hashing it is better...
Now that I am out of class for the day, I can finish my answer. The idea of
'rehashing' comes from the old days, when hashing meant a way to distribute
data over a finite-sized table _without_ losing anything. That means that
when this was used, we _had_ to store the new data, and we could _not_
overwrite anything already there (when the table became nearly full, we were
in trouble, but this is _not_ about chess; more on that later). So when
you probe to a random entry and it is already used, you _must_ try another
address. If you use sequential indexing, you create 'clusters' or 'chains',
and those are bad: if you have a chain of N entries, the probability of the
next probe hitting that chain is N times as great as the chance of hitting
any other single position, and every time you 'hit' it, you extend the chain.
Now, how to use this. I wrote a clinic system (A/R) back in the 80's that
had to run on a Z80 micro (8-bit CPU, 64KB memory) with a 40MB hard disk.
There was no memory for a fancy indexed sequential file access method, so I
just used random hashing to convert a medical record number to a record
number in the random-access file. After trying sequential probes for
collisions, I ended up using random rehashing to spread the data over the
entire file without creating long 'chains'. That software is _still_ running
in one clinic in Mississippi... 40MB disk, 20,000 patients, 8 doctors, seeing
about 200 patients every single day... And the last time they compared notes,
we were faster than any machine they brought in to compare against. :)
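Coming back to the 'PC-ish' variant mentioned in the quoted text above (probe
8 consecutive entries so the whole bucket is burst-loaded in a few cache
lines), a similarly rough sketch might look like this; the 12-byte packed
entry and the names (Entry, probe_bucket) are again my own assumptions for
illustration, not Crafty's actual layout:

  #include <stdint.h>

  #define TABLE_BITS 20
  #define TABLE_SIZE (1u << TABLE_BITS)

  #pragma pack(push, 1)
  typedef struct {
    uint64_t key;                   /* 8 bytes of signature             */
    int32_t  data;                  /* 4 bytes: packed score/move/draft */
  } Entry;                          /* 12 bytes per entry               */
  #pragma pack(pop)

  static Entry table[TABLE_SIZE + 8];  /* pad so index+7 is always in range */

  const Entry *probe_bucket(uint64_t key) {
    uint32_t index = (uint32_t)(key & (TABLE_SIZE - 1));
    for (int i = 0; i < 8; i++) {       /* 8 adjacent entries = one bucket */
      if (table[index + i].key == key)
        return &table[index + i];
    }
    return 0;                           /* not found */
  }

The 8 entries here span 96 consecutive bytes, so with 32-byte cache lines the
whole bucket costs about three line fills instead of eight.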