Computer Chess Club Archives


Subject: Re: best performance from rehashing schemes?

Author: Robert Hyatt

Date: 15:12:43 12/13/01

On December 13, 2001 at 16:35:01, Brian Richardson wrote:

>On December 13, 2001 at 15:37:14, Wylie Garvin wrote:
>
>>On December 13, 2001 at 15:05:43, Robert Hyatt wrote:
>>
>>>On December 13, 2001 at 06:07:37, Wylie Garvin wrote:
>>>
>>>>Hello,
>>>>
>>>>  Does anyone know what the best-performing rehashing scheme is?  I read
>>>>somewhere (Heinz) that multiple probes in one table perform better than separate
>>>>tables.  I've also heard that subtree size is best for choosing a node to
>>>>overwrite.  What does the board think?
>>>>
>>>>wylie
>>>
>>>
>>>Two answers.
>>>
>>>First, one based on theory.
>>>
>>>If you use the low order bits of the hash signature as the table index,
>>>then one good approach, if you want to do N probes, is to use a
>>>_different_ set of bits from the hash signature as a "stride" or
>>>"offset" (stride is the vector machine terminology most might recognize)
>>>and add this stride to the original index value N-1 times to produce N
>>>hash addresses, each separated by that same stride.
>>>
>>>This is the best theoretical solution as it greatly reduces the chances
>>>for "chains" to form in memory.
>>>
>>Right.  This sounds like the answer I was looking for.  =)  Thanks Bob!
>>
>>>But for the PC, it has a problem.  If your hash entry is 16 bytes, eight
>>>probes as above will result in 8 cache line fills (reading 32 bytes per
>>>fill) which will swamp the PC memory bandwidth and kill performance.  I
>>>used this approach on the Cray, but with its vector hardware, there was
>>>no cost whatsoever for the extra 7 loads.
>>>
>>With the P4, I think cache lines are now 64 bytes.  I don't know whether this
>>actually makes the problem worse or not.
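Just to put numbers on the bandwidth point: with 16-byte entries, 8 scattered
probes touch 8 separate cache lines, so 8 x 32 = 256 bytes of fill traffic
(8 x 64 = 512 bytes with the P4's 64-byte lines), while 8 consecutive entries
occupy only 128 bytes, i.e. 4 lines of 32 bytes or 2 lines of 64 bytes,
assuming the group does not straddle a line boundary.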
>>
>>>Another problem is that if you add some constant extracted from the original
>>>hash signature to the computed hash index, you can run off the end of the
>>>hash table.  You can fix this in either of two ways.  (1) after adding the
>>>offset, AND it with the original mask, which will "wrap it around" to the
>>>front of the table;  (2) add enough "slack" to the end of the table so that
>>>the offsets will still keep you "inside".  We had to do (2) on the Cray,
>>>because we used a "vector load" and there was no facility to "wrap around"
>>>as in (1).
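To make that concrete, here is a rough C sketch of the stride scheme with
fix (1).  The names and layout are mine, not code from Crafty or Cray Blitz,
and it assumes a power-of-two table size and a 64-bit hash signature:

  /* sketch only, not Crafty or Cray Blitz source */
  #define PROBES 8

  typedef struct {
    unsigned long long key;            /* full hash signature for verification */
    int draft, score, flags;           /* whatever else the entry carries      */
  } HASH_ENTRY;

  HASH_ENTRY *hash_probe(HASH_ENTRY *table, unsigned long long entries,
                         unsigned long long sig) {
    unsigned long long mask   = entries - 1;            /* entries is a power of 2  */
    unsigned long long index  = sig & mask;             /* low-order bits -> index  */
    unsigned long long stride = ((sig >> 32) & 15) + 1; /* other bits -> stride, nonzero */
    for (int i = 0; i < PROBES; i++) {
      HASH_ENTRY *h = table + ((index + i * stride) & mask); /* AND with mask = fix (1) */
      if (h->key == sig) return h;                      /* hash hit                 */
    }
    return 0;   /* miss; fix (2) would instead allocate (PROBES-1)*max_stride extra
                   "slack" slots at the end and drop the "& mask" wrap              */
  }

On a store you would scan the same PROBES slots and overwrite the least
valuable one (shallowest draft, or smallest subtree as mentioned at the top of
the thread).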
>>>
>>>For the PC, multiple probes may still be a good idea.  Unfortunately, most of
>>>the papers you will see leave out the _critical_ performance data.  They might
>>>say "8 probes searches a smaller tree", but what if that tree is only 10%
>>>smaller while the extra memory bandwidth slows your program by 20%?  That is
>>>a net loss.
>>>
>>>In any case, for the PC, I would avoid the random offset explained above, and
>>>simply probe N consecutive table entries.  This at least uses only 1/2 the
>>>memory bandwidth since a single cache line fill of 32 bytes will now give you
>>>two entries rather than one.
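A sketch of the consecutive version (again my own naming, reusing HASH_ENTRY
from the sketch above); rounding the start index down to a multiple of the
probe count is one way to keep the whole group inside a single cache line:

  /* sketch only: probe 4 consecutive entries that share a cache line
     (4 entries x 16 bytes = 64 bytes, or two 32-byte lines)          */
  HASH_ENTRY *hash_probe_consecutive(HASH_ENTRY *table,
                                     unsigned long long entries,
                                     unsigned long long sig) {
    /* entries is assumed to be a power of 2, at least 4 */
    unsigned long long index = (sig & (entries - 1)) & ~3ULL;  /* align to 4 entries */
    for (int i = 0; i < 4; i++)
      if (table[index + i].key == sig) return &table[index + i];
    return 0;                                                  /* miss */
  }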
>>>
>>Yes, your scheme (2) with N consecutive probes is the scheme I came up with.  I
>>didn't have a good justification for it other than cache locality.  Maybe that
>>is all the justification it needs.  As well as one cache line holding two
>>(four?) hash entries, there is another possibility: a hardware prefetching
>>mechanism can bring in the next cache line while you're still working on this
>>one.
>>
>>>Whether 8 is the right number or not should be tested.  We used 8 in Cray
>>>Blitz after testing 2, 4, 8, 16 and 32.  I use 2 in Crafty, via the Belle
>>>approach of two separate hash tables.  For the PC, the right answer probably
>>>lies somewhere between 2 and 8...
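For completeness, the Belle-style two-table arrangement looks roughly like
this (a sketch of the idea only, not Crafty's actual code): one depth-preferred
table plus one always-replace table, probe both, and keep the deeper draft in
the first.

  /* sketch only, not Crafty source: Belle-style pair of tables */
  void hash_store(HASH_ENTRY *depth_table, HASH_ENTRY *always_table,
                  unsigned long long mask, HASH_ENTRY *e) {
    HASH_ENTRY *d = depth_table + (e->key & mask);
    if (e->draft >= d->draft)
      *d = *e;                                  /* deeper (or equal) draft wins     */
    else
      always_table[e->key & mask] = *e;         /* otherwise goes to always-replace */
  }

  HASH_ENTRY *hash_probe2(HASH_ENTRY *depth_table, HASH_ENTRY *always_table,
                          unsigned long long mask, unsigned long long sig) {
    HASH_ENTRY *d = depth_table + (sig & mask);
    if (d->key == sig) return d;                /* probe 1: depth-preferred table   */
    HASH_ENTRY *a = always_table + (sig & mask);
    if (a->key == sig) return a;                /* probe 2: always-replace table    */
    return 0;                                   /* miss in both tables              */
  }

Variations exist (e.g. pushing the displaced depth-preferred entry into the
always-replace table), but that is the basic two-probe shape.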
>>
>>I think Heinz mentions in one of his papers that 4 probes worked best for
>>DarkThought.  Supposedly one table with N probes works better than N tables.
>>But here's what I wonder...if I'm going to use N consecutive entries in one
>>table, is that actually any better than N (interleaved) tables?  Intuition says
>>maybe, but not much.  Also I suppose without the vector support there's no
>>reason for N to be a power of two.
>>
>>wylie
>
>Most of the data I recall (ICCA articles) showing multi-probe advantages
>utilized relatively small hash tables (by today's standards).  I doubt that with
>upwards of even a million (never mind 16 million or more) entries multiple
>probes buy much.  I think it really depends on the program and its unique mix
>of search techniques, just as with whether q-search hashing pays off or not:
>yes for some engines, no for others.  Last time I tried various multi-probe
>schemes with Tinker (depth preferred & always replace, 2 tables, +n probes),
>all seemed to be net losers.  Of course, I might not have managed to implement
>a good multi-probe (or multi-table) approach...



For multiple probing to pay off, you need a couple of conditions to be true.

1.  Your table is not nearly large enough to hold the search space.  At 1M
nodes per second, Crafty needs about 5 megabytes of table space per second,
as spelled out below (about 1/3 of the total nodes searched are non-q-search
nodes in the opening and middlegame, and I don't hash q-search nodes at all).
For 100 seconds, that is 500 megabytes.  Doable, but large.  For long games at
180 seconds per move (on average), with peak searches of maybe 1800 seconds,
that table is going to be overwritten many times.

2.  Your multiple-probe code must not cost more time than it saves by letting
you search a smaller tree.  Remember that the goal is to find a move in the
shortest possible time, not necessarily by searching the smallest possible tree.
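
(For reference, the arithmetic behind the 5 MB/second figure in (1), assuming
the 16-byte entry size mentioned earlier in the thread: 1,000,000 nodes/sec x
1/3 non-q-search nodes x 16 bytes/entry is roughly 5.3 megabytes of new
entries per second, hence roughly 530 megabytes over a 100 second search.)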

(2) seems to have doomed my multi-probe tests.  Ernst probably tested on an
Alpha, which has a _much_ higher memory bandwidth with its 256-bit bus.  On a
PC, it is hard to imagine multi-probing being effective unless the program is
very slow to start with.


