Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: hash table size - is a power of 2 still an advantage these days?

Author: Dieter Buerssner

Date: 14:32:16 09/24/03

Go up one level in this thread


On September 24, 2003 at 16:58:20, Gerd Isenberg wrote:

>Yes, but with Dieter's trick, one 32*32 = 64-bit mul is good enough with
>appropriate table sizes. If you get some speed for nothing, depending on your
>average cycles per node, and whether you probe in QSearch or not:
>
> UINT64	idx = (UINT64)tableSize * (hashKey64 & 0xffffffff);
> return (UINT)(idx >> 32);
>
>DIV - MUL latency (32-bit):
>Athlon:  40 cycles versus 6 (both vector path).
>Opteron: 39 cycles (vector path) versus 3 cycles (direct path).
>AND is one cycle on both.

Gerd, thanks for advertizing my idea :-) One may add, that one can get another
cycle in many cases, compared to the and method. Robert mentioned 48 byte
entries. This is not a power of two. So for the actual adress calculation, at
least one additional instruction will be needed (multiplying be the factor 3
that is one prime factor of 48). This factor can be multiplied into the
tableSize above in advance, and when using a macro (probably it is not easy with
inline here), that addional step will be avoided.

Somthing like

  struct hash_entry *p = (struct hash_entry *)((unsigned char *)(base_adress +
(size_t)((UINT64)tableSize_times_48 * (hashKey64 & 0xffffffff))));

The casts look horrible, but are Standard C. A decent compiler on x86 should
create one mul instruction, fetch edx of the result, and add it to base_adress
(or use an indexed adressing mode). Hidden in a macro, the code will not look
horrible (and as another advantage, you can easily change the macro, to try
different methods).

About the speed differences: I found practically none between modulo (on a 32
bit part of the hash_key, never tried it on 64 bits), mul and the & (2**k-1)
approach. Modulo really seemed a very little bit slower, but around the noise
level. Also, for large hash tables, we can expect over 500 cycles for the main
memory accesses. If we compare this to the instruction timings you have given,
and also take into account, that other parts of the engine will most probably
take significantly longer than the time in Hash Probe/Store functions, one
should conclude, that this is really nothing to worry about.

On architectures, where there mod is no upcode for mod, it might be a measurable
difference, but I guess still small on any slightly fast computer. When there is
no mul instruction either ...

Regards,
Dieter



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.