Computer Chess Club Archives


Subject: Re: Hash tables and data-cache, some programmer stuff...

Author: Peter W. Gillgasch

Date: 04:36:26 01/18/98


>Posted by Ed Schröder on January 18, 1998 at 05:18:33:
>
>In Reply to: Re: Hash tables and data-cache, some programmer stuff... posted
>by Robert Hyatt on January 17, 1998 at 18:22:09:

>>>You seem to have missed the point I was trying to make regarding
>>>hash tables. If you use a big hash table you will access (write or
>>>read) a wide area of memory. All of this is written to the data-cache
>>>with the risk of overwriting "more important" information (such as
>>>variables often used). Using a small hash table this risk is less.
>>>Result a higher NPS.

>>I don't see how this would happen.  IE on a pentium pro, you have 256KB
>>of cache.  My hash entry is 16 bytes wide.  This means a 16K entry hash
>>table will just exactly fit in cache.
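
Just to spell out the arithmetic here: with 16-byte entries, something
like the layout below (purely illustrative, not Crafty's actual struct)
fills the 256 KB L2 exactly at 16K entries.

    /* Illustrative 16-byte hash entry, NOT Crafty's actual layout. */
    typedef struct {
        unsigned int  lock;   /* upper 32 bits of the hash key (4 bytes) */
        unsigned int  move;   /* best move found here          (4 bytes) */
        short         score;  /* stored score                  (2 bytes) */
        unsigned char depth;  /* remaining draft               (1 byte)  */
        unsigned char flags;  /* bound type / entry age        (1 byte)  */
        unsigned int  pad;    /* pad to 16 bytes               (4 bytes) */
    } HashEntry;

    /* 16K entries * 16 bytes = 256 KB, the size of the P6 L2 cache;
       32K or 64K entries already spill out of it.                   */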

Hold it. You don't think that just because one data set is as big as
the 2nd ;) level cache it stays there like you nailed it on the chip?

Fact is that everything blows away your cache: interrupts, the CPU stack,
table lookups, etc. On the Mac I see a huge difference between a system
with all kinds of INITs doing things at interrupt time (these execute
emulated, and the 68k emulator blows away my cache) and a clean system.
Virtual memory being activated slows me down as well (don't know why,
really). In the graphical version of Demon the first thing I do is to
send "quit" Apple Events to all other apps running on the machine,
including the Finder (in Unix terms, I ask all application processes,
including the shell, to kill themselves).

Big difference.

>> If you use 32K or 64K you are already
>>thrashing it badly...  I've never run with a hash that small.  I believe
>>my default is 3/4mb, which is 3x the smaller P6 cache...
>
>Just test it and you will see.
>See below.
>
>Not to mention that I was mainly talking about the first level cache
>and not about the second one :)

The 2nd level cache is much too slow. If it were not so damned firmly
seated in my Mac I'd remove it now, just to see how much it contributes
in my case. Hm. I should really try that... But I don't want to rip my
mainboard apart ;) It would be hard to get a replacement for such a
nostalgic machine ;)

>>>The second point I was trying to make is the use of multiple hash
>>>tries. If you do just one probe you save the data-cache. If you do
>>>more (I do 3) this will lower the NPS because you (without realizing
>>>it) pollute the data-cache with information that is not very useful.
>>>Result a lower NPS because now in many cases information must come
>>>from normal memory and you know how slow that is :)

>>yes.  Although if you probe consecutive addresses the speed penalty is
>>zilch.  Remember that the Pentiums all fetch 32 bytes from memory on
>>each cache miss, because they all use 32 byte cache lines.  So you
>>access any byte in memory and you stream in 32.  And there's nothing
>>that can be done.  So doing 2 consecutive probes on 16 byte entries
>>would not slow you down any more than doing only 1.  You thrash cache
>>to exactly the same extent because of the line-fill policy... 3 is
>>worse than 2, but 4 is no worse than 3.  Again assuming you are
>>accessing consecutive words.  I don't do this *yet* but I have started
>>rewriting my two-level cache already to take advantage of this
>>property of the Pentium cache...

Ok, this is interesting. When I added the xref to Demon, I saw a
slowdown of about 50% if I recall correctly (maybe it was 45%, not
sure). I rehashed like Bob did it in Cray Blitz (silly me). I wasn't
happy with that, since a 50% slowdown was in the order of 50 - 90 K on
my machine. I switched to linear probing to max out cache bandwidth and
the slowdown is now in the 20% range. I am going to run some tests now
with different numbers of rehashes (note that I don't hash anything in
the qsearch part of the tree, for obvious reasons).
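
To make clear what I mean by linear probing, here is a rough sketch
(using the illustrative HashEntry layout from above; names and details
are made up, this is not Demon's actual code): the probes walk
consecutive entries inside a small aligned bucket, so once the first
probe has pulled in a cache line the neighbours are nearly free.

    /* Rough sketch of linear probing, NOT Demon's actual code.
       With 16-byte entries, PROBES = 4 touches 64 bytes, i.e. at
       most two 32-byte Pentium cache lines.                      */
    #define PROBES 4

    HashEntry *probe_hash(HashEntry *table,
                          unsigned long size,        /* power of two */
                          unsigned long long key)
    {
        unsigned long base = (unsigned long)(key & (size - 1));
        unsigned long i;

        base &= ~(unsigned long)(PROBES - 1);   /* align bucket start */

        for (i = 0; i < PROBES; i++) {
            if (table[base + i].lock == (unsigned int)(key >> 32))
                return &table[base + i];        /* hit */
        }
        return 0;                               /* miss */
    }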

The usual setting with 8 probes gives the log file I posted some
days ago:

00:00:00  1/  1 Rf1xf4 -567 (-4.429688)
00:00:00  1/  3 Re3-d3 30 (0.234375)
00:00:00  1/  4 Re3-f3 57 (0.445313)
00:00:00  1/  5 Re3-g3 Bd2xg5 Qe2xf1 111 (0.867188)
00:00:00  2/  1 Re3-g3 Qg5xg3 -680 (-5.312500)
00:00:00  2/  2 Re3-f3 Bd7xh3 -105 (-0.820313)
00:00:00  2/  5 Qe2-c4 Rf8-f7 25 (0.195313)
00:00:00  3/  1 Qe2-c4 Kg8-h8 Re3-g3 Qg5-f5 70 (0.546875)
>> 136
00:00:00  3/ 12 Re3-g3  f4-f3 Bd2xg5 136 (1.062500)
>> 168
00:00:00  4/  1 Re3-g3    O-O 111 (0.867188)
>> 160
00:00:00  5/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Bd6xf8 223 (1.742188)
00:00:00  6/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Bd2-g5 Rf8xf1 Qe2xf1 227
(1.773438)
>> 275
00:00:01  7/  1 Re3-g3 Qg5xg3 Bd2xf4    O-O 272 (2.125000)
>> 304
00:00:03  8/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Bd6xf8 Bd2-g5 Rh4-h6
Bg5xh6 483 (3.773438)
00:00:05  9/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Bd6xf8 Bd2-g5 Rh4-h6
Bg5xh6 483 (3.773438)
>> 515
00:00:12 10/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Kg8xf8 Qe2-f3 Kf8-g7
Bd2-g5 Rh4-h6 Bg5xh6 Kg7xh6 534 (4.171875)
00:00:51 11/  1 Re3-g3  h7-h6 Rg3xg5  h6xg5 Nh1-f2  b7-b5 Nf2-d3 Rh4-h7
Nd3-e5 Bd7-e6 Ne5xc6 Be6xa2 549 (4.289063)

00:01:00 6122978 nodes NPS: 101710.598007

Here is the same with 4 probes:

00:00:00  1/  1 Rf1xf4 -567 (-4.429688)
00:00:00  1/  3 Re3-d3 30 (0.234375)
00:00:00  1/  4 Re3-f3 57 (0.445313)
00:00:00  1/  5 Re3-g3 Bd2xg5 Qe2xf1 111 (0.867188)
00:00:00  2/  1 Re3-g3 Qg5xg3 -680 (-5.312500)
00:00:00  2/  2 Re3-f3 Bd7xh3 -105 (-0.820313)
00:00:00  2/  5 Qe2-c4 Rf8-f7 25 (0.195313)
00:00:00  3/  1 Qe2-c4 Kg8-h8 Re3-g3 Qg5-f5 70 (0.546875)
>> 136
00:00:00  3/ 12 Re3-g3  f4-f3 Bd2xg5 136 (1.062500)
>> 168
00:00:00  4/  1 Re3-g3    O-O 111 (0.867188)
>> 160
00:00:00  5/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Bd6xf8 223 (1.742188)
00:00:00  6/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Bd2-g5 Rf8xf1 Qe2xf1 227
(1.773438)
>> 275
00:00:01  7/  1 Re3-g3 Qg5xg3 Bd2xf4    O-O 272 (2.125000)
>> 304
00:00:02  8/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Bd6xf8 Bd2-g5 Rh4-h6
Bg5xh6 483 (3.773438)
00:00:05  9/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Bd6xf8 Bd2-g5 Rh4-h6
Bg5xh6 483 (3.773438)
>> 515
00:00:12 10/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Kg8xf8 Qe2-f3 Kf8-g7
Bd2-g5 Rh4-h6 Bg5xh6 Kg7xh6 534 (4.171875)
00:00:51 11/  1 Re3-g3  h7-h6 Rg3xg5  h6xg5 Nh1-f2  b7-b5 Nf2-d3 Rh4-h7
Nd3-e5 Bd7-e6 Ne5xc6 Be6xa2 549 (4.289063)

00:01:00 6142560 nodes NPS: 102064.137358

Those tests have been run with 2 ** 18 entries per side.

Here is the 8 probe test with a minimal hash table (2 ** 16 entries
per side, I can't make my xref smaller than that without rewriting
things significantly):

00:00:00  1/  1 Rf1xf4 -567 (-4.429688)
00:00:00  1/  3 Re3-d3 30 (0.234375)
00:00:00  1/  4 Re3-f3 57 (0.445313)
00:00:00  1/  5 Re3-g3 Bd2xg5 Qe2xf1 111 (0.867188)
00:00:00  2/  1 Re3-g3 Qg5xg3 -680 (-5.312500)
00:00:00  2/  2 Re3-f3 Bd7xh3 -105 (-0.820313)
00:00:00  2/  5 Qe2-c4 Rf8-f7 25 (0.195313)
00:00:00  3/  1 Qe2-c4 Kg8-h8 Re3-g3 Qg5-f5 70 (0.546875)
>> 136
00:00:00  3/ 12 Re3-g3  f4-f3 Bd2xg5 136 (1.062500)
>> 168
00:00:00  4/  1 Re3-g3    O-O 111 (0.867188)
>> 160
00:00:00  5/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Bd6xf8 223 (1.742188)
00:00:00  6/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Bd2-g5 Rf8xf1 Qe2xf1 227
(1.773438)
>> 275
00:00:00  7/  1 Re3-g3 Qg5xg3 Bd2xf4    O-O 272 (2.125000)
>> 304
00:00:02  8/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Bd6xf8 Bd2-g5 Rh4-h6
Bg5xh6 483 (3.773438)
00:00:04  9/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Bd6xf8 Bd2-g5 Rh4-h6
Bg5xh6 483 (3.773438)
>> 515
00:00:12 10/  1 Re3-g3 Qg5xg3 Nh1xg3  f4xg3 Rf1xf8 Kg8xf8 Qe2-f3 Kf8-g7
Bd2-g5 Rh4-h6 Bg5xh6 Kg7xh6 534 (4.171875)
00:00:50 11/  1 Re3-g3  h7-h6 Rg3xg5  h6xg5 Nh1-f2  b7-b5 Nf2-d3 Rh4-h7
Nd3-e5 Bd7-e6 Ne5xc6 Be6xa2 549 (4.289063)

00:01:00 6146330 nodes NPS: 102070.246333

So, no big difference in this test position. I guess that means that
Chess Demon is simply not memory-bandwidth bound and that the
replacement scheme is as perfect as it can be 8^) [I worked very hard
on the replacement scheme, since the thing is far too fast for any
reasonable amount of memory to hold the hash table].
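
I won't post the actual scheme here, but just to give an idea of the
general direction, a common two-tier approach looks roughly like this
(again illustrative, not Demon's code): a depth-preferred slot protects
expensive results and an always-replace slot catches everything else.

    /* Generic two-tier replacement, NOT Demon's scheme.  Each bucket
       has a depth-preferred entry and an always-replace entry.       */
    void store_hash(HashEntry *deep, HashEntry *always,
                    const HashEntry *fresh)
    {
        if (fresh->depth >= deep->depth)
            *deep = *fresh;    /* deeper (or equal) result wins this slot */
        else
            *always = *fresh;  /* shallow results still get stored        */
    }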

>Just create a dummy 4 MB area in Crafty and then in every evaluation
>read, via the current hash-key, the first 4 bytes of the entry (so just
>one 32-bit load). Looking at this code it's cheap and you may expect no
>NPS drop at all. But the NPS will drop considerably.

Not necessarily. If Bob is thrashing his cache badly already (which I
suspect) he won't see any big slowdown. Such effects are only noticeable
if you are "close" to running in the cache. If you are really running
in the 1st level cache the effect is a total disaster.
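
For anybody who wants to try Ed's experiment, this is roughly what I
understand him to mean (a sketch only; the array size and names are
made up, it is not Ed's or Bob's code):

    /* Sketch of Ed's experiment: one 32-bit load per evaluation from a
       4 MB dummy area, indexed by the current hash key.  The load looks
       free in the source but pollutes the data cache.                   */
    #define DUMMY_BYTES (4UL * 1024 * 1024)
    #define DUMMY_WORDS (DUMMY_BYTES / sizeof(unsigned int))

    static unsigned int dummy[DUMMY_WORDS];
    static volatile unsigned int sink;      /* keep the compiler honest */

    void touch_dummy(unsigned long long hashkey)
    {
        sink = dummy[(unsigned long)(hashkey % DUMMY_WORDS)];
    }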

-- Peter


