Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: CilkChess question for Don

Author: Robert Hyatt

Date: 18:43:25 01/31/99

Go up one level in this thread


On January 31, 1999 at 16:54:55, Don Dailey wrote:

>On January 31, 1999 at 11:27:23, Robert Hyatt wrote:
>
>>Don...  was cleaning out some files on Friday, and I found something about
>>cilkchess I had printed somewhere in the past, a discussion about cilkchess
>>and its implementation details on a SMP platform.
>>
>>One thing stood out like a sore thumb that I wanted to ask about...  You
>>mentioned that you don't 'lock the hash table at all'.  And I'm wondering how
>>you make that work.
>>
>>IE suppose you have a simple implementation that requires two words, one for
>>the 'signature' and one for the 'value'.  Since this is two words long, it
>>takes two instructions to store the entry and two to access the entry. What
>>do you do when the 'reader' reads the first word, but gets blocked by an
>>interrupt (or bus conflict or whatever) so that the 'writer' stores a new
>>signature _and_ value?  Because now the reader will get the second word and
>>it does _not_ go with the first.  So you get a wrong value.
>>
>>Harry and I spent several weeks debugging this in Cray Blitz way back, as I
>>was getting 'bad move from hash table' messages regularly, but only when
>>using more than one processor, so I was _certain_ it wasn't a real hash
>>signature collision.  We tracked it down to the above problem, and put a
>>'lock' around the few instructions that load or store.  IE here is what I do in
>>Crafty now:
>>
>>HashProbe:
>>
>>  Lock(lock_hasha);
>>  word1=htable->word1;
>>  word2=htable->word2;
>>  UnLock(lock_hasha);
>>
>>'word1 and word2' are local, while htable-> points to a globally shared hash
>>table.
>>
>>HashStore:
>>
>>    Lock(lock_hashb);
>>    htableb->word1=word1;
>>    htableb->word2=word2;
>>    UnLock(lock_hashb);
>>
>>If I remove the locks, I get _serious_ problems, because the signatures will
>>match, but the 'best move' will fail validation.  Which also means that the
>>score, the draft, the threat flags, etc are _all_ wrong for this position.
>>
>>Do you fix this?  Do you even detect this?  Or do you just ignore it, which
>>would seem to be impossible based on tests with CB and Crafty???
>>
>>Bob
>
>HI Bob,
>
>This one has been bothering me for a long time.  But we do absolutely
>nothing.  We are completely aware of the possible difficulties but I
>have never traced any problems to it.
>

do you grab the hash move and verify it?  This is where I found the problem
in Cray Blitz (I never had it in Crafty because after the very difficult task
of debugging that little oversight, I have never forgotten to do the lock
again....


>I did do a test to give me at lease some confidence that it was ok
>to do this for the time being.  I ran a test suite of problems for
>many repetitions.   I used about 20 problems,  let them run for about
>2 days and this represented a LOT of iterations running the same
>problems.   The problems were always solved on the same depth although
>occasionally the scores varied very slightly (but this is common with
>a parallel program.)  That is except for a single problem which was
>known anyway to have flaky behavior on that particular version of
>Cilkchess.  I even tested this by doing a simple test with a different
>hash key generator (different table of random numbers.)


I have no idea how many such glitches it takes to change a move at the root.
And I didn't catch the bug that way, I caught it because the hash table move
kept failing to 'verify' and I couldn't see how that could possibly happen
unless we were getting two identical hash signatures for two different
positions.  My analysis convinced me this wasn't happening, because I could
search forever with 1 processor with no problems, but with more than 1, it would
produce errors, more processors, more errors...




>
>I have never felt very comfortable with this and am surprised that I
>don't see a lot more trouble.  Mabye Cilkchess is 100 points weaker
>because of this I don't really know.   It's silly that we haven't
>just done the lock because it's trivial to implement.  The issue of
>course is will there be a substantial performance penalty.  There
>may be when we use a lot of processors.   I may see more problems
>when I use more processors too.
>
>What have you noticed concerning performance penalties?   I know you
>do not hash into the quies search so it might be less of an issue
>for you.  We can always back off an extra level or two to make up
>for the performance hit if it's substantial.


I am doing a 'lock' followed by four consecutive 32 bit load instructions,
followed by an unlock.  Or a lock followed by four consecutive store
instructions, followed by the unlock.  I don't notice _any_ performance issue
here at all, considering that these machines (my xeon/400) has a clock cycle
of 2.5ns, can execute 2 instructions per cycle best case, which gives me a
potential of 800M instructions per second.  I am doing about 200K nodes per
second, which translates into about 4000 instructions/node.  Those 4 load or
store instructions are a small fraction, so that serialization doesn't hurt,
until maybe large numbers of processors... but then I would probably go to a
lock for 'chunks' of the hash table rather than one lock for each of the two
tables I have...




>
>When I do the lock test I'll inform you of the results.  What do you
>actually do to observe this problem?  Do you just put in code that
>observes the conflict?


I can comment out the lock/unlocks and I get lots of 'bad move from hash
table' errors when the hash move fails validation...





>
>There is also another thing that might be a factor.  The real issue
>when it come to any hash table conflict should not be how often it
>affects some node, but how often does it change the root move?  We
>figured out that this is worth several bits.   I did a 32 bit key
>version of a previous program and compared it to a 64 bit key version
>by a lot of self test games and found that I could not observe that
>the program was weaker despite the fact that it was computed that we
>should be getting an illegal match at least once every second or two.


While I agree, are you willing to put the final round of a world championship
game in the hands of "Murphy"???   Not me.. :)



>
>It may be similar with this Bob.  Pehaps we are getting a relatively
>high number of these bad matches but maybe it has relatively little
>affect on the performance?   If this is true, then we might actually
>take a significant performance penalty for "fixing" the problem!
>
>It's an ugly thought, isn't it?   I think this might be a case where
>peace of mind is more important than a couple of rating points!
>
>
>
>- Don



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.