Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Most obscure bug *ever*

Author: Robert Hyatt

Date: 14:57:17 10/23/02

Go up one level in this thread


On October 23, 2002 at 15:25:04, Colin Frayn wrote:

>As most of you know, writing chess programs is fairly tricky.  Weird bugs arise
>to do with NULL move and hashtables that take ages to fix, especially when you
>get lots of complicated algorithms all working against each other.
>
>So imagine my horror when Carlos at Chessbrain.net told me that Beowulf was
>occasionally returning empty PV strings from searches, indicating that the
>hashtable was broken (I get the PV from the hashtable directly).  This annoyed
>me especially as I'd not heard of this problem before and I realised I must have
>broken something recently.
>
>I tried to verify this on my PC at home in Windows, but couldn't.  Carlos tested
>it and we discovered that the problem only ever occurred rarely, and only on
>non-windows boxes.  At this stage I was thinking 'compiler error?' or perhaps
>'memory leak?'  Dreaded memory leaks take ages to find.
>
>Anyway, I did a *lot* of testing, with the fabled bug proving exceptionally
>elusive.  Often it would fail for a few goes and then fix itself randomly.  One
>time I was trying to get the bug to work on this one position, and it did so for
>the majority of the day, and then the following day that position was absolutely
>fine - no probem at all, whatever I did.  I began to wonder if it was something
>to do with the random number code, which would be seeded differently every time
>the program was run.  I replaced it all.
>
>Carlos provided me with a new position that failed and so I started to try the
>debugging.  Eventually I found out that the root position was not being updated
>properly.  I altered the hash replacement scheme, altered the hash update
>scheme, changed loads of things around in the search function, decided to store
>the full 64-bit hash key instead of just a 40 bit safe key.  basically, I spent
>ages trying to work out what was wrong, but still no luck.
>
>After a *lot* of testing, I finally managed to track the bug down to the fact
>that the hash key was becoming corrupted at some point during the search.  I
>began to test the DoMove() function, and also a few other things that could have
>caused this.  We installed electric fence and checked for memory leaks.  No joy
>(*sigh of relief*).  Somehow the hash key stored in the Board structure (which
>is continuously updated during DoMove()) had become corrupted so that it didn't
>correspond to the current board position any more.
>
>Then I managed to add in some debugging code which quit as soon as the key
>stored in the Board structure was no longer correct.  I got Beo to print out the
>position.  I couldn't help but notice that in just three ply, black had made one
>move but white had somehow moved his king about 4 squares.  Castling problem?
>By this time I had also replaced the entire random number generation code, and
>added in debugging code all over the place to print out messages in case of
>errors, but I felt I was getting closer.
>
>Then I suddenly realised.  Carlos had been sending strings to the engine of the
>form;
>3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w
>
>(i.e. missing off the last two dashes.)  The full string should be
>3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w - -
>
>but of course the shortened version should still be valid.  One of the positions
>I had 'fixed' before started working mysteriously after I decided to add back in
>those two dashes just for neatness.  Another started working again for a short
>while when I cut and pasted the entire line minus the end-of-line character, but
>then failed again shortly after one I cut out the whole line including the EOL
>char again.  Of course I didn't notice these at the time because I was so fixed
>on testing my hashtable.
>
>So what was the bug?  My FEN parser screwed up when it encountered a DoS EOL
>character whilst running on a non-DoS machine.  It ended up interpreting it as a
>totally spurious castling permission, sometimes allowing players to castle when
>they shouldn't have been able to and therefore messing everything up, but
>essentially at random, because the parser was just reading in the string for the
>FEN past the end, and stopping when it came to a space character.  If it met a
>K,Q,k or q in that space it would update the castling permissions, but of course
>this was random unallocated memory, so the result of this was also essentially
>random, meaning that sometimes it failed, and sometimes it didn't.
>
>So after all of this, it was just me writing a bad parser ;)
>
>How annoying is that?  :)
>Can anyone beat this story?
>
>Cheers,
>Col


You really need a function like Crafty's "Validate()".  It checks _everything_
that is of
interest, all the bitmaps, hash signatures, etc.  It has saved me a world of
grief as anytime
I see a spurious problem, I re-compile with -DDEBUG which calls that function
anytime
something interesting is updated, and it generally pinpoints the problem
immediately...




This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.