Computer Chess Club Archives


Subject: Re: Most obscure bug *ever*

Author: Frank Phillips

Date: 13:54:18 10/23/02


On October 23, 2002 at 15:25:04, Colin Frayn wrote:

>As most of you know, writing chess programs is fairly tricky.  Weird bugs arise
>to do with NULL move and hashtables that take ages to fix, especially when you
>get lots of complicated algorithms all working against each other.
>
>So imagine my horror when Carlos at Chessbrain.net told me that Beowulf was
>occasionally returning empty PV strings from searches, indicating that the
>hashtable was broken (I get the PV from the hashtable directly).  This annoyed
>me especially as I'd not heard of this problem before and I realised I must have
>broken something recently.
>
>I tried to verify this on my PC at home in Windows, but couldn't.  Carlos tested
>it and we discovered that the problem only ever occurred rarely, and only on
>non-windows boxes.  At this stage I was thinking 'compiler error?' or perhaps
>'memory leak?'  Dreaded memory leaks take ages to find.
>
>Anyway, I did a *lot* of testing, with the fabled bug proving exceptionally
>elusive.  Often it would fail for a few goes and then fix itself randomly.  One
>time I was trying to get the bug to work on this one position, and it did so for
>the majority of the day, and then the following day that position was absolutely
>fine - no problem at all, whatever I did.  I began to wonder if it was something
>to do with the random number code, which would be seeded differently every time
>the program was run.  I replaced it all.
>
>Carlos provided me with a new position that failed and so I started to try the
>debugging.  Eventually I found out that the root position was not being updated
>properly.  I altered the hash replacement scheme, altered the hash update
>scheme, changed loads of things around in the search function, decided to store
>the full 64-bit hash key instead of just a 40-bit safe key.  Basically, I spent
>ages trying to work out what was wrong, but still no luck.
>
>After a *lot* of testing, I finally managed to track the bug down to the fact
>that the hash key was becoming corrupted at some point during the search.  I
>began to test the DoMove() function, and also a few other things that could have
>caused this.  We installed electric fence and checked for memory leaks.  No joy
>(*sigh of relief*).  Somehow the hash key stored in the Board structure (which
>is continuously updated during DoMove()) had become corrupted so that it didn't
>correspond to the current board position any more.
>
>Then I managed to add in some debugging code which quit as soon as the key
>stored in the Board structure was no longer correct.  I got Beo to print out the
>position.  I couldn't help but notice that in just three ply, black had made one
>move but white had somehow moved his king about 4 squares.  Castling problem?
>By this time I had also replaced the entire random number generation code, and
>added in debugging code all over the place to print out messages in case of
>errors, but I felt I was getting closer.
>
>Then I suddenly realised.  Carlos had been sending strings to the engine of the
>form:
>3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w
>
>(i.e. missing off the last two dashes.)  The full string should be
>3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w - -
>
>but of course the shortened version should still be valid.  One of the positions
>I had 'fixed' before started working mysteriously after I decided to add back in
>those two dashes just for neatness.  Another started working again for a short
>while when I cut and pasted the entire line minus the end-of-line character, but
>then failed again shortly after I cut out the whole line including the EOL
>char again.  Of course I didn't notice these at the time because I was so fixed
>on testing my hashtable.
>
>So what was the bug?  My FEN parser screwed up when it encountered a DOS EOL
>character whilst running on a non-DOS machine.  It ended up interpreting it as a
>totally spurious castling permission, sometimes allowing players to castle when
>they shouldn't have been able to and therefore messing everything up, but
>essentially at random, because the parser was just reading in the string for the
>FEN past the end, and stopping when it came to a space character.  If it met a
>K, Q, k or q in that space it would update the castling permissions, but of course
>this was random unallocated memory, so the result of this was also essentially
>random, meaning that sometimes it failed, and sometimes it didn't.
>
>So after all of this, it was just me writing a bad parser ;)
>
>How annoying is that?  :)
>Can anyone beat this story?
>
>Cheers,
>Col
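
For illustration, the parser failure described in the quoted post can be
sketched in C.  This is not Beowulf's code; the function names and the flag
values below are hypothetical.  The idea is only that the castling field is
optional, and that the scan stops at NUL, CR or LF as well as at a space, so it
never reads past the end of the string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical castling-right flags; the names are illustrative only. */
enum { WK = 1, WQ = 2, BK = 4, BQ = 8 };

/*
 * Parse the optional castling field.  `p` points just past the
 * side-to-move field.  Returns 0 when the field is missing or is "-".
 */
int parse_castling(const char *p)
{
    int rights = 0;

    /* Skip separators; the loop cannot run past the terminating NUL. */
    while (*p == ' ' || *p == '\t')
        p++;

    /* A DOS "\r\n", a bare "\n" or the NUL all end the field. */
    for (; *p != '\0' && *p != '\r' && *p != '\n' && *p != ' '; p++) {
        switch (*p) {
        case 'K': rights |= WK; break;
        case 'Q': rights |= WQ; break;
        case 'k': rights |= BK; break;
        case 'q': rights |= BQ; break;
        default:  break;  /* '-' or anything unexpected grants nothing */
        }
    }
    return rights;
}

/* Skip the board and side-to-move fields, then read the castling field. */
int fen_castling(const char *fen)
{
    const char *p;

    assert(fen != NULL);
    p = fen;
    p += strcspn(p, " \t\r\n");  /* board field */
    p += strspn(p, " \t");       /* separator */
    p += strcspn(p, " \t\r\n");  /* side-to-move field */
    return parse_castling(p);
}
```

With this sketch, the shortened string from the post yields no castling rights
whether or not it ends in a DOS line ending, because the scan stops at the
carriage return instead of continuing into whatever follows the buffer.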


Hardly compares to yours, but I had fun chasing an error caused by the way gvim
seems to work.  Essentially, by pressing the wheel button on my mouse while
scrolling, I had pasted the last clipboard contents into the code at a random
position.

The program was locking up in the endgame.  I use a hash table for repetition
detection (thanks Bruce) and it turned out I had inserted a RemoveHash()
function call before the Egtb lookup returned.  So the hash entry was removed
in the wrong place in the program, and when the program tried to remove it a
second time, in the correct place, it entered an infinite loop caused by the
rehash.  (It counts and stops in debug mode, but it was meant to be working.)
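
A minimal sketch of the kind of table involved, in C, with the probe loop
bounded so that removing a missing entry fails instead of spinning.  This is
not the actual code: rep_insert(), rep_remove(), the table size and the linear
probing are all assumptions made for the example.  It also assumes keys are
non-zero and that entries are removed in reverse order of insertion, as they
are when a search unmakes moves.

```c
#include <stdbool.h>
#include <stdint.h>

#define REP_SIZE 1024u            /* hypothetical size; a power of two */
#define REP_MASK (REP_SIZE - 1u)

static uint64_t rep_table[REP_SIZE];  /* 0 marks an empty slot */

/* Insert a non-zero position key.  Returns false if the table is full. */
bool rep_insert(uint64_t key)
{
    for (uint32_t n = 0; n < REP_SIZE; n++) {
        uint32_t i = (uint32_t)((key + n) & REP_MASK);
        if (rep_table[i] == 0) {
            rep_table[i] = key;
            return true;
        }
    }
    return false;
}

/*
 * Remove the most recently inserted copy of key.  The probe is bounded by
 * the table size and also stops at the first empty slot, so a second
 * removal of the same entry returns false rather than looping forever.
 */
bool rep_remove(uint64_t key)
{
    uint32_t found = REP_SIZE;    /* sentinel: no match seen yet */

    for (uint32_t n = 0; n < REP_SIZE; n++) {
        uint32_t i = (uint32_t)((key + n) & REP_MASK);
        if (rep_table[i] == 0)
            break;                /* end of this probe chain */
        if (rep_table[i] == key)
            found = i;            /* remember the last copy in the chain */
    }
    if (found == REP_SIZE)
        return false;
    rep_table[found] = 0;
    return true;
}
```

Checking the return value of rep_remove() in every build, not just in debug
mode, would turn a stray extra removal into a reported error rather than a
lock-up.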

Not the sort of error I anticipated.  And of course, what else have I
inadvertently inserted?

Frank



Last modified: Thu, 15 Apr 21 08:11:13 -0700
