Computer Chess Club Archives


Search

Terms

Messages

Subject: Most obscure bug *ever*

Author: Colin Frayn

Date: 12:25:04 10/23/02


As most of you know, writing chess programs is fairly tricky.  Weird bugs arise
to do with NULL move and hashtables that take ages to fix, especially when you
get lots of complicated algorithms all working against each other.

So imagine my horror when Carlos at Chessbrain.net told me that Beowulf was
occasionally returning empty PV strings from searches, indicating that the
hashtable was broken (I get the PV from the hashtable directly).  This annoyed
me especially as I'd not heard of this problem before and I realised I must have
broken something recently.

I tried to verify this on my PC at home in Windows, but couldn't.  Carlos tested
it and we discovered that the problem only ever occurred rarely, and only on
non-windows boxes.  At this stage I was thinking 'compiler error?' or perhaps
'memory leak?'  Dreaded memory leaks take ages to find.

Anyway, I did a *lot* of testing, with the fabled bug proving exceptionally
elusive.  Often it would fail for a few goes and then fix itself randomly.  One
time I was trying to get the bug to work on this one position, and it did so for
the majority of the day, and then the following day that position was absolutely
fine - no probem at all, whatever I did.  I began to wonder if it was something
to do with the random number code, which would be seeded differently every time
the program was run.  I replaced it all.

Carlos provided me with a new position that failed and so I started to try the
debugging.  Eventually I found out that the root position was not being updated
properly.  I altered the hash replacement scheme, altered the hash update
scheme, changed loads of things around in the search function, decided to store
the full 64-bit hash key instead of just a 40 bit safe key.  basically, I spent
ages trying to work out what was wrong, but still no luck.

After a *lot* of testing, I finally managed to track the bug down to the fact
that the hash key was becoming corrupted at some point during the search.  I
began to test the DoMove() function, and also a few other things that could have
caused this.  We installed electric fence and checked for memory leaks.  No joy
(*sigh of relief*).  Somehow the hash key stored in the Board structure (which
is continuously updated during DoMove()) had become corrupted so that it didn't
correspond to the current board position any more.

Then I managed to add in some debugging code which quit as soon as the key
stored in the Board structure was no longer correct.  I got Beo to print out the
position.  I couldn't help but notice that in just three ply, black had made one
move but white had somehow moved his king about 4 squares.  Castling problem?
By this time I had also replaced the entire random number generation code, and
added in debugging code all over the place to print out messages in case of
errors, but I felt I was getting closer.

Then I suddenly realised.  Carlos had been sending strings to the engine of the
form;
3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w

(i.e. missing off the last two dashes.)  The full string should be
3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w - -

but of course the shortened version should still be valid.  One of the positions
I had 'fixed' before started working mysteriously after I decided to add back in
those two dashes just for neatness.  Another started working again for a short
while when I cut and pasted the entire line minus the end-of-line character, but
then failed again shortly after one I cut out the whole line including the EOL
char again.  Of course I didn't notice these at the time because I was so fixed
on testing my hashtable.

So what was the bug?  My FEN parser screwed up when it encountered a DoS EOL
character whilst running on a non-DoS machine.  It ended up interpreting it as a
totally spurious castling permission, sometimes allowing players to castle when
they shouldn't have been able to and therefore messing everything up, but
essentially at random, because the parser was just reading in the string for the
FEN past the end, and stopping when it came to a space character.  If it met a
K,Q,k or q in that space it would update the castling permissions, but of course
this was random unallocated memory, so the result of this was also essentially
random, meaning that sometimes it failed, and sometimes it didn't.

So after all of this, it was just me writing a bad parser ;)

How annoying is that?  :)
Can anyone beat this story?

Cheers,
Col



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.