Author: Colin Frayn
Date: 12:25:04 10/23/02
As most of you know, writing chess programs is fairly tricky. Weird bugs arise to do with NULL move and hashtables that take ages to fix, especially when you get lots of complicated algorithms all working against each other. So imagine my horror when Carlos at Chessbrain.net told me that Beowulf was occasionally returning empty PV strings from searches, indicating that the hashtable was broken (I get the PV from the hashtable directly). This annoyed me especially as I'd not heard of this problem before and I realised I must have broken something recently. I tried to verify this on my PC at home in Windows, but couldn't. Carlos tested it and we discovered that the problem only ever occurred rarely, and only on non-windows boxes. At this stage I was thinking 'compiler error?' or perhaps 'memory leak?' Dreaded memory leaks take ages to find. Anyway, I did a *lot* of testing, with the fabled bug proving exceptionally elusive. Often it would fail for a few goes and then fix itself randomly. One time I was trying to get the bug to work on this one position, and it did so for the majority of the day, and then the following day that position was absolutely fine - no probem at all, whatever I did. I began to wonder if it was something to do with the random number code, which would be seeded differently every time the program was run. I replaced it all. Carlos provided me with a new position that failed and so I started to try the debugging. Eventually I found out that the root position was not being updated properly. I altered the hash replacement scheme, altered the hash update scheme, changed loads of things around in the search function, decided to store the full 64-bit hash key instead of just a 40 bit safe key. basically, I spent ages trying to work out what was wrong, but still no luck. After a *lot* of testing, I finally managed to track the bug down to the fact that the hash key was becoming corrupted at some point during the search. I began to test the DoMove() function, and also a few other things that could have caused this. We installed electric fence and checked for memory leaks. No joy (*sigh of relief*). Somehow the hash key stored in the Board structure (which is continuously updated during DoMove()) had become corrupted so that it didn't correspond to the current board position any more. Then I managed to add in some debugging code which quit as soon as the key stored in the Board structure was no longer correct. I got Beo to print out the position. I couldn't help but notice that in just three ply, black had made one move but white had somehow moved his king about 4 squares. Castling problem? By this time I had also replaced the entire random number generation code, and added in debugging code all over the place to print out messages in case of errors, but I felt I was getting closer. Then I suddenly realised. Carlos had been sending strings to the engine of the form; 3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w (i.e. missing off the last two dashes.) The full string should be 3r4/5b2/1k1r1p2/Np5p/4P1p1/2R1KPP1/2P4P/R7 w - - but of course the shortened version should still be valid. One of the positions I had 'fixed' before started working mysteriously after I decided to add back in those two dashes just for neatness. Another started working again for a short while when I cut and pasted the entire line minus the end-of-line character, but then failed again shortly after one I cut out the whole line including the EOL char again. Of course I didn't notice these at the time because I was so fixed on testing my hashtable. So what was the bug? My FEN parser screwed up when it encountered a DoS EOL character whilst running on a non-DoS machine. It ended up interpreting it as a totally spurious castling permission, sometimes allowing players to castle when they shouldn't have been able to and therefore messing everything up, but essentially at random, because the parser was just reading in the string for the FEN past the end, and stopping when it came to a space character. If it met a K,Q,k or q in that space it would update the castling permissions, but of course this was random unallocated memory, so the result of this was also essentially random, meaning that sometimes it failed, and sometimes it didn't. So after all of this, it was just me writing a bad parser ;) How annoying is that? :) Can anyone beat this story? Cheers, Col
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.