Author: Keith Evans
Date: 13:29:10 03/31/02
Go up one level in this thread
On March 30, 2002 at 21:20:39, Slater Wold wrote: >More information on Marc Boule's thesis can be found at: >www.macs.ece.mcgill.ca/~mboul > >>Regards, >>Keith Thank you for posting the URL of Marc Boule's thesis. I read it over last night and have some questions and comments. On his page he mentions that his thesis won't be completed until July, so perhaps some of my feedback will be interesting so him. Are you in contact with him? You mentioned that "Marc Boule is currently working on an FPGA based move generator for his program MBChess. His thesis is almost complete. Marc got his program from about 2M nps to 10M nps using an FPGA for move ordering and movegen only." Does he really say this? It sounds too good to be true. Let's take a look at this from the perspective of the PCI bus. To be able to generate 10M moves per second on a 33 MHz PCI bus, Marc seems to be implying that he can complete a PCI read in 3.3 cycles and that the master will completely saturate the PCI bus with read cycles directed at his move generator. As far as I can recall the minimum length PCI read cycle is at least 4 cycles long - this is necessary to avoid bus contention on the AD bus. If you had back to back reads then you could see something like the following: Cycle 1 - Addressing phase, bus master drives address on AD Cycle 2 - Data turnaround phase - AD goes tristate Cycle 3 - Data phase - target drives data onto AD Cycle 4 - Termination - AD goes tristate I didn't draw the waveforms for the PCI bus, so I can't recall if there would be other cycles inserted in between the reads. I would be somewhat impressed if there were a PC which could saturate the PCI bus with non-burst transfers like that. If you want to get performance out of PCI then you really have to burst. If we assume that the PC _could_ saturate the bus, then a 4 cycle minimum would limit Marc to 8.3M moves/sec. If there are any other PCI inefficiencies then this could easily be reduced to 5M or 6M moves/sec. For Marc to keep the chip that busy he would have to complete his evaluation in 4 PCI cycles which is 120 ns - on a 1 GHz processor this would be 120 cycles which isn't much time. I guess that he's still working on that part of his thesis - I couldn't find any performance numbers for the whole system in there. It would be interesting if he could provide some profiling numbers for his program (without H/W move generator) and some other programs like Crafty and comment on if he got the acceleration that he expected, and how much he would expect his chip to accelerate other potentially more evaluation intensive programs. He could also do some experiments with Crafty - modify its move ordering and see the impact on the size of the search tree; hopefully somebody else has done this already and he can refer to their work. He needs to take this into account when calculating the potential for acceleration. (Hsu discussed this briefly in his thesis.) One strange thing in his thesis is where he compares the DC I/O levels of PCI drivers versus LVTTL drivers and says that LVTTL drivers might be faster. One critical thing about PCI is that you need to look at the dynamic drive specification for the drivers. If you don't look at the AC switching characteristics of the drivers then it's impossible to make a comparison. I suspect that Xilinx does the right thing here, but I'm too lazy to check right now. Do you know if the moves that he generates are legal or pseudo-legal? If they are pseudo-legal then there would be additional overhead. It would be interesting to add some discussion about this. I didn't completely understand his piece register bus. Why do the piece select lines come in pairs? Are they a row select and column select pair? Do you understand why a capturing move takes an extra cycle? If you view a non-capturing move as a capture of an empty square then it's strange that there would be a difference. I understand why en-passant and castling would take an extra cycle as you need to update more squares. Maybe he's not distinguishing between the cycles required for making and unmaking captures? Do you understand how he handles en passant and castling? His "Non-symmetries" section is interesting but brief. Do you think that he uses the mask bits to mask pawns not subject to en passant captures? And is there extra logic to make sure that a king doesn't castle though check? Is the host responsible for handling promotions - which piece to promote to? Does this require a promotions stack somewhere? I'm not sure how the masking logic really works. I know that the Advances in Computer Chess 3 indicates that Belle's mask is a single bit per square. When Hsu discussed it in his thesis and IEEE micro article he describes Belle as having a two bit wide mask per square. I don't understand how you can preserve all of the move generation state with a 1 bit mask, and since I read Hsu first I'm sort of stuck with this viewpoint. Since I'm sort of dumb, I would love to be enlightened. It would be interesting if he could put in his PCI address map since he's using address bits as part of a command. I don't really understand the numbers in table 2 - those can't be PCI cycles but he implies that they are. (PCI cannot do a 1 cycle bus transaction - you'll only see single cycle transfers in the midst of a burst.) He mentions that he doesn't provide readback of the piece registers in order to save routing. One question that I have is how Hsu's move generator interfaced to his evaluation. He shows a narrow 19-bit move bus, which sort of implies that he duplicates the piece registers in his evaluation. Does anybody reading this know much about Hsu's evaluator? I don't suggest the Marc get into this level of detail - it's just an interesting question for me. If somebody tries to add eval then they would need to know about this. Do you know if he plans to post his VHDL code once his thesis is complete? Documentation is a pain, so I hope that he releases the code as that would answer a lot of questions. I hope that this post doesn't come across as negative - if I didn't find his thesis interesting then I wouldn't have written such a long post. He's obviously spent a long time thinking about this and has come up with some intersting ideas and spent time thinking about how to best exploit the Virtex resources. Regards, Keith
This page took 0.03 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.