Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: HW based Crafty (Boule's thesis)

Author: Keith Evans

Date: 13:29:10 03/31/02

Go up one level in this thread


On March 30, 2002 at 21:20:39, Slater Wold wrote:

>More information on Marc Boule's thesis can be found at:
>www.macs.ece.mcgill.ca/~mboul
>
>>Regards,
>>Keith

Thank you for posting the URL of Marc Boule's thesis. I read it over last night
and have some questions and comments. On his page he mentions that his thesis
won't be completed until July, so perhaps some of my feedback will be
interesting so him. Are you in contact with him?

You mentioned that "Marc Boule is currently working on an FPGA based move
generator for his program MBChess. His thesis is almost complete.  Marc got his
program from about 2M nps to 10M nps using an FPGA for move ordering and movegen
only."

Does he really say this? It sounds too good to be true. Let's take a look at
this from the perspective of the PCI bus. To be able to generate 10M moves per
second on a 33 MHz PCI bus, Marc seems to be implying that he can complete a PCI
read in 3.3 cycles and that the master will completely saturate the PCI bus with
read cycles directed at his move generator. As far as I can recall the minimum
length PCI read cycle is at least 4 cycles long - this is necessary to avoid bus
contention on the AD bus. If you had back to back reads then you could see
something like the following:

Cycle 1 - Addressing phase, bus master drives address on AD
Cycle 2 - Data turnaround phase - AD goes tristate
Cycle 3 - Data phase - target drives data onto AD
Cycle 4 - Termination - AD goes tristate

I didn't draw the waveforms for the PCI bus, so I can't recall if there would be
other cycles inserted in between the reads. I would be somewhat impressed if
there were a PC which could saturate the PCI bus with non-burst transfers like
that. If you want to get performance out of PCI then you really have to burst.

If we assume that the PC _could_ saturate the bus, then a 4 cycle minimum would
limit Marc to 8.3M moves/sec. If there are any other PCI inefficiencies then
this could easily be reduced to 5M or 6M moves/sec.

For Marc to keep the chip that busy he would have to complete his evaluation in
4 PCI cycles which is 120 ns - on a 1 GHz processor this would be 120 cycles
which isn't much time.

I guess that he's still working on that part of his thesis - I couldn't find any
performance numbers for the whole system in there. It would be interesting if he
could provide some profiling numbers for his program (without H/W move
generator) and some other programs like Crafty and comment on if he got the
acceleration that he expected, and how much he would expect his chip to
accelerate other potentially more evaluation intensive programs. He could also
do some experiments with Crafty - modify its move ordering and see the impact on
the size of the search tree; hopefully somebody else has done this already and
he can refer to their work. He needs to take this into account when calculating
the potential for acceleration. (Hsu discussed this briefly in his thesis.)

One strange thing in his thesis is where he compares the DC I/O levels of PCI
drivers versus LVTTL drivers and says that LVTTL drivers might be faster. One
critical thing about PCI is that you need to look at the dynamic drive
specification for the drivers. If you don't look at the AC switching
characteristics of the drivers then it's impossible to make a comparison. I
suspect that Xilinx does the right thing here, but I'm too lazy to check right
now.

Do you know if the moves that he generates are legal or pseudo-legal? If they
are pseudo-legal then there would be additional overhead. It would be
interesting to add some discussion about this.

I didn't completely understand his piece register bus. Why do the piece select
lines come in pairs? Are they a row select and column select pair? Do you
understand why a capturing move takes an extra cycle? If you view a
non-capturing move as a capture of an empty square then it's strange that there
would be a difference. I understand why en-passant and castling would take an
extra cycle as you need to update more squares. Maybe he's not distinguishing
between the cycles required for making and unmaking captures?

Do you understand how he handles en passant and castling? His "Non-symmetries"
section is interesting but brief. Do you think that he uses the mask bits to
mask pawns not subject to en passant captures? And is there extra logic to make
sure that a king doesn't castle though check? Is the host responsible for
handling promotions - which piece to promote to? Does this require a promotions
stack somewhere?

I'm not sure how the masking logic really works. I know that the Advances in
Computer Chess 3 indicates that Belle's mask is a single bit per square. When
Hsu discussed it in his thesis and IEEE micro article he describes Belle as
having a two bit wide mask per square. I don't understand how you can preserve
all of the move generation state with a 1 bit mask, and since I read Hsu first
I'm sort of stuck with this viewpoint. Since I'm sort of dumb, I would love to
be enlightened.

It would be interesting if he could put in his PCI address map since he's using
address bits as part of a command. I don't really understand the numbers in
table 2 - those can't be PCI cycles but he implies that they are. (PCI cannot do
a 1 cycle bus transaction - you'll only see single cycle transfers in the midst
of a burst.)

He mentions that he doesn't provide readback of the piece registers in order to
save routing. One question that I have is how Hsu's move generator interfaced to
his evaluation. He shows a narrow 19-bit move bus, which sort of implies that he
duplicates the piece registers in his evaluation. Does anybody reading this know
much about Hsu's evaluator? I don't suggest the Marc get into this level of
detail - it's just an interesting question for me. If somebody tries to add eval
then they would need to know about this.

Do you know if he plans to post his VHDL code once his thesis is complete?
Documentation is a pain, so I hope that he releases the code as that would
answer a lot of questions.

I hope that this post doesn't come across as negative - if I didn't find his
thesis interesting then I wouldn't have written such a long post. He's obviously
spent a long time thinking about this and has come up with some intersting ideas
and spent time thinking about how to best exploit the Virtex resources.

Regards,
Keith



This page took 0.03 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.