Computer Chess Club Archives



Subject: Re: HW based Crafty (Boule's thesis)

Author: Robert Hyatt

Date: 20:41:05 04/02/02



On April 01, 2002 at 13:03:26, Keith Evans wrote:

>On April 01, 2002 at 09:46:39, Robert Hyatt wrote:
>
>>On April 01, 2002 at 04:20:57, Tom Kerrigan wrote:
>>
>>>On March 31, 2002 at 16:29:10, Keith Evans wrote:
>>>
>>>>generator for his program MBChess. His thesis is almost complete.  Marc got his
>>>>program from about 2M nps to 10M nps using an FPGA for move ordering and movegen
>>>>only."
>>>>
>>>>Does he really say this? It sounds too good to be true. Let's take a look at
>>>>this from the perspective of the PCI bus. To be able to generate 10M moves per
>>>>second on a 33 MHz PCI bus, Marc seems to be implying that he can complete a PCI
>>>>read in 3.3 cycles and that the master will completely saturate the PCI bus with
>>>
>>>Right, and presumably much more data needs to be transferred than just reading
>>>moves. (I'd like to read the paper but I'm on dial-up right now.) Also, this
>>>assumes the logic can generate moves that fast. Seems like the critical paths
>>>for move generation would be pretty darn long on an FPGA; I'd be surprised if it
>>>could run at ~33MHz. (Not that it would necessarily have to, but I do think
>>>clock speed is a likely bottleneck.) Of course, my biggest problem with the 2M
>>>-> 10M NPS jump is that MBChess must be spending more than 80% of its CPU time
>>>generating and ordering moves, which is way beyond realistic, IMO.
>
>I'm assuming that he got it to run at 33 MHz. He thought about how it would be
>mapped into an FPGA and was careful not to do anything that would cause routing
>congestion. The interesting question to me is: given that it runs at the 33 MHz
>PCI clock rate and implements the most efficient PCI transfers, what
>system-level performance can be expected out of it, and how much acceleration
>could it possibly provide for a top chess program running on a 1 GHz+ x86-class
>processor? (Basically, what if a graduate student reimplemented ChipTest as a
>full-custom 0.18 micron chip and made the changes/improvements proposed in the
>thesis?)
>
>My belief is that it wouldn't be able to get beyond 5-6 million moves per second
>in a system, and that it would not accelerate Crafty. My understanding is that
>after a move is generated and returned by the FPGA, Crafty would still need to
>generate hashes, update material counts, handle repetition detection and the
>50-move counter, update some stacks, ... before a move from this FPGA would be
>equivalent to the moves that are currently benchmarked in Crafty.
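
For reference, here is how the raw bus arithmetic in the quoted discussion works
out. This is only a back-of-the-envelope sketch in C; the four-clock figure for
an ideal single-data-phase PCI read (address phase, turnaround, data phase,
idle) is my own optimistic assumption, not a number from the thesis:

    #include <stdio.h>

    int main(void) {
        const double pci_clock_hz  = 33.0e6;  /* 33 MHz PCI clock                */
        const double moves_per_sec = 10.0e6;  /* the claimed 10M nps             */
        const double min_read_clks = 4.0;     /* optimistic: address, turnaround,
                                                 data, idle; no wait states      */

        printf("clock budget per move : %.1f PCI clocks\n",
               pci_clock_hz / moves_per_sec);
        printf("ideal single-read rate: %.2f M transfers/sec\n",
               pci_clock_hz / min_read_clks / 1.0e6);
        return 0;
    }

This prints a budget of 3.3 clocks per move against an ideal ceiling of roughly
8.25M single-word reads per second, and real host-initiated reads are
considerably slower than the ideal, which is why a figure in the 5-6M range
seems plausible and 10M does not.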


This depends on what is implemented.  That is, I suggested including an FPGA
MakeMove() and UnmakeMove() function to take care of the hash updates.

But ChipTest implemented a full alpha/beta tree search in hardware, which means
that if this became an FPGA process, you could give the FPGA a position to
search and a depth, then sit and wait for milliseconds while it does the
search.  That would result in minimal PCI traffic...  and let the hardware run
at full FPGA speed, which might easily be well beyond 30M nodes per second...
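
To make that concrete, the host side of such a "whole search on the card"
interface could be as simple as the sketch below. The register layout here is
purely hypothetical (it is not from Boule's thesis, Belle, or Deep Thought);
the point is only that one search costs a handful of bus transactions rather
than one per node:

    #include <stdint.h>

    /* Hypothetical memory-mapped register block for an FPGA search engine. */
    typedef struct {
        volatile uint64_t position[8]; /* root position, e.g. as bitboards    */
        volatile uint32_t depth;       /* writing the depth starts the search */
        volatile uint32_t done;        /* becomes non-zero when finished      */
        volatile uint32_t best_move;   /* packed from/to/promotion            */
        volatile int32_t  score;       /* alpha/beta result                   */
    } fpga_search_regs;

    /* One search: a few writes, a wait, two reads; minimal PCI traffic. */
    static uint32_t fpga_search(fpga_search_regs *regs,
                                const uint64_t board[8],
                                uint32_t depth, int32_t *score)
    {
        for (int i = 0; i < 8; i++)
            regs->position[i] = board[i];  /* ship the root position once  */
        regs->depth = depth;               /* kick off the hardware search */

        while (!regs->done)                /* host just waits for the      */
            ;                              /* milliseconds the search runs */

        *score = regs->score;
        return regs->best_move;            /* only the result crosses PCI  */
    }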



>
>Also Crafty would have to use the move ordering given by the FPGA and wouldn't
>be able to use some of the tricks of the trade. Maybe I'm wrong and there's a
>way to instruct the FPGA to make a particular move imposed by software, but this
>is not mentioned in the thesis. If I'm right then we would need to take the
>increased tree size into account to compare performance - just a raw NPS would
>not be meaningful.

Correct.  Some things (killers, etc.) would be highly problematic and not easy
to implement.  But basic capture ordering using MVV/LVA is easy to do in
hardware (in fact, that is where it was first implemented, in Belle).
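
For anyone unfamiliar with it, MVV/LVA ordering is just "most valuable victim
first, least valuable attacker as the tie-breaker", which is why it maps so
naturally onto a small priority circuit. A software sketch, with illustrative
piece codes rather than Crafty's or Belle's actual tables:

    /* MVV/LVA capture scoring: the victim dominates the score and the
       attacker only breaks ties, so PxQ is tried before QxP.           */
    enum { PAWN = 1, KNIGHT = 2, BISHOP = 3, ROOK = 4, QUEEN = 5, KING = 6 };

    static int mvv_lva_score(int victim, int attacker)
    {
        return (victim << 3) - attacker;   /* e.g. PxQ = 39, QxP = 3 */
    }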


>
>>>
>>
>>
>>It depends on what he is going to do.  That is, if he tries to do a full
>>Belle implementation, then the PCI bus speed is irrelevant, because all
>>the move generation and search is done _inside_ the FPGA hardware,
>>and all that needs to run over the PCI bus is the _result_ of the search.
>>
>>That was how both Belle and Deep Thought/Deep Blue worked...  DT/DB used the
>>VME bus...  Belle used a very slow connection from a PDP-11 to the special-
>>purpose chess hardware...
>>
>>High bandwidth is not necessarily needed, depending on what is implemented.
>>
>
>If you read the thesis you'll see that it's just a move generator. As far as I
>can tell, you need to perform a minimum of one PCI transfer per move. (You can
>both unmake the last move and return the next move in one read transfer.)
>
>You can read his white paper at: www.macs.ece.mcgill.ca/~mboul
>


Probably correct regarding this paper, as I haven't read it.  I was talking
more specifically about how it was done in Belle and then in Deep Thought...
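
To see why one read per generated move is at least plausible on the wire, a
single 32-bit word is enough to carry a move. The packing below is purely
illustrative, an assumption for the sake of the example and not the encoding
the thesis actually uses:

    #include <stdint.h>

    /* Illustrative (assumed) packing of one move into the uint32_t word
       returned by a single PCI read from the card.                       */
    #define MOVE_FROM(w)  (((w) >>  0) & 0x3Fu)  /* 6 bits: from square   */
    #define MOVE_TO(w)    (((w) >>  6) & 0x3Fu)  /* 6 bits: to square     */
    #define MOVE_PIECE(w) (((w) >> 12) & 0x07u)  /* moving piece type     */
    #define MOVE_CAPT(w)  (((w) >> 15) & 0x07u)  /* captured piece type   */
    #define MOVE_PROMO(w) (((w) >> 18) & 0x07u)  /* promotion piece       */
    #define MOVE_LAST(w)  (((w) >> 31) & 0x01u)  /* "no more moves" flag  */

Even so, the host still has to make the move in its own data structures (hash
signatures, repetition list, 50-move counter, and so on), which is the point
made above about an FPGA-supplied move not being equivalent to a fully made
move in Crafty.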



>It prompted me to reread some articles on both Belle and Deep Thought. It seems
>to me that if Ken Thompson and Joe H. Condon had been motivated enough, they
>could have removed some inefficiencies from Belle and it would have been
>playing at a higher level than Deep Thought. Their evaluation was far more
>extensive. Did they simply get tired of computer chess?

Deep Thought's evaluation was based directly on Belle's...  Deep Thought was
simply a "Belle on a single chip", where Belle was a huge batch of discrete
chips in a big box...  Later versions of DT were far stronger than Belle, and
then DB had an eval that was a hundred times better than Belle's...



>
>I just noticed a discrepancy between how Hsu and Thompson/Condon describe the
>Belle move generator. Thompson/Condon claim that their disable stack is 1 bit
>wide by 64 deep at each square, while Hsu claims that it is 2 bits wide by 64
>deep. I would think that you would need the 2 bits to maintain both the victim
>and aggressor state. Of course this method of storing state is inefficient, as
>Hsu found, and you basically just need the last move to regenerate it - but if
>you don't store it at each square, then it has to be loaded somehow, which
>would presumably take time, or a decoder and a lot of wires.



I think that was one thing Hsu did in the DT design: eliminate that stack...
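
As a software analogy of the two schemes being compared: Belle-style
bookkeeping keeps "already searched" state per square per ply (two bits per
square, 64 deep; equivalently, two 64-bit masks per ply), while Hsu's approach
keeps only the last move searched and recomputes those masks from it. This is
my own modelling of the descriptions in this thread, not the actual circuits:

    #include <stdint.h>

    #define MAX_PLY 64               /* the "64 deep" stack in the discussion */

    /* Belle-style: two disable bits per square, pushed/popped with ply.   */
    typedef struct {
        uint64_t victim_done[MAX_PLY];   /* victims already fully searched */
        uint64_t attacker_done[MAX_PLY]; /* attackers already tried against
                                            the current victim             */
    } disable_stack;

    /* Hsu-style (Deep Thought / Deep Blue): store only the last move
       searched at each node and regenerate the two masks from it, so the
       per-square stack disappears.                                        */
    typedef struct {
        uint32_t last_move[MAX_PLY];
    } last_move_stack;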



>
>Here is what Hsu says in IEEE micro:
>
>   "In the earlier Belle design, a 2-bit disable-stack masks off
>    squares already searched. One disable-stack bit masks off
>    attacker squares already searched for a given victim.
>    The other bit masks off fully searched victims.
>
>    ...
>
>    In the Deep Thought and Deep Blue chess move generators, the
>    last move searched from the position is used to compute the
>    two-bit disable-mask."
>
>Has anyone read any detailed description of Belle _beyond_ what's given
>in:
>
>J.H. Condon and K. Thompson. Belle chess hardware. In Advances in
>Computer Chess 3, pages 45-54. Pergamon Press, Oxford, England, 1982.
>
>J.H. Condon and K. Thompson. Belle. In Chess Skill in Man and Machine,
>2nd ed., P.W. Frey, ed., pages 201-210. Springer-Verlag, 1983.
>
>I have those at home, but can't quote from them at the moment. Hsu seems to know
>more about Belle than what's written in those documents. Perhaps this is based
>on conversations that he had with Thompson/Condon.
>
>Regards,
>Keith


Ken was advising Hsu as he developed the "Belle on a chip" design...



