Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: HW based Crafty (Boule's thesis)

Author: Robert Hyatt

Date: 07:01:10 04/09/02

Go up one level in this thread


On April 08, 2002 at 23:58:40, Keith Evans wrote:

>On April 08, 2002 at 23:11:02, Robert Hyatt wrote:
>
>>On April 08, 2002 at 22:35:00, Keith Evans wrote:
>>
>>>On April 08, 2002 at 12:11:53, Robert Hyatt wrote:
>>>
>>>>On April 08, 2002 at 04:48:01, Keith Evans wrote:
>>>>
>>>>I tend to agree...  doing a full "engine" in an FPGA would be a "task".  The
>>>>thing that seemed most interesting to me was to take the bitmap stuff and tuck
>>>>it all into an FPGA so that we get special-purpose "instructions" like
>>>>"makemove", "unmakemove", "generate moves" and perhaps even a Swap() function
>>>>and InCheck function.  That would certainly run the search speed up by a
>>>>factor of at least two, which would be very significant.  And it wouldn't
>>>>give up anything at all since the actual search, hashing, etc would be done
>>>>in software, but the bitmap/hash-signature update stuff would be done in
>>>>hardware.
>>>>
>>>
>>>I don't really have a good feel for what would be involved in this, but it seems
>>>to me that if you are considering the use of an FPGA then we need to take a
>>>radically different approach to achieve a performance gain over these 2 GHz CPUs
>>>- the routing delays in FPGAs are huge compared to 1/2 ns.
>>
>>Yes... but don't forget the memory cost is not .5 ns but in the 100ns range,
>>so that is pretty large...
>
>I'm not sure how much data would need to be transferred to a hypothetical
>bitboard coprocessor. My concern would be that if you were to put this on a PCI
>device then each access would take at least 120 ns if you're not bursting and
>you're on a 32-bit 33 MHz PCI bus. This is a context switch for my brain and I'm
>totally ignorant of how Crafty works here - I have no idea how the bus cycles
>could/would look. At least "generate moves" should burst. I don't think that you
>would want a descriptor based DMA engine - just let the CPU pull data.


I based my suggestion on the fact that a single memory reference takes
100ns or so...  and it is very likely that _somewhere_ inside MakeMove()
there will be at least one and most likely a _bunch_...  so 120ns doesn't
sound bad at all...  Ditto for UnMakeMove() and the other functions we were
talking about..





>
>This would be a lot easier project to take on than a full engine. The promise of
>an engine that could deliver greater than a 2X increase is exciting, but perhaps
>too difficult a place to begin. If something like your bitboard accelerator
>makes sense on paper, then it might be a lot better way for Slater to come up to
>speed, before deciding if he wants to pursue it to the extreme.
>
>>>
>>>I know that your bitmap approach uses a lot of lookup tables, but I'm not sure
>>>that those could be put on an FPGA or even that if that would have to be done on
>>>the sort of chip that you're talking about. If they would be required - then can
>>>you tell me how large they are?
>>
>>
>>For bishops (assuming the "compact_attacks" version) I need 2 x 296 x 8 bytes.
>>For normal attacks, it is 2 x 64 x 256 x 8 bytes...  I normally use the
>>compact version on intel machines which greatly reduces memory traffic and
>>cache thrashing.
>>
>
>I went and fetched my Virtex datasheet. I like to use the Virtex 1000E part for
>this kind of thing because I know that it's readily available, and there are
>some interesting proto cards that use it. Depending upon how something like this
>develops it might be more appropriate to choose an extended RAM part.
>
>It has 393,216 bits of BlockRAM. This consists of 96 RAMs which are each
>independently organized as one of {4096x1,2048x2,...,256x16} Assuming that the 8
>bytes is referring to a 64-bit wide memory organization and that the 2 is
>referring to two separate ROMs you could build two 512x64 RAMs using the 512x8
>configuration which would eat up 16 Block Select RAMs. These RAMs are organized
>into columns and this would eat up one column out of 6. It's a little wasteful
>due to your 296 number and the lower limit on the RAM depth configuation.
>
>So how many more of these tables are there...?

That's a good question.  But first, a more important question...  what would
be the _best_ way to do this in hardware?  IE lookup tables are the best way
(so far) for software...  But for hardware things would probably be
different.  IE it would be possible to rotate the bitboards in hardware
rather than storing them already rotated...  This probably needs some thought
as it is not always the best idea to take a software algorithm and stuff it into
hardware as-is...



>
>One nice thing about a bitboard accelerator is that it could probably be
>pipelined and clocked reasonably fast. One problem is that the placement of the
>RAMs is fixed - so routing delays may be large depending on how the tables are
>used.
>
>Regards,
>Keith



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.