Author: Robert Hyatt
Date: 13:20:23 02/18/04
On February 18, 2004 at 16:08:32, Bob Durrett wrote:

>On February 18, 2004 at 13:47:04, Robert Hyatt wrote:
>
>>On February 17, 2004 at 16:56:46, Bob Durrett wrote:
>>
>>>
>>>If one ignores the many anticipated objections to doing this research, then what
>>>would it take?
>>>
>>>I can see the following process:
>>>
>>>(1) A list of blocks of code which could be "hardware-ized" is made and the
>>>list ordered in order of decreasing expected benefit.
>>>
>>>(2) Then, the powers that be would decide which blocks to do that to.
>>>
>>>(3) Then it would be done. Testing and tuning would follow.
>>>
>>>(4) Then a match would be set up with Kasparov.
>>>
>>>(5) Etceteras.
>>>
>>>By "hardware-izing" I mean creating a chip which could input its data,
>>>process that data, and output that data all within the time of a few clock
>>>cycles of a modern microprocessor chip.
>>>
>>>QUESTIONS:
>>>
>>>Which block of Crafty code would be chosen first? How much benefit to expect?
>>>
>>>Same questions for the second block of code.
>>>
>>>Etceteras for the rest of the code until nothing is left.
>>>
>>>Bob D.
>>
>>
>>We talked about this a couple of years ago. Fortunately, I have been around
>>long enough to understand the original Belle chess engine hardware design, and
>>how it was "cloned" into DT/DB (with a subtle change to make the chip smaller).
>>
>>With that in mind, Crafty really doesn't fit a hardware implementation. IE much
>>of what I do would be _very_ difficult to do in hardware. It is the right way
>>to do things when memory is handy, but not when trying to implement the thing
>>directly in hardware. For example, SEE is not a nice hardware topic because it
>>is a variable-length process, and hardware doesn't like that. Hardware
>>implementations use the old Belle MVV/LVA approach to ordering captures instead,
>>which is ok, but not nearly as good as SEE. I generate moves and stuff 'em into
>>an array.
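[As an aside for readers unfamiliar with the two ordering schemes: MVV/LVA is a constant-time table lookup, which is why it maps onto hardware so easily, while SEE must play out a variable-length exchange sequence. A minimal sketch in C; the piece codes and centipawn values are illustrative, not Crafty's actual encoding:]

```c
#include <assert.h>

/* Illustrative piece codes and centipawn values (not Crafty's actual
   encoding): 0=pawn, 1=knight, 2=bishop, 3=rook, 4=queen, 5=king. */
static const int value[6] = { 100, 300, 300, 500, 900, 10000 };

/* MVV/LVA: rank captures by Most Valuable Victim first, breaking ties
   with the Least Valuable Attacker.  One table lookup per move, with
   no loops -- constant time.  SEE, by contrast, must simulate the
   whole capture sequence on the target square, whose length varies
   from position to position. */
int mvv_lva_score(int attacker, int victim)
{
    return value[victim] * 100 - value[attacker];
}
```

[A hardware comparator tree can sort these scores in a fixed number of clock cycles; SEE's loop over the exchange sequence cannot be bounded the same way, which is the point made above.]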
>>Hardware doesn't fit that very well, and the Belle approach is
>>generally used where you generate one move, make it, and go on; then later
>>you come back and generate another move, make it, and so forth. Dark Thought
>>implemented the move generator this way, and I sometimes wish I had, as it does
>>offer some efficiencies, but it also creates problems, such as the inability to
>>use a history heuristic, for example.
>>
>>The things that would help Crafty would be special-purpose hardware instructions
>>that do the following:
>>
>>Make/Unmake moves. This special-purpose hardware would keep the bitboards
>>internally in dedicated special-purpose registers, and do the updating needed as
>>asked by the software when making/unmaking a move. The software time for this
>>would then vanish.
>>
>>Generate captures / generate non-captures / generate check evasions. It would be
>>nice to have an instruction that, when executed, dumps a list of moves ready to
>>go, with no real overhead. The move generator time would basically vanish.
>>
>>A hardware SEE would be nice, and is doable, but doing it in the middle of a
>>hardware search would be messy. A separate SEE instruction, though, would simply
>>make the cost of that vanish as well.
>>
>>The above add up to less than 50% of the total search time. Making them
>>"vanish" would make Crafty possibly 2x faster. The last 50% is way tougher, as
>>it is the evaluation code, and it consumes 50%+ of the total search time at
>>present. It could obviously be moved into hardware, and much of it would fit
>>very well since it is bitmap pattern matching. Some of it would be very
>>difficult (the blocked-pawn code, since I look to see where all pawns can
>>advance safely to, and that is hard to do in a fixed number of clock cycles).
>>But if we did this part in hardware, then we _might_ hit speeds 10x faster, or
>>close to it, since all that is left is just the search overhead itself.
>>
>>Ken's pre-1980 version of Belle did just this.
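[The make/unmake instruction described above is easy to picture in software terms. A minimal sketch, assuming a simple two-sided bitboard layout; this is not Crafty's actual data structure:]

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t piece_bb[2][6];   /* [side][piece type] occupancy */
    uint64_t occupied[2];      /* all pieces of each side */
} Board;

/* Make a quiet (non-capture) move by toggling the from- and to-square
   bits.  Because XOR is its own inverse, calling this a second time
   with the same arguments unmakes the move -- exactly the kind of
   fixed-latency register update the post imagines moving into
   dedicated hardware. */
void make_quiet(Board *b, int side, int piece, int from, int to)
{
    uint64_t move_bb = (1ULL << from) | (1ULL << to);
    b->piece_bb[side][piece] ^= move_bb;
    b->occupied[side]        ^= move_bb;
}
```

[Each update is a handful of XORs with no data-dependent branching, so a hardware implementation can complete it in a known, small number of clock cycles.]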
>>This is the version of Belle
>>that first played in 1978 and was phased out in 1980 when he did the "whole
>>deal" in hardware. This 1978 version produced a program that was 2x
>>faster than the closest competition, Chess 4.7. Chess 4.7 searched about 2600
>>nodes per second; 1978 Belle hit 5000, but don't forget that the software part
>>was running on a _very_ slow minicomputer (a PDP 11/44).
>>
>>Making Crafty go perhaps 10x faster would be nice. It would take 2+ years to
>>get it all done and tested, probably. By then hardware would have caught up
>>somewhat and that 10x would have dropped to 2x-3x. That's the problem with
>>hardware. And it is why I am not that excited about Hydra.
>>
>>Some numbers:
>>
>>In 1975 Chess 4.0 showed up at the ACM event on a CDC 7600 (Cyber 176). They
>>were searching 10x faster than anyone else. IE they were doing 2600 NPS on that
>>box, while the rest of us were struggling to hit 100, until you happened to
>>find an IBM 360/91 or something lying around, where you might hit 250-300 NPS.
>>10x proved to be a dominating advantage, as their early ACM history shows.
>>
>>In 1978 Ken caught up and actually passed them by a factor of 2, but that factor
>>of 2 just made the games interesting; Belle was not dominant.
>>
>>In 1980 we had just moved to the Cray and started at about 1K nodes per second,
>>but eventually passed Chess 4.x's 2.6K. But Belle showed up with a new hardware
>>design that was hitting 160K. 80x faster than Chess 4.x and us. And it was the
>>dominant box for the next three years. Finally, by 1983 we were doing 20K on
>>the XMP-2, and that was enough to put us back in the game, as our search was
>>more efficient for lots of reasons: Belle had no hash move for ordering, no
>>SEE, no killer/history ordering, etc. We steadily pulled ahead through 1986,
>>when we were doing 200K+ and the next closest thing was Belle, but it was a ply
>>behind us then with our better software search.
>>
>>In 1987 along came ChipTest/Deep Thought, and their speed improved by leaps and
>>bounds, quickly hitting 2-4M by the 1989 WCCC. And again, with that speed
>>advantage, they dominated computer chess through their last event in 1995,
>>losing only one game that I recall, to Fritz in Hong Kong.
>>
>>Hydra doesn't represent that remarkable a speed jump yet. IE they claim 3M
>>nodes per second per board, but something like 60% search efficiency, so back
>>that off to 2M x 8 boards, which is 16M nodes per second. I just ran on a
>>machine in CCT6 that was doing between 8 and 10M nodes per second, and I have
>>tested on an 8-way Opteron that matches Hydra's speed pretty well, yet I don't
>>have any of the typical hardware shortcomings, such as their lack of hashing in
>>the hardware and so forth.
>>
>>That is why I said that it doesn't impress me that much. It's an interesting
>>project, and it produces some interesting speed, but it is not way out ahead of
>>what anyone could produce on a simple 8-way Opteron, much less going to 16-way
>>and beyond. IE had it come out 3-4 years ago, it might have been a dominating
>>force for a couple of years, but today it simply isn't.
>>
>>Now if they put together a new version with faster FPGAs, or many more FPGAs,
>>then this might change. But I am evaluating reality, not vaporware. Until I see
>>such a beast, I don't have to play it and won't worry about it. Meanwhile, I
>>know what current 64-bit hardware can do for me, and the result is _not_
>>something to be taken lightly. CCT6 showed that...
>
>A most interesting set of comments and review of past experience. Very
>informative.
>
>Two questions stand out:
>
>(1) What's the problem with hash [and maybe also the other things referred to],
>and

Hashing requires a memory access. With a distributed engine (Hydra uses 4 PCs
with two FPGA cards per PC, as I understand it), you have an impossibility. You
need a global hash table, but using network speeds makes it difficult to do.
You need a shared hash table between the two cards on the PCI bus and the
software program. But accessing memory introduces a long delay in a card that
is counting clock cycles carefully. Hsu implemented the logic for hash tables
in the last DB chip (DB2), but he didn't have time to design and build the
16-way multi-ported memory required so that 16 chess processors could share the
hash table. The easy solution is to just give up. Note that Belle didn't give
up, and actually hashed in the hardware search. So it can be done, but for some
reason the Hydra team chose to forget about it. Most likely they already have
problems with PCI bandwidth/latency, and adding a lot of memory-reference
traffic (one probe per node searched) might swamp the PCI bus and produce a
terrible backlog.

>
>(2) Why is it assumed that the hardware would not advance too? It is true that
>an individual piece of hardware may not change, but new versions might get
>better. Consider the case of video and audio cards in PCs. They are getting
>better by leaps and bounds!
>

The hardware advances, but it also changes. This means that what you did last
year has to be modified before it will run on this year's hardware. That's why
assembly language programs are so miserable to deal with... A minor change to
the hardware might well require a major re-design of the application that runs
on that hardware. Just watch the Hydra project to see what I mean. You won't
see a steady roll-out of new PCI cards as the PCI bus gets faster, new firmware
as the FPGAs become faster or more capable, etc...

>Bob D.
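[To make the "one probe per node searched" traffic concrete, here is a minimal transposition-table sketch in C; the entry layout and depth-preferred replacement rule are illustrative, not DB2's or Hydra's:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One table entry: the full Zobrist key (to detect collisions on the
   same slot), plus a score and the search depth it came from. */
typedef struct {
    uint64_t key;     /* Zobrist hash of the position */
    int16_t  score;
    uint8_t  depth;
} TTEntry;

#define TT_SIZE (1 << 20)          /* power of two so we can mask */
static TTEntry table[TT_SIZE];

/* Probe: one memory access per node searched.  In software this is a
   cache/memory reference; for an FPGA card it would be a round trip
   across the PCI bus, which is the latency problem described above. */
TTEntry *tt_probe(uint64_t key)
{
    TTEntry *e = &table[key & (TT_SIZE - 1)];
    return (e->key == key) ? e : NULL;   /* hit or miss */
}

/* Store with a simple depth-preferred replacement scheme: keep the
   existing entry if it came from a deeper search of another position. */
void tt_store(uint64_t key, int16_t score, uint8_t depth)
{
    TTEntry *e = &table[key & (TT_SIZE - 1)];
    if (e->key != key && e->depth > depth)
        return;
    e->key = key;
    e->score = score;
    e->depth = depth;
}
```

[At millions of nodes per second, that single probe per node multiplies into millions of bus transactions per second, which is why pushing it across PCI, rather than keeping it on-chip as Belle did, is so unattractive.]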
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.