Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question to Bob: Crafty , Alpha and FindBit()

Author: Robert Hyatt

Date: 19:50:16 06/08/98

Go up one level in this thread


On June 08, 1998 at 11:33:50, Robert Hyatt wrote:

>On June 08, 1998 at 11:18:54, Vincent Diepeveen wrote:
>
>>On June 08, 1998 at 11:01:11, Robert Hyatt wrote:
>>
>>>On June 08, 1998 at 10:34:23, Vincent Diepeveen wrote:
>>>
>>>>On June 08, 1998 at 07:37:44, Robert Hyatt wrote:
>>>>
>>>>>On June 08, 1998 at 07:17:08, Vincent Diepeveen wrote:
>>>>>
>>>>>>
>>>>>>On June 05, 1998 at 21:48:57, Robert Hyatt wrote:
>>>>>>
>>>>>>>On June 05, 1998 at 17:47:07, Bruce Moreland wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>On June 05, 1998 at 08:53:53, Ernst A. Heinz wrote:
>>>>>>>>
>>>>>>>>>BTW, we have been waiting to get our hands on a Alpha-21264 for roughly
>>>>>>>>>a year because "DarkThought" will hopefully speed up as much on them
>>>>>>>>>as "Crafty".
>>>>>>>>
>>>>>>>>What percentage of execution time do you and Bob spend doing these
>>>>>>>>operations, do you think?netscape -install
>>>>
>>>>>>>>
>>>>>>>>bruce
>>>>>>>
>>>>>>>FindFirstOne() = 1-2% in crafty.  popcnt is much less...  so that won't
>>>>>>>help a whole lot.  but the 64 bit stuff helps in general...
>>>>>>
>>>>>>Well that doesn't say anything concrete.
>>>>>>
>>>>>>More important is the overall delay.
>>>>>>
>>>>>>At a P133 intel the speed difference between crafty's move generator
>>>>>>and diep's move generator is somewhat more than 3 times.
>>>>>>
>>>>>>Crafty is more than 3 times slower, because of bitboards, which perform
>>>>>>badly at 32 bits. Note that the more possibilities, this 3 times will
>>>>>>become
>>>>>>4 times.
>>>>>>
>>>>>>For something taking 1 clockcycle in my program and which can
>>>>>>be even pipelined in U + V pipe,
>>>>>>crafty needs sometimes even a whole function.
>>>>>>
>>>>>>Greetings,
>>>>>>Vincent
>>>>>
>>>>>
>>>>>yes... but try to generate nothing but captures and compare that, which
>>>>>is a major part of the total tree search...
>>>>
>>>>Where do i need to generate captures for?
>>>>
>>>>But to answer your question: 2.5 times faster.
>>>
>>>
>>>Because in ordering alpha/beta moves, captures are searched first.  And
>>
>>this statement is already naive.
>>
>>I search a move that gives a cutoff. Sometimes a capture, sometimes a
>>check,
>>sometimes a promotion, depending on what move is gonna give me the
>>cutoff.
>>
>>I think we directly here see why Diep's branching factor is that well.
>>I don't have a special order. I pick simply the move that is very
>>probably gonna give me a cutoff.
>>
>>>they cause most of the cutoffs.  If you just produce captures, you don't
>>>have to continually wade through the non-captures picking out capture
>>>moves to try...
>>>
>>>and the best test to try is to set up a position in Crafty, and use the
>>>"perf" command...  set up a position such as kopec22 which is a
>>>reasonable
>>>early middlegame position, and type "perf".  On a single processor
>>>pentium
>>>pro, that produces the following:
>>
>>We cannot compare our processors. I have a P133 at the same speed,
>>but i don't have a Pentium pro SDRAM with 512 kb cache at a fast
>>mainboard,
>>like you have.
>>
>
>I don't have SDRAM nor 512KB cache.  My ALR uses FPM memory and each
>cpu has 256K of cache... just like yours probably... except your memory
>is probably a bit faster (EDO) vs my fast page mode memory.  But my
>Orion
>chipset for 4 processors only supports FPM for interleaving...
>
>>I have a P133 laptop with a processor which is exactly 133Mhz,
>>and i have a pro 200Mhz 256kb cache with edoram, which is way slower
>>than
>>your pro.
>
>Wrong, as I said... Yours is actually a little faster, as my old Gateway
>with EDO ram was a couple of percent faster than my current machine...
>
>
>>  My datastructure doesn't fit within 256kb cache, except when
>>i'm
>>just generating moves... ...so that'll be a fair compare except for the
>>SDRAM.
>>I'm not sure what influence it has on just generating moves over and
>>over again.
>>
>>>Black(1): perf
>>>generated 3300000 moves, time=1.38 seconds
>>>generated 2391304 moves per second
>>>generated/made/unmade 3300000 moves, time=4.94 seconds
>>>generated/made/unmade 668016 moves per second
>>
>>We must not forget that if i generate moves (3 times faster), that
>>i store an entire structure for every move, where you only do
>>
>>*move++ = int;
>>
>>We should take this into account.
>
>no reason...  You said "you can generate moves 3x faster".  I gave you
>a number of moves I can generate per second.  If you can't generate them
>that fast, then you aren't 3x faster.  I do things you don't do... you
>do things I don't do.  But we both generate moves...
>
>
>
>
>>
>>I'm storing a lot of different stuff with a move which gets used in the
>>move ordering and evaluation.
>>
>>I'm now here under UNIX at a slow HP9000 model 712 (60Mhz).
>>When i'm at home i do this test special for you, where i shall replace
>>that
>>storing of datastructure.
>>
>>>IE I can generate all moves from that position for white, at 2.4M moves
>>>per second, or I can generate all the moves, and then make each one, and
>>>do that at 670K moves per second...
>>
>>>What's your speed?  here is the position, diagram and FEN:
>>
>>We should compare branching factor, that would be nice.
>>You may use your 512MB SDRAM 4 processor against my pro200,
>>also nice would be to do this generating of moves at a K6 or P133, which
>>clearly will show that factor 3 to 4.
>
>
>you keep saying that... but you are wrong.  Again, no SDRAM, which
>wouldn't
>matter anyway until someone gets a BX motherboard that can use the
>faster
>SDRAM cycle time for the second through fourth words of memory.  And I
>don't
>have 512K of cache... they were too expensive when I bought this
>machine.
>
>
>
>
>>
>>>Black(1): d
>>>
>>>       +---+---+---+---+---+---+---+---+
>>>    8  |   |   | *R|   |   | *R| *K|   |
>>>       +---+---+---+---+---+---+---+---+
>>>    7  |   | *B| *Q| *N| *B| *P| *P|   |
>>>       +---+---+---+---+---+---+---+---+
>>>    6  |   | *P|   | *P| *P| *N|   | *P|
>>>       +---+---+---+---+---+---+---+---+
>>>    5  | *P| P |   |   |   |   |   |   |
>>>       +---+---+---+---+---+---+---+---+
>>>    4  | N |   | P |   | P |   |   |   |
>>>       +---+---+---+---+---+---+---+---+
>>>    3  | P |   |   | B |   | N |   | P |
>>>       +---+---+---+---+---+---+---+---+
>>>    2  |   | B |   |   | Q | P | P |   |
>>>       +---+---+---+---+---+---+---+---+
>>>    1  | R |   |   | R |   |   | K |   |
>>>       +---+---+---+---+---+---+---+---+
>>>         a   b   c   d   e   f   g   h
>>>
>>
>>Cool this has a lot of possibilities. That'll go with many many million
>>moves
>>a second!
>>
>
>let's see your numbers.  But you can't take anything out.  Otherwise
>I have a lot of stuff I'll take out too, particularly in MakeMove()
>which
>is the second set of numbers above...  And use your pentium pro since it
>is
>only a touch faster than mine, then we can compare real numbers...
>
>
>
>>>2r2rk/1bqnbpp/1p1ppn1p/pP/N1P1P/P2B1N1P/1B2QPP/R2R2K b


one thing I didn't explain that deserves a comment.  Probably the
fastest
way I know to generate moves is the "move table" approach, which lets
you
almost reduce a move generator to a memory-to-memory copy.  which on a
PC
is not a great thing.  IE on the pentium pro/200mhz machine, a memory
read takes about 20 clock cycles (100ns).  Ditto for each memory write.
So a memory copy runs at a rate of one word every 200ns, or about 5
million words per second, max.

When we take programs like darkthought or crafty (or other bitmap
programs) to a 64 bit architecture, that has a FirstOne() or LeadZ()
type instruction, then the bitmap generators get a big boost.. because
we only have to do 1/2 the memory traffic (stores only) since the moves
come from the 64bit register rather than memory.  And while the CPU is
waiting on the slow memory write, it can cycle back thru the loop, and
be ready for the next store as soon as the current one finishes
(actually,
good architectures can handle a backlog of delayed writes, but they
eventually fill the delayed write queue and stall.  But the point is,
a bitmap program will be able to drive memory at 10M words per second
(assuming P6/200 memory speeds) which is 10M moves per second, rather
than the read/write copy scheme that will be only 1/2 the speed.

Once we get the firstone in hardware so we don't have to waddle off to
a procedure for every move...



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.