Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question to Bob: Crafty , Alpha and FindBit()

Author: Alessandro Damiani

Date: 09:21:56 06/10/98

Go up one level in this thread


On June 10, 1998 at 11:34:37, Vincent Diepeveen wrote:

>
>On June 08, 1998 at 22:50:16, Robert Hyatt wrote:
>
>[about generating speed]
>
>I had something left to post. After we talked at ICC, i got some
>questions
>so i'll put the comparision here too.
>
>At my Pro200 crafty is indeed faster, but that has a different reason:
>i'm running a crafty version which is compiled with microsoft visual
>c++,
>just like my program is, and at your pro you compile under linux, so
>you may be glad that there is a compiler for that platform called GCC.
>
>So i'll use that msvc++ version to compare with.
>I took 2 positions.
>
>First i took openingsposition, as we start the game always over there,
>then i secondly also took BK22 as you liked to see.
>
>Same test also done at a PII-266 EDORAM by Bradley Woodward,
>to see how fast a PII is relatively (intel says PII processor is in fact
>a
>pentium pro processor so we should see the same speed differences)
>to the programs.
>
>Openingsposition Pentium Pro EDO 60ns:
>    Crafty  :  2,103,049
>    Diep    :  3,081,664       46.5% faster
>
>BK22 Pentium Pro EDO 60ns:
>    Crafty  :  2,941,176
>    Diep   :  3,699,552        25.8% faster
>
>Openingsposition PII-266 EDO
>    Crafty  : 2,557,544
>    Diep    : 3,906,250       52.7% faster
>
>BK22 PII-266 EDO
>    Crafty : 3,470,031
>    Diep  :  4,874,446        40.5% faster
>
>Note that i store a move with 4 ints, where crafty stores it in 1 int.
>So i'm doing 4 times more and i do that about 50% faster.
>
>266/200 * 3669552 = 4880504 which equals 4874446 almost.
>266/200 * 3081664 = 4098613 which is almost 200K slower.
>
>So we can say that diep gets faster when Mhz gets faster, where
>crafty has big problems to become faster when Mhz gets faster.
>
>How comes?
>
>Lonnie tested the K6-266 SDRAM
>risc machine to get 3.34M generations a second
>at openings position, and at 300Mhz (100Mhz bus) it gets 3.75M
>generations
>a second.
>
>Still not getting near PII speed, but considering its price a very good
>processor.
>
>>one thing I didn't explain that deserves a comment.  Probably the
>>fastest
>>way I know to generate moves is the "move table" approach, which lets
>>you
>>almost reduce a move generator to a memory-to-memory copy.  which on a
>>PC
>>is not a great thing.  IE on the pentium pro/200mhz machine, a memory
>>read takes about 20 clock cycles (100ns).  Ditto for each memory write.
>>So a memory copy runs at a rate of one word every 200ns, or about 5
>>million words per second, max.
>
>>When we take programs like darkthought or crafty (or other bitmap
>>programs) to a 64 bit architecture, that has a FirstOne() or LeadZ()
>>type instruction, then the bitmap generators get a big boost.. because
>>we only have to do 1/2 the memory traffic (stores only) since the moves
>>come from the 64bit register rather than memory.  And while the CPU is
>>waiting on the slow memory write, it can cycle back thru the loop, and
>>be ready for the next store as soon as the current one finishes
>>(actually,
>>good architectures can handle a backlog of delayed writes, but they
>>eventually fill the delayed write queue and stall.  But the point is,
>>a bitmap program will be able to drive memory at 10M words per second
>>(assuming P6/200 memory speeds) which is 10M moves per second, rather
>>than the read/write copy scheme that will be only 1/2 the speed.
>>
>>Once we get the firstone in hardware so we don't have to waddle off to
>>a procedure for every move...
>
>Note that firstone is already in the hardware instruction at 386.
>Called BSF and BSR.
>
>Speed of BSF/R at pentium pro: 2/3 micro-op and non pairable, where
>normal
>instructions like MOV are 1 micro-op and pairable.
>
>But supposing that this cache-to-register-copy gets as fast as you
>want that firstone to be, then this cache approach always beats you, and
>the facts are there. Please don't tell stories about why i could not
>get my generation table out of cache but need to get it out of slow
>memory,
>caches get bigger and i have done EFFORT to get my tables within L2
>cache, and in the future that'll be in L1 cache too, where in the future
>that firstone of your still is that slow at intel (and suppose BSF and
>BSR
>get faster then you still need the same slow loop to get the squares out
>AFTER you have way more overhead than i did, where i get the squares
>directly).
>
>Problems of bitboards are not the BSF and BSR instructions.
>problem is that you still need that loop to get the squares out and that
>you need to program for EVERY piece, where i with 1 general purpose
>routine can generate all pieces.
>


wrong! pawn moves are generated differently: one shift, one and, and you
have all to-squares at once. So we skip blocked pawns. Pawn captures in
the similar way. Here bitboards are best.


>If i would program for generation speed and store it in 1 int and
>program
>everything for every piece then i'm of course way more faster than the
>current speed diff.
>
>The second problem of bitboards is that you can do only a limited number
>of things with it.
>
>Suppose i have a table with mobility values for it for every square,
>where
>values are generally different, as controlling a1 is not as important
>like controlling d4 is. Quite logical.
>
>Can you show some sample code how you count this with bitboards?
>That's pure horror in my opinion.


very simple (using my board representation: no rotated bitboards):

int MobKnight[64];
int MobBishopDown[64][256];
int MobBishopUp[64][256];
int MobRookHorizontal[64][256];
int MobRookVertical[64][256];

this tables are precomputed and contains the mobility score.

evaluating:

mobility of a knight on square s: MobKnight[s]
mobility of a bishop on square s: MobBishopDown[s][SlideIndexDD(s)]+
				  MobBishopUp[s][SlideIndexDU(s)]

mobility of a rook on square s: MobRookHorizontal[s][SlideIndexH(s)]+
				MobRookVertical[s][SlideIndexV(s)]

no loops! the cost is O(1)!


>
>Further i use in my evaluation attacktables which consists out of a
>number
>of things, like what piece attacks it and the number of attackers.
>
>The number of attackers is something i use a lot in my eval.
>I have this number in an array int A[64], which definitely is in the
>cache.
>So i lookup this with  A[square].
>
>Now i think this is also a major disadvantage of bitboards. To look it
>up with bitboards you need to add all kinds of different things with
>masses
>of 'if'.

why 'if's? to get #attackers to a square the cost is O(1).


>
>I'm not asking to show how you do this in bitboards, that would give you
>nightmares. If you call a function then show the audience how much
>things that function does, JUST TO CALCULATE IT FOR 1 SQUARE,
>where i have all 64 ready.
>
>Where you are wating till you get an alpha for free, i'm waiting for a
>processor
>where branch prediction doesn't take that much systemtime when it
>mispredicts.
>
>I have so many compares in my eval which takes 90% of systemtime, that
>i'm totally depending on the speed of the processor after misprediction.
>
>That would speed up my program several times, where bitboards would
>slow it down another 4 times or so, and take 2 times more code too,
>which
>causes again some extra delay.
>
>Greetings,
>Vincent



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.