Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question to Bob: Crafty , Alpha and FindBit()

Author: Vincent Diepeveen

Date: 06:00:21 06/11/98

Go up one level in this thread



On June 10, 1998 at 12:21:56, Alessandro Damiani wrote:

>On June 10, 1998 at 11:34:37, Vincent Diepeveen wrote:
>
>>
>>On June 08, 1998 at 22:50:16, Robert Hyatt wrote:
>>
>>[about generating speed]
>>
>>I had something left to post. After we talked at ICC, i got some
>>questions
>>so i'll put the comparision here too.
>>
>>At my Pro200 crafty is indeed faster, but that has a different reason:
>>i'm running a crafty version which is compiled with microsoft visual
>>c++,
>>just like my program is, and at your pro you compile under linux, so
>>you may be glad that there is a compiler for that platform called GCC.
>>
>>So i'll use that msvc++ version to compare with.
>>I took 2 positions.
>>
>>First i took openingsposition, as we start the game always over there,
>>then i secondly also took BK22 as you liked to see.
>>
>>Same test also done at a PII-266 EDORAM by Bradley Woodward,
>>to see how fast a PII is relatively (intel says PII processor is in fact
>>a
>>pentium pro processor so we should see the same speed differences)
>>to the programs.
>>
>>Openingsposition Pentium Pro EDO 60ns:
>>    Crafty  :  2,103,049
>>    Diep    :  3,081,664       46.5% faster
>>
>>BK22 Pentium Pro EDO 60ns:
>>    Crafty  :  2,941,176
>>    Diep   :  3,699,552        25.8% faster
>>
>>Openingsposition PII-266 EDO
>>    Crafty  : 2,557,544
>>    Diep    : 3,906,250       52.7% faster
>>
>>BK22 PII-266 EDO
>>    Crafty : 3,470,031
>>    Diep  :  4,874,446        40.5% faster
>>
>>Note that i store a move with 4 ints, where crafty stores it in 1 int.
>>So i'm doing 4 times more and i do that about 50% faster.
>>
>>266/200 * 3669552 = 4880504 which equals 4874446 almost.
>>266/200 * 3081664 = 4098613 which is almost 200K slower.
>>
>>So we can say that diep gets faster when Mhz gets faster, where
>>crafty has big problems to become faster when Mhz gets faster.
>>
>>How comes?
>>
>>Lonnie tested the K6-266 SDRAM
>>risc machine to get 3.34M generations a second
>>at openings position, and at 300Mhz (100Mhz bus) it gets 3.75M
>>generations
>>a second.
>>
>>Still not getting near PII speed, but considering its price a very good
>>processor.
>>
>>>one thing I didn't explain that deserves a comment.  Probably the
>>>fastest
>>>way I know to generate moves is the "move table" approach, which lets
>>>you
>>>almost reduce a move generator to a memory-to-memory copy.  which on a
>>>PC
>>>is not a great thing.  IE on the pentium pro/200mhz machine, a memory
>>>read takes about 20 clock cycles (100ns).  Ditto for each memory write.
>>>So a memory copy runs at a rate of one word every 200ns, or about 5
>>>million words per second, max.
>>
>>>When we take programs like darkthought or crafty (or other bitmap
>>>programs) to a 64 bit architecture, that has a FirstOne() or LeadZ()
>>>type instruction, then the bitmap generators get a big boost.. because
>>>we only have to do 1/2 the memory traffic (stores only) since the moves
>>>come from the 64bit register rather than memory.  And while the CPU is
>>>waiting on the slow memory write, it can cycle back thru the loop, and
>>>be ready for the next store as soon as the current one finishes
>>>(actually,
>>>good architectures can handle a backlog of delayed writes, but they
>>>eventually fill the delayed write queue and stall.  But the point is,
>>>a bitmap program will be able to drive memory at 10M words per second
>>>(assuming P6/200 memory speeds) which is 10M moves per second, rather
>>>than the read/write copy scheme that will be only 1/2 the speed.
>>>
>>>Once we get the firstone in hardware so we don't have to waddle off to
>>>a procedure for every move...
>>
>>Note that firstone is already in the hardware instruction at 386.
>>Called BSF and BSR.
>>
>>Speed of BSF/R at pentium pro: 2/3 micro-op and non pairable, where
>>normal
>>instructions like MOV are 1 micro-op and pairable.
>>
>>But supposing that this cache-to-register-copy gets as fast as you
>>want that firstone to be, then this cache approach always beats you, and
>>the facts are there. Please don't tell stories about why i could not
>>get my generation table out of cache but need to get it out of slow
>>memory,
>>caches get bigger and i have done EFFORT to get my tables within L2
>>cache, and in the future that'll be in L1 cache too, where in the future
>>that firstone of your still is that slow at intel (and suppose BSF and
>>BSR
>>get faster then you still need the same slow loop to get the squares out
>>AFTER you have way more overhead than i did, where i get the squares
>>directly).
>>
>>Problems of bitboards are not the BSF and BSR instructions.
>>problem is that you still need that loop to get the squares out and that
>>you need to program for EVERY piece, where i with 1 general purpose
>>routine can generate all pieces.
>>
>
>
>wrong! pawn moves are generated differently: one shift, one and, and you
>have all to-squares at once. So we skip blocked pawns. Pawn captures in
>the similar way. Here bitboards are best.

No sir,
you're missing the whole problem of bitboards. How do we get them
out of that bitmap, with from and to-square?

Right you need a loop, which is just as fast as my whole generation is
if you may use BSF/R at 64 bits ints, so then overhead of bitboards
are those AND, OR and other stuff.

Directly this also shows my point: you need to write code for every
piece. Special code for every godda... piece.

Good luck with it.

>>If i would program for generation speed and store it in 1 int and
>>program
>>everything for every piece then i'm of course way more faster than the
>>current speed diff.
>>
>>The second problem of bitboards is that you can do only a limited number
>>of things with it.
>>
>>Suppose i have a table with mobility values for it for every square,
>>where
>>values are generally different, as controlling a1 is not as important
>>like controlling d4 is. Quite logical.
>>
>>Can you show some sample code how you count this with bitboards?
>>That's pure horror in my opinion.
>
>
>very simple (using my board representation: no rotated bitboards):
>
>int MobKnight[64];
>int MobBishopDown[64][256];
>int MobBishopUp[64][256];
>int MobRookHorizontal[64][256];
>int MobRookVertical[64][256];

that's together (assuming 64 bits as you need that with bitboards)
8 bytes x  5 x 64 x 256 = 655 kilobyte, just for bishop knight and rook
mobility.

>this tables are precomputed and contains the mobility score.

Cool. Yet you should make them some gigabytes bigger look to my
explanation
below.

I think the problem is in the 'definition'. What do you want to do with
mobility?

I want to count the relevant squares. You too, but what is a relevant
square? Below you get a chess lesson for free.

>evaluating:
>
>mobility of a knight on square s: MobKnight[s]
>mobility of a bishop on square s: MobBishopDown[s][SlideIndexDD(s)]+
>				  MobBishopUp[s][SlideIndexDU(s)]
>
>mobility of a rook on square s: MobRookHorizontal[s][SlideIndexH(s)]+
>				MobRookVertical[s][SlideIndexV(s)]
>
>no loops! the cost is O(1)!

This doesn't work.

You're still missing my point, or perhaps i was not clear enough, i
assumed
you would see the problem, but it seems whole bunches of programmers
miss
it.

It has to do with some 'chess insight'. I'll give you some chess lessons
for
free:

Suppose we have 2 cases for a rook
at the horizontal axis:
   a)  rook at a1 where we count fields    b1,c1,d1,e1
       it's clear that this works for above representation. this is
probably
      what you have in mind.
   b) rook at a1 where we count fields  b1,d1,e1
     You can't do this. BUT YOU DON'T WANT TO COUNT C1 AS YOU
     DON'T CONTROL IT, for example because there are doubled rooks
     of opponent at c-file which is open.

Second example. Suppose a knight at d4 does control e6 well enough,
but not c6. How do you do this in bitboards, i can't see it different
than that you need to loop them. If you loop them you can count
directly.

The above representation of mobility is from 1950: just counting number
of legal moves multiplied with an identifier for every square.

Some square you DO control, others you DON'T control, although you
attack it.

The better this code, the better your program.

I think i made my point. You can do some
old heuristical crap with bitboards without being that much slower.
In the future you can do it faster than you can do it now.

Yet, Bitboards are already outdated, and currently too slow.

>>Further i use in my evaluation attacktables which consists out of a
>>number
>>of things, like what piece attacks it and the number of attackers.
>>
>>The number of attackers is something i use a lot in my eval.
>>I have this number in an array int A[64], which definitely is in the
>>cache.
>>So i lookup this with  A[square].
>>
>>Now i think this is also a major disadvantage of bitboards. To look it
>>up with bitboards you need to add all kinds of different things with
>>masses
>>of 'if'.
>
>why 'if's? to get #attackers to a square the cost is O(1).

There is difference between my O(1) and bitboard O(1).

In bitboards you need to count for every piece whether it attacks that
point. That are masses of additions.

Theoretical that's O(1), you are correct in that.

For me it's just   n_attackers[square].

Bitboards you need to convert with 'if's everything before adding.
Quite slow.

Horrible slow is a better word.

But You're right it's still O(1).

This has however nothing to do with
algorithmic design, but with overhead & implementation.

>>I'm not asking to show how you do this in bitboards, that would give you
>>nightmares. If you call a function then show the audience how much
>>things that function does, JUST TO CALCULATE IT FOR 1 SQUARE,
>>where i have all 64 ready.
>>
>>Where you are wating till you get an alpha for free, i'm waiting for a
>>processor
>>where branch prediction doesn't take that much systemtime when it
>>mispredicts.
>>
>>I have so many compares in my eval which takes 90% of systemtime, that
>>i'm totally depending on the speed of the processor after misprediction.
>>
>>That would speed up my program several times, where bitboards would
>>slow it down another 4 times or so, and take 2 times more code too,
>>which
>>causes again some extra delay.
>>
>>Greetings,
>>Vincent



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.