Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Question to Bob: Crafty , Alpha and FindBit()

Author: Vincent Diepeveen

Date: 08:34:37 06/10/98

Go up one level in this thread



On June 08, 1998 at 22:50:16, Robert Hyatt wrote:

[about generating speed]

I had something left to post. After we talked at ICC, i got some
questions
so i'll put the comparision here too.

At my Pro200 crafty is indeed faster, but that has a different reason:
i'm running a crafty version which is compiled with microsoft visual
c++,
just like my program is, and at your pro you compile under linux, so
you may be glad that there is a compiler for that platform called GCC.

So i'll use that msvc++ version to compare with.
I took 2 positions.

First i took openingsposition, as we start the game always over there,
then i secondly also took BK22 as you liked to see.

Same test also done at a PII-266 EDORAM by Bradley Woodward,
to see how fast a PII is relatively (intel says PII processor is in fact
a
pentium pro processor so we should see the same speed differences)
to the programs.

Openingsposition Pentium Pro EDO 60ns:
    Crafty  :  2,103,049
    Diep    :  3,081,664       46.5% faster

BK22 Pentium Pro EDO 60ns:
    Crafty  :  2,941,176
    Diep   :  3,699,552        25.8% faster

Openingsposition PII-266 EDO
    Crafty  : 2,557,544
    Diep    : 3,906,250       52.7% faster

BK22 PII-266 EDO
    Crafty : 3,470,031
    Diep  :  4,874,446        40.5% faster

Note that i store a move with 4 ints, where crafty stores it in 1 int.
So i'm doing 4 times more and i do that about 50% faster.

266/200 * 3669552 = 4880504 which equals 4874446 almost.
266/200 * 3081664 = 4098613 which is almost 200K slower.

So we can say that diep gets faster when Mhz gets faster, where
crafty has big problems to become faster when Mhz gets faster.

How comes?

Lonnie tested the K6-266 SDRAM
risc machine to get 3.34M generations a second
at openings position, and at 300Mhz (100Mhz bus) it gets 3.75M
generations
a second.

Still not getting near PII speed, but considering its price a very good
processor.

>one thing I didn't explain that deserves a comment.  Probably the
>fastest
>way I know to generate moves is the "move table" approach, which lets
>you
>almost reduce a move generator to a memory-to-memory copy.  which on a
>PC
>is not a great thing.  IE on the pentium pro/200mhz machine, a memory
>read takes about 20 clock cycles (100ns).  Ditto for each memory write.
>So a memory copy runs at a rate of one word every 200ns, or about 5
>million words per second, max.

>When we take programs like darkthought or crafty (or other bitmap
>programs) to a 64 bit architecture, that has a FirstOne() or LeadZ()
>type instruction, then the bitmap generators get a big boost.. because
>we only have to do 1/2 the memory traffic (stores only) since the moves
>come from the 64bit register rather than memory.  And while the CPU is
>waiting on the slow memory write, it can cycle back thru the loop, and
>be ready for the next store as soon as the current one finishes
>(actually,
>good architectures can handle a backlog of delayed writes, but they
>eventually fill the delayed write queue and stall.  But the point is,
>a bitmap program will be able to drive memory at 10M words per second
>(assuming P6/200 memory speeds) which is 10M moves per second, rather
>than the read/write copy scheme that will be only 1/2 the speed.
>
>Once we get the firstone in hardware so we don't have to waddle off to
>a procedure for every move...

Note that firstone is already in the hardware instruction at 386.
Called BSF and BSR.

Speed of BSF/R at pentium pro: 2/3 micro-op and non pairable, where
normal
instructions like MOV are 1 micro-op and pairable.

But supposing that this cache-to-register-copy gets as fast as you
want that firstone to be, then this cache approach always beats you, and
the facts are there. Please don't tell stories about why i could not
get my generation table out of cache but need to get it out of slow
memory,
caches get bigger and i have done EFFORT to get my tables within L2
cache, and in the future that'll be in L1 cache too, where in the future
that firstone of your still is that slow at intel (and suppose BSF and
BSR
get faster then you still need the same slow loop to get the squares out
AFTER you have way more overhead than i did, where i get the squares
directly).

Problems of bitboards are not the BSF and BSR instructions.
problem is that you still need that loop to get the squares out and that
you need to program for EVERY piece, where i with 1 general purpose
routine can generate all pieces.

If i would program for generation speed and store it in 1 int and
program
everything for every piece then i'm of course way more faster than the
current speed diff.

The second problem of bitboards is that you can do only a limited number
of things with it.

Suppose i have a table with mobility values for it for every square,
where
values are generally different, as controlling a1 is not as important
like controlling d4 is. Quite logical.

Can you show some sample code how you count this with bitboards?
That's pure horror in my opinion.

Further i use in my evaluation attacktables which consists out of a
number
of things, like what piece attacks it and the number of attackers.

The number of attackers is something i use a lot in my eval.
I have this number in an array int A[64], which definitely is in the
cache.
So i lookup this with  A[square].

Now i think this is also a major disadvantage of bitboards. To look it
up with bitboards you need to add all kinds of different things with
masses
of 'if'.

I'm not asking to show how you do this in bitboards, that would give you
nightmares. If you call a function then show the audience how much
things that function does, JUST TO CALCULATE IT FOR 1 SQUARE,
where i have all 64 ready.

Where you are wating till you get an alpha for free, i'm waiting for a
processor
where branch prediction doesn't take that much systemtime when it
mispredicts.

I have so many compares in my eval which takes 90% of systemtime, that
i'm totally depending on the speed of the processor after misprediction.

That would speed up my program several times, where bitboards would
slow it down another 4 times or so, and take 2 times more code too,
which
causes again some extra delay.

Greetings,
Vincent



This page took 0.01 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.