Computer Chess Club Archives



Subject: Re: Question to Bob: Crafty, Alpha and FindBit()

Author: Robert Hyatt

Date: 12:57:49 06/10/98

On June 10, 1998 at 11:34:37, Vincent Diepeveen wrote:

>
>On June 08, 1998 at 22:50:16, Robert Hyatt wrote:
>
>[about generating speed]
>
>I had something left to post.  After we talked on ICC, I got some
>questions, so I'll put the comparison here too.
>
>At my Pro 200 Crafty is indeed faster, but that has a different reason:
>I'm running a Crafty version compiled with Microsoft Visual C++, just
>like my program is, while on your Pro you compile under Linux, so you
>may be glad that there is a compiler for that platform called GCC.
>
>So I'll use that MSVC++ version to compare with.
>I took two positions.
>
>First I took the opening position, since that is where every game
>starts; second, I also took BK22, as you wanted to see.
>
>The same test was also run on a PII-266 with EDO RAM by Bradley
>Woodward, to see how fast a PII is relative to these programs (Intel
>says the PII is in fact a Pentium Pro core, so we should see the same
>speed differences).
>

A pentium II is not as good as a pentium pro.  The PPro L2 cache runs at
full processor clock speed, while the PII L2 cache runs at half the
processor clock speed, which is a measurable penalty for the PII.


>Opening position, Pentium Pro, EDO 60ns:
>    Crafty  :  2,103,049
>    Diep    :  3,081,664       46.5% faster
>
>BK22, Pentium Pro, EDO 60ns:
>    Crafty  :  2,941,176
>    Diep    :  3,699,552       25.8% faster
>
>Opening position, PII-266, EDO:
>    Crafty  :  2,557,544
>    Diep    :  3,906,250       52.7% faster
>
>BK22, PII-266, EDO:
>    Crafty  :  3,470,031
>    Diep    :  4,874,446       40.5% faster
>
>Note that I store a move in 4 ints, where Crafty stores it in 1 int.
>So I'm moving four times as much data and still do it about 50% faster.

wrong.  Note that my FirstOne() function does between 1 and 4 memory
reads to find the first one bit...  which will totally go away on the
new Alpha processor, since it will be done in hardware just like it is
done on the Cray...
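
For the curious, here is a rough sketch of that kind of table-driven
FirstOne() (this is not Crafty's exact code; the table name, the chunk
order and the bit-numbering convention are only for illustration):

    /* first_one_16[v] holds the index of the highest set bit in the
       16-bit value v, so finding a bit in a 64-bit bitboard costs
       between one and four table reads; assumes bb is nonzero */
    extern unsigned char first_one_16[65536];

    int FirstOne(unsigned long long bb) {
      if (bb >> 48)            return first_one_16[bb >> 48] + 48;
      if ((bb >> 32) & 0xffff) return first_one_16[(bb >> 32) & 0xffff] + 32;
      if ((bb >> 16) & 0xffff) return first_one_16[(bb >> 16) & 0xffff] + 16;
      return first_one_16[bb & 0xffff];
    }

Each of those table reads is a memory access that a hardware bit-scan
instruction would make disappear, which is the point of what follows.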



>
>266/200 * 3,669,552 = 4,880,504, which is almost exactly the measured
>4,874,446.
>266/200 * 3,081,664 = 4,098,613, where the measured figure is almost
>200K slower.
>
>So we can say that Diep speeds up as the clock goes up, where Crafty
>has big problems speeding up when the MHz goes up.
>
>How come?
>

FirstOne() is the culprit..  On a new machine, our move generators break
down like this:  you do four loads and four stores per move, assuming
you are using 4 ints everywhere, including your move table.  Crafty is
going to do only one store per move, plus 1, 2, 3 or 4 memory loads per
piece that it generates moves for.  My bandwidth is therefore going to
be something like 1.5 memory accesses per move, average, which will be
quite fast.  At present, for every move I generate, I have to call
FirstOne(), which looks up 16-bit chunks of the 64-bit bitmap in a large
array.  Which means most of my memory bandwidth in the move generator is
FirstOne() rather than simply pumping moves to memory.

At present, this stresses the PC bandwidth, and, in particular, the
grossly bad L2 cache on the PII, so that each of my table lookups, even
when it hits cache, takes twice as long to satisfy as it does on the
Pentium Pro.  The PII is just not a good design at present, when
compared to the Pro...

This will change when the Slot 2 PII hits the street later this year,
with a cpu-clock-speed cache once again...




>Lonnie tested the K6-266 with SDRAM, a RISC-core machine, and got 3.34M
>generations a second at the opening position; at 300MHz (100MHz bus) it
>gets 3.75M generations a second.
>
>Still not getting near PII speed, but considering its price a very good
>processor.
>
>>one thing I didn't explain that deserves a comment.  Probably the
>>fastest way I know to generate moves is the "move table" approach,
>>which lets you almost reduce a move generator to a memory-to-memory
>>copy.  Which on a PC is not a great thing.  IE on the pentium pro/200mhz
>>machine, a memory read takes about 20 clock cycles (100ns).  Ditto for
>>each memory write.  So a memory copy runs at a rate of one word every
>>200ns, or about 5 million words per second, max.
>
>>When we take programs like darkthought or crafty (or other bitmap
>>programs) to a 64 bit architecture that has a FirstOne() or LeadZ()
>>type instruction, the bitmap generators get a big boost, because we
>>only have to do 1/2 the memory traffic (stores only) since the moves
>>come from the 64 bit register rather than memory.  And while the CPU is
>>waiting on the slow memory write, it can cycle back thru the loop and
>>be ready for the next store as soon as the current one finishes
>>(actually, good architectures can handle a backlog of delayed writes,
>>but they eventually fill the delayed write queue and stall).  But the
>>point is, a bitmap program will be able to drive memory at 10M words
>>per second (assuming P6/200 memory speeds), which is 10M moves per
>>second, rather than the read/write copy scheme that will be only 1/2
>>the speed.
>>
>>Once we get the firstone in hardware, so we don't have to waddle off to
>>a procedure for every move...
>
>Note that firstone has been a hardware instruction since the 386,
>called BSF and BSR.
>

not exactly.  It is there, but it operates on a 32 bit value, not a
64 bit value.  So it has to be called twice, and the result has to
be adjusted depending on which call finds a 1 bit..
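
In C, that two-step scan looks roughly like this (bsf32() stands in for
the 32-bit BSF instruction, reached through inline assembly or a
compiler intrinsic, and is an assumption on my part):

    /* scan a 64-bit bitboard with 32-bit BSF: try the low half first,
       then the high half, adjusting the result by 32 when needed;
       assumes bb is nonzero */
    int FirstOne64(unsigned long long bb) {
      unsigned int lo = (unsigned int) bb;
      unsigned int hi = (unsigned int) (bb >> 32);
      if (lo) return bsf32(lo);      /* bit found in the low 32 bits      */
      return bsf32(hi) + 32;         /* otherwise adjust for the top half */
    }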


>Speed of BSF/BSR on the Pentium Pro: 2/3 micro-ops and not pairable,
>where normal instructions like MOV are 1 micro-op and pairable.

actually this is not true, according to my PPro manual.  The BSF/BSR is
broken up into RISC (micro-op) instructions that run along with
everything else.  The Pro doesn't "pair" instructions at all...  That's
the old pentium approach...


>
>But supposing that this cache-to-register copy gets as fast as you want
>that FirstOne to be, then this cache approach always beats you, and the
>facts are there.  Please don't tell stories about why I could not get my
>generation table out of cache but would need to get it out of slow
>memory; caches get bigger and I have made an EFFORT to get my tables
>within the L2 cache, and in the future they will be in L1 cache too,
>where that FirstOne of yours will still be that slow at Intel (and
>suppose BSF and BSR get faster, then you still need the same slow loop
>to get the squares out, AFTER you have way more overhead than I did,
>where I get the squares directly).

FirstOne() doesn't have to be slow.  It is done in one clock on the
Cray, and is done in one clock on the Intel in the FPU for
normalization.  So it will be fast on next-generation machines, as it
should be...
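
As an aside, that normalization trick can even be written down in C, at
least as an illustration (this sketch assumes IEEE-754 doubles and a
compiler that tolerates the union type-punning; it is not Crafty code):

    /* find the lowest set bit by isolating it (a power of two converts
       to double exactly) and reading the exponent field that the FPU's
       normalizer fills in; assumes bb is nonzero */
    int LowestOne(unsigned long long bb) {
      union { double d; unsigned long long i; } u;
      u.d = (double) (bb & (0ULL - bb));          /* isolate lowest bit  */
      return (int) ((u.i >> 52) & 0x7ff) - 1023;  /* unbias the exponent */
    }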


>
>The problem of bitboards is not the BSF and BSR instructions.
>The problem is that you still need that loop to get the squares out, and
>that you need to program it for EVERY piece, where I can generate all
>pieces with one general-purpose routine.

you are underestimating, but that's ok.  You are barely 50% faster now.
And when my FirstOne() goes away in the move generation loop, my move
generator will be way more than 50% faster, just from getting rid of the
FirstOne() memory traffic.  Then I end up with a very short loop with
3-4 instructions plus a memory write.  The 3-4 instructions get hidden
behind the write, and everything basically runs as fast as memory can
take the data.  Which is also your limit, except you will be pumping 8X
as much data as I do, once we reach that architecture with a real
FirstOne() [64 bit] in hardware.
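
To make the shape of that loop concrete, here is a hedged sketch (the
move encoding, the bit-clear step and the FirstOne() call are
illustrative, not Crafty's actual generator):

    /* serialize one piece's target squares: one bit scan, one bit clear
       and one packed-int store per generated move */
    int *GenerateTo(int from, unsigned long long moves, int *move_list) {
      while (moves) {                        /* bitboard of target squares  */
        int to = FirstOne(moves);            /* ideally a hardware bit scan */
        moves &= ~(1ULL << to);              /* clear that square           */
        *move_list++ = from | (to << 6);     /* one int per move, one store */
      }
      return move_list;
    }

With a one-clock FirstOne(), the loop body is the bit scan, the clear,
the pack and the store, and the store is the only memory traffic.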



>
>If I were to program for generation speed, store a move in 1 int, and
>program everything per piece, then of course the speed difference would
>be far bigger than it is now.
>
>The second problem of bitboards is that you can only do a limited number
>of things with them.
>
>Suppose I have a table with mobility values for every square, where the
>values are generally different, as controlling a1 is not as important as
>controlling d4.  Quite logical.
>
>Can you show some sample code for how you count this with bitboards?
>That's pure horror in my opinion.
>

easy on the Cray.  Load your 64 table words into an array (vector).
Load the bitmap with 1's on each square that piece can reach into the
vector mask register.  Do a vector merge, which returns me *just* the
values from the table where your bishop (or whatever) can reach.  Sum
those with a vector add and I'm done.

On a non-vector machine it takes more time, but it still can be very
efficient.  Your problem is that you haven't used bitmaps.  And it takes
a good bit of time to "think bitmaps"...  just as it takes a good bit of
time to "think vector processing" to use a supercomputer.  15 minutes
won't do it...
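
On a PC it boils down to a short loop over the attack bitboard,
something like this (mobility_value[], the FirstOne() call and the bit
numbering are again assumptions made just for illustration):

    /* sum the per-square mobility values for every square this piece
       reaches, walking the set bits of its attack bitboard */
    int MobilityScore(unsigned long long attacks, const int mobility_value[64]) {
      int score = 0;
      while (attacks) {
        int sq = FirstOne(attacks);      /* next reachable square        */
        attacks &= ~(1ULL << sq);        /* remove it from the bitboard  */
        score += mobility_value[sq];     /* d4 can be worth more than a1 */
      }
      return score;
    }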




>Further, in my evaluation I use attack tables which consist of a number
>of things, like which pieces attack a square and the number of
>attackers.
>
>The number of attackers is something I use a lot in my eval.
>I have this number in an array int A[64], which definitely is in the
>cache, so I look it up with A[square].
>
>Now I think this is also a major disadvantage of bitboards.  To look it
>up with bitboards you need to add up all kinds of different things with
>masses of 'if's.
>

I can compute that same array quite easily, and probably quite a bit
faster than you can compute it in the first place...  that's the point
here.  You want to find ways to keep everything in bitmaps, but that's
not the most efficient way at times.  But nothing says I can't use
bitmaps to produce that A[64] array.  In fact, if you go back to Crafty
versions 1-5, you will find just such an array, which was updated
incrementally and was accessible just by fetching the entry you
wanted...  I just didn't find good uses for that info and took it out,
choosing to compute the attacks dynamically at the point where they are
needed, so I can avoid computing them when they aren't needed, to save
time.
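
Just to make that concrete, producing Vincent's A[64] from bitmaps can
be as simple as this sketch (AttacksTo(), returning a bitboard of every
piece bearing on a square, and PopCount() are the kind of primitives a
bitmap program keeps around; the exact names here are assumptions):

    /* rebuild the attacker-count array: one attack bitboard and one
       population count per square */
    void BuildAttackerCounts(int A[64]) {
      int sq;
      for (sq = 0; sq < 64; sq++)
        A[sq] = PopCount(AttacksTo(sq));   /* number of attackers of sq */
    }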



>I'm not asking you to show how you do this with bitboards, that would
>give you nightmares.  If you call a function, then show the audience how
>much work that function does, JUST TO CALCULATE IT FOR 1 SQUARE, where I
>have all 64 ready.
>
>Where you are waiting until you get an Alpha for free, I'm waiting for a
>processor where branch prediction doesn't cost that much system time
>when it mispredicts.


the pentium pro is not bad.  And if your compiler is any good, it can
use the conditional move instruction, which lets you copy from one of
two memory addresses to a register, or from one of two registers to
another register, without the necessity of a branch of any kind...  The
Pro will already get all but 1 of each loop iteration's branches
right...  which is the majority of the branches in a program...  And
with speculative execution, it doesn't take a huge hit when it
mispredicts..  and with "deep branch prediction and speculative
execution" most branches cause no problems whether they are taken or
not.  You need to read the "Pentium Pro Family Developer's Manual,
Volume 2: Programmer's Reference Manual" published by Intel.  It
explains all and shows why branches are not hurting your performance
nearly as much as you seem to think...
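
The conditional-move point is easy to see in C: written like this, a
compiler targeting the Pro can select the value with CMOVcc instead of
a conditional jump (a toy example, taken from neither program):

    /* pick the larger of two scores; on the Pentium Pro a good compiler
       can compile the ?: operator to a conditional move, so no branch
       is needed and nothing can be mispredicted */
    int BetterScore(int score1, int score2) {
      return (score1 > score2) ? score1 : score2;
    }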


>
>I have so many compares in my eval, which takes 90% of the system time,
>that I'm totally dependent on the speed of the processor after a
>misprediction.


again, this is not exactly true..  on a branch, the cpu follows *both*
paths until it becomes certain which path it can throw away.  And it
works those instructions in thru its "pooling" to use cycles when a
functional unit would otherwise be idle anyway...  the penalty is much
less than you think...


>
>That would speed up my program several times over, where bitboards would
>slow it down another 4 times or so, and take 2 times more code too, which
>again causes some extra delay.

perfect branch prediction wouldn't speed you up by 25%, much less 4
times or so...  again, Intel has data that shows best and worst cases,
and the worst case is not 50%, much less 400%.



>
>Greetings,
>Vincent



