Computer Chess Club Archives


Subject: Re: Question to Bob: Crafty, Alpha and FindBit()

Author: Eugene Nalimov

Date: 13:56:58 06/10/98



On June 10, 1998 at 15:57:49, Robert Hyatt wrote:

>On June 10, 1998 at 11:34:37, Vincent Diepeveen wrote:
>
>>
>>On June 08, 1998 at 22:50:16, Robert Hyatt wrote:
>>
>>[about generating speed]
>>
>>I had something left to post. After we talked on ICC, I got some
>>questions, so I'll put the comparison here too.
>>
>>At my Pro200 Crafty is indeed faster, but that has a different reason:
>>I'm running a Crafty version compiled with Microsoft Visual C++,
>>just like my program is, while on your Pro you compile under Linux, so
>>you may be glad that there is a compiler for that platform called GCC.
>>
>>So I'll use that MSVC++ version to compare with.
>>I took 2 positions.
>>
>>First I took the opening position, as we always start the game there,
>>and second I also took BK22, as you wanted to see.
>>
>>The same test was also done on a PII-266 with EDO RAM by Bradley
>>Woodward, to see how fast a PII is relatively (Intel says the PII is
>>in fact a Pentium Pro processor, so we should see the same speed
>>differences) for the programs.
>>
>
>A pentium II is not as good as a pentium pro.  The Ppro L2 cache runs
>at processor clock speed, while the PII L2 cache runs at 1/2 the
>processor clock speed, which is a measurable penalty for the PII.
>
>
>>Openingsposition Pentium Pro EDO 60ns:
>>    Crafty  :  2,103,049
>>    Diep    :  3,081,664       46.5% faster
>>
>>BK22 Pentium Pro EDO 60ns:
>>    Crafty  :  2,941,176
>>    Diep   :  3,699,552        25.8% faster
>>
>>Openingsposition PII-266 EDO
>>    Crafty  : 2,557,544
>>    Diep    : 3,906,250       52.7% faster
>>
>>BK22 PII-266 EDO
>>    Crafty : 3,470,031
>>    Diep  :  4,874,446        40.5% faster
>>
>>Note that I store a move in 4 ints, where Crafty stores it in 1 int.
>>So I'm moving 4 times the data and still doing it about 50% faster.
>
>wrong.  Note that my FirstOne() function does between 1 and 4 memory
>reads to find the first one bit...  which will totally go away on the
>new Alpha processor since it will be done in hardware just like it is
>done on the Cray...
>
>
>
>>
>>266/200 * 3699552 = 4920404, which is close to 4874446.
>>266/200 * 3081664 = 4098613, which is almost 200K slower.
>>
>>So we can say that Diep gets faster as the MHz goes up, where
>>Crafty has big problems becoming faster when the MHz goes up.
>>
>>How come?
>>
>
>FirstOne() is the culprit..  On a new machine, our move generators
>break down like this:  You do four loads and 4 stores per move,
>assuming you are using 4 ints everywhere, including your move table.
>Crafty is going to do only one store per move, plus 1, 2, 3 or 4 memory
>loads per piece that it generates moves for.  My bandwidth is therefore
>going to be something like 1.5 memory accesses per move, average, which
>will be quite fast.  At present, for every move I generate, I have to
>call FirstOne(), which looks up 16-bit chunks of the 64-bit bitmap in a
>large array.  Which means most of my memory bandwidth in the move
>generator is FirstOne() rather than simply pumping moves to memory.
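The table-driven FirstOne() described above can be sketched in C. This is a minimal illustration, not Crafty's actual code: the table name, the LSB-first bit convention, and the chunk probing order are assumptions.

```c
#include <stdint.h>

/* lsb_16[x] = index (0..15) of the lowest set bit of x; entry 0 unused. */
static unsigned char lsb_16[65536];

/* Built once at startup; the real cost is the 64K-entry table itself. */
static void init_lsb_16(void) {
    for (unsigned v = 1; v < 65536; v++) {
        unsigned b = 0;
        while (!((v >> b) & 1)) b++;
        lsb_16[v] = (unsigned char)b;
    }
}

/* Table-driven FirstOne(): probe the 64-bit map 16 bits at a time, so
   1 to 4 table reads per call, as described above.  bb must be non-zero. */
static int first_one(uint64_t bb) {
    if (bb & 0xFFFFULL)         return      lsb_16[bb & 0xFFFF];
    if (bb & 0xFFFF0000ULL)     return 16 + lsb_16[(bb >> 16) & 0xFFFF];
    if (bb & 0xFFFF00000000ULL) return 32 + lsb_16[(bb >> 32) & 0xFFFF];
    return 48 + lsb_16[bb >> 48];
}
```

Each of those table probes is a memory access, which is why the thread calls FirstOne() the dominant memory traffic in the move generator.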
>
>At present, this stresses the PC bandwidth, and, in particular, the
>grossly bad L2 cache on the PII, so that each of my table lookups, even
>when it hits cache, takes twice as long to satisfy as it does on the
>pentium pro.  The PII is just not a good design at present, when
>compared to the Pro...
>
>This will change when the slot-2 PII hits the street later this year
>with a cpu-clock-speed cache once again...
>
>
>
>
>>Lonnie tested the K6-266 SDRAM RISC machine and got 3.34M generations
>>a second at the opening position, and at 300MHz (100MHz bus) it gets
>>3.75M generations a second.
>>
>>Still not getting near PII speed, but considering its price it is a
>>very good processor.
>>
>>>one thing I didn't explain that deserves a comment.  Probably the
>>>fastest way I know to generate moves is the "move table" approach,
>>>which lets you almost reduce a move generator to a memory-to-memory
>>>copy.  Which on a PC is not a great thing.  IE on the pentium
>>>pro/200mhz machine, a memory read takes about 20 clock cycles (100ns).
>>>Ditto for each memory write.  So a memory copy runs at a rate of one
>>>word every 200ns, or about 5 million words per second, max.
>>>
>>>When we take programs like darkthought or crafty (or other bitmap
>>>programs) to a 64-bit architecture that has a FirstOne() or LeadZ()
>>>type instruction, the bitmap generators get a big boost.. because we
>>>only have to do 1/2 the memory traffic (stores only), since the moves
>>>come from the 64-bit register rather than memory.  And while the CPU
>>>is waiting on the slow memory write, it can cycle back thru the loop
>>>and be ready for the next store as soon as the current one finishes
>>>(actually, good architectures can handle a backlog of delayed writes,
>>>but they eventually fill the delayed write queue and stall).  But the
>>>point is, a bitmap program will be able to drive memory at 10M words
>>>per second (assuming P6/200 memory speeds), which is 10M moves per
>>>second, rather than the read/write copy scheme that will be only 1/2
>>>the speed.
>>>
>>>Once we get the firstone in hardware so we don't have to waddle off to
>>>a procedure for every move...
>>
>>Note that firstone has already been a hardware instruction since the
>>386, called BSF and BSR.
>>
>
>not exactly.  It is there, but it operates on a 32 bit value, not a
>64 bit value.  So it has to be called twice, and the result has to
>be adjusted depending on which call finds a 1 bit..
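The two-call adjustment Bob describes can be sketched as follows. This is a hedged illustration: `bsf32` is a portable stand-in for the 32-bit BSF instruction and, like BSF, assumes a non-zero input.

```c
#include <stdint.h>

/* Portable stand-in for the 32-bit BSF instruction: index of the lowest
   set bit.  Like BSF, the result is undefined for x == 0. */
static int bsf32(uint32_t x) {
    int i = 0;
    while (!(x & 1)) { x >>= 1; i++; }
    return i;
}

/* 64-bit first-one from two 32-bit scans: test the low half first, and
   add 32 when the bit is found in the high half.  bb must be non-zero. */
static int first_one64(uint64_t bb) {
    uint32_t lo = (uint32_t)bb;
    if (lo) return bsf32(lo);
    return 32 + bsf32((uint32_t)(bb >> 32));
}
```

The extra test and the +32 adjustment are exactly the overhead that a true 64-bit first-one instruction would remove.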
>
>
>>Speed of BSF/R on the pentium pro: 2/3 micro-ops and non-pairable,
>>where normal instructions like MOV are 1 micro-op and pairable.
>
>actually this is not true, according to my ppro manual.  The bsf/r is
>broken up into RISC (micro-op) instructions that run along with
>everything else.  The pro doesn't "pair" instructions at all...  That's
>the old pentium approach...
>
>
>>
>>But supposing that this cache-to-register copy gets as fast as you
>>want that firstone to be, then this cache approach still always beats
>>you, and the facts are there.  Please don't tell stories about why I
>>could not get my generation table into cache but would need to get it
>>out of slow memory; caches get bigger, and I have made an EFFORT to
>>get my tables within the L2 cache, and in the future they'll be in the
>>L1 cache too, while that firstone of yours stays just as slow at Intel
>>(and supposing BSF and BSR get faster, you still need the same slow
>>loop to get the squares out, AFTER you have way more overhead than I
>>do, where I get the squares directly).
>
>FirstOne() doesn't have to be slow.  It is done in one clock on the
>Cray, and is done in one clock on the Intel FPU for normalization.  So
>it will be fast on next-generation machines, as it should be...
>
>
>>
>>The problems of bitboards are not the BSF and BSR instructions.  The
>>problem is that you still need that loop to get the squares out, and
>>that you need to program for EVERY piece, where I with 1 general
>>purpose routine can generate all pieces.
>
>you are underestimating, but that's ok.  You are barely 50% faster now.
>And when my FirstOne() goes away in the move generation loop, my move
>generator will be way more than 50% faster, just from getting rid of
>the FirstOne memory traffic.  Then I end up with a very short loop with
>3-4 instructions plus a memory write.  The 3-4 instructions get hidden
>behind the write, and everything basically runs as fast as memory can
>take the data.  Which is also your limit, except you will be pumping 8X
>as much data as I do, once we reach that architecture with a real
>FirstOne() [64-bit] in hardware.
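The short emission loop Bob describes might look like this in C. This is a sketch: the packed `(from << 6) | to` move encoding is illustrative, not Crafty's actual format, and `__builtin_ctzll` (gcc/clang) stands in for a hardware first-one instruction.

```c
#include <stdint.h>

/* Emit one packed move per set bit in the attack map; returns the count.
   One store per move -- the loop body is the 3-4 instructions that hide
   behind the memory write. */
static int emit_moves(int from, uint64_t attacks, int *moves) {
    int n = 0;
    while (attacks) {
        int to = __builtin_ctzll(attacks);  /* hardware first-one */
        moves[n++] = (from << 6) | to;      /* one int per move */
        attacks &= attacks - 1;             /* clear lowest set bit */
    }
    return n;
}
```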
>
>
>
>>
>>If I would program for generation speed, store a move in 1 int, and
>>program everything per piece, then of course I'd be way faster than
>>the current speed difference shows.
>>
>>The second problem of bitboards is that you can do only a limited
>>number of things with them.
>>
>>Suppose I have a table with mobility values for every square, where
>>the values are generally different, as controlling a1 is not as
>>important as controlling d4 is.  Quite logical.
>>
>>Can you show some sample code for how you count this with bitboards?
>>That's pure horror in my opinion.
>>
>
>easy on the Cray.  Load your 64 words into an array (vector).  Load the
>bitmap with 1's on each square that piece can reach into the vector
>mask register.  Do a vector merge which returns me *just* the values
>from the table where your bishop (or whatever) can reach.  Sum those
>with a vector add and I'm done.
>
>On a non-vector machine, it takes more time, but it can still be very
>efficient.  Your problem is that you haven't used bitmaps.  And it
>takes a good bit of time to "think bitmaps"... just as it takes a good
>bit of time to "think vector processing" to use a supercomputer.  15
>minutes won't do it...
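On a scalar machine, the per-square mobility sum reduces to a loop that strips one bit at a time. This is only an illustration: the weight values are placeholders, and `__builtin_ctzll` (gcc/clang) stands in for a hardware first-one instruction.

```c
#include <stdint.h>

/* Per-square weights (placeholders): center squares like d4 count more
   than corners like a1.  Square 0 = a1, rank-major order. */
static const int mobility[64] = {
    1, 1, 1, 1, 1, 1, 1, 1,
    1, 2, 2, 2, 2, 2, 2, 1,
    1, 2, 3, 3, 3, 3, 2, 1,
    1, 2, 3, 4, 4, 3, 2, 1,
    1, 2, 3, 4, 4, 3, 2, 1,
    1, 2, 3, 3, 3, 3, 2, 1,
    1, 2, 2, 2, 2, 2, 2, 1,
    1, 1, 1, 1, 1, 1, 1, 1,
};

/* Sum the weight of every square in the attack map: pop the lowest set
   bit until the map is empty. */
static int mobility_score(uint64_t attacks) {
    int score = 0;
    while (attacks) {
        score += mobility[__builtin_ctzll(attacks)];
        attacks &= attacks - 1;   /* clear lowest set bit */
    }
    return score;
}
```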
>
>
>
>
>>Further, in my evaluation I use attack tables which consist of a
>>number of things, like which piece attacks a square and the number of
>>attackers.
>>
>>The number of attackers is something I use a lot in my eval.
>>I have this number in an array int A[64], which is definitely in the
>>cache, so I look it up with A[square].
>>
>>Now I think this is also a major disadvantage of bitboards.  To look
>>it up with bitboards you need to add all kinds of different things
>>with masses of 'if's.
>>
>
>I can compute that same array quite easily, and probably quite a bit
>faster than you can compute it in the first place...  that's the point
>here.  You want to find ways to keep everything in bitmaps, but that's
>not the most efficient way at times.  But nothing says I can't use
>bitmaps to produce that A[64] array.  In fact, if you go back to crafty
>versions 1-5, you will find just such an array, which was updated
>incrementally, and was accessible just by fetching the entry you
>wanted...  I just didn't find good uses for that info and took it out,
>choosing to compute the attacks dynamically at the point where they are
>needed, so I can avoid computing them when they aren't needed, to save
>time.
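Producing that A[64] attacker-count array from bitmaps can be sketched as below. This is an illustration under the assumption of one attack bitboard per piece (not Crafty's code); `__builtin_ctzll` (gcc/clang) stands in for a hardware first-one instruction.

```c
#include <stdint.h>

/* Fill a[sq] with the number of pieces attacking sq, given one attack
   bitboard per piece: for each piece, bump the counter of every square
   its attack map covers. */
static void attack_counts(const uint64_t *attack_maps, int n_pieces,
                          int a[64]) {
    for (int sq = 0; sq < 64; sq++) a[sq] = 0;
    for (int p = 0; p < n_pieces; p++) {
        uint64_t m = attack_maps[p];
        while (m) {
            a[__builtin_ctzll(m)]++;   /* count this attacker */
            m &= m - 1;                /* clear lowest set bit */
        }
    }
}
```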
>
>
>
>>I'm not asking you to show how you do this in bitboards; that would
>>give you nightmares.  If you call a function, then show the audience
>>how much work that function does JUST TO CALCULATE IT FOR 1 SQUARE,
>>where I have all 64 ready.
>>
>>Where you are waiting till you get an Alpha for free, I'm waiting for
>>a processor where branch prediction doesn't take that much system time
>>when it mispredicts.
>
>
>the pentium pro is not bad.  And if your compiler is any good, it can
>use the conditional move instruction, which lets you copy from one of
>two memory addresses to a register, or from one of two registers to
>another register, without the necessity of a branch of any kind...  The
>Pro will already get all but 1 of each loop iteration's branches
>right...  which is the majority of the branches in a program...  And
>with speculative execution, it doesn't take a huge hit when it
>mispredicts..  and with "deep branch prediction and speculative
>execution" most branches cause no problems whether they are taken or
>not.  You need to read the "Pentium Pro Family Developer's Manual,
>Volume 2: Programmer's Reference Manual" published by Intel.  It
>explains all and shows why branches are not hurting your performance
>nearly as much as you seem to think...
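At the source level, the conditional-move pattern is just a branch-free select; whether the compiler actually emits a `cmov` for it depends on the compiler and flags.

```c
/* A data-dependent select like this can compile to a cmov on the
   Pentium Pro and later, so no branch is predicted at all. */
static int max_score(int a, int b) {
    return (a > b) ? a : b;
}
```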

Sorry Bob,

But if you'll look in the other Intel manual - AP-526,
"Optimization for Intel's 32-Bit Processors" (available
for download from Intel's web site), section 2.4.4 (Branch
Target Buffer for Pentium Pro Processor), you can read:

"The penalty for mispredicted branches is at least 9
cycles (the length of the In-Order Issue Pipeline) of
lost instruction fetch, plus additional time spent
waiting for the mispredicted branch instruction to
become the oldest instruction in the machine and retire.
This penalty is non-deterministic, dependent upon
execution circumstances, but experimentation shows that
this is nominally a total of 10-15 cycles."

Also, if you remember, when Crafty still used assembly
code and I removed branches in several routines
(including, BTW, FirstOne), Crafty became 1.5% faster.
Code size decreased very slightly, and there was still
the cost of the function call (not at execution time -
it greatly complicated the optimizer's register
allocation work, with only 7 registers available), so I
think that the main speedup came from removing the
mispredicted branches.

Eugene

>
>>
>>I have so many compares in my eval, which takes 90% of system time,
>>that I'm totally dependent on the speed of the processor after a
>>misprediction.
>
>
>again, this is not exactly true..  on a branch, the cpu follows *both*
>paths until it becomes certain which path it can throw away.  And it
>works those instructions in thru its "pooling" to use cycles when a
>functional unit would otherwise be idle anyway...  the penalty is much
>less than you think...
>
>
>>
>>That would speed up my program several times, where bitboards would
>>slow it down another 4 times or so, and take 2 times more code too,
>>which causes again some extra delay.
>
>perfect branch prediction wouldn't speed you up by 25%, much less 4
>times or so...  again, intel has data that shows best and worst cases,
>and worst case is not 50%, much less 400%.
>
>
>
>>
>>Greetings,
>>Vincent
>


