Computer Chess Club Archives



Subject: Re: Speedups for BitBoard programs on 64-bit machines

Author: Robert Hyatt

Date: 21:30:13 06/07/02


On June 06, 2002 at 20:54:41, Vincent Diepeveen wrote:

>On June 05, 2002 at 22:01:30, Robert Hyatt wrote:
>
>PERHAPS IT IS TIME YOU PROFILE CRAFTY AGAIN.

So the profile from _last week_ was too old for you?  I haven't changed one
line of code since that profile run was done...

Remember that I am doing profile-based optimizations, so I _must_ re-profile
every time I change a line to get the fastest Crafty version.

Care to re-think that statement now???



>
>Your previous run must have been 10 years ago or so,
>if you guess it needs 50% of the system time for its
>evaluation.

Eh?  Crafty hasn't even been around for 10 years.  That profile was very recent.


>
>The position where evaluation takes the most system time is
>the opening, of course.  A long profile run from there is
>the most 'lucky' from Crafty's viewpoint.



NO...  the most time is spent when there are passed pawns and king
safety issues, because those terms aren't hashed; they are computed
dynamically each time a position is evaluated.




>
>No way I can let it eat more than 42% of the system time
>for *all* the evaluation functions together.
>
>Note that the pawn structure eats very little system time here.

I assume you profile on the opening position?

I do something better than that, so that the feedback optimizations work
for _all_ parts of the game.




>
>Perhaps taking many MBs for pawnhashtable size is a bit big?
>
>I took 96MB for hashtable and 12MB for pawntable or so.
>
>In short in the far endgame it'll be more like 20% of the system
>time going to evaluation.

So?  The middlegame is where the "action" is...


>
>Note that this was a parallel compile, but not a parallel run.  So
>in reality some overhead is wasted on that too, but I see that as
>a loss everyone incurs anyway.

There is no loss in Crafty for a parallel compile with no parallel
run.  That costs two instructions per node, which is too small to
measure: well under 0.1%.




>
>>On June 05, 2002 at 13:32:52, Vincent Diepeveen wrote:
>>
>>>On June 05, 2002 at 04:17:12, Bas Hamstra wrote:
>>>
>>>You forget to mention evaluation.  Seems you guys forget
>>>that chess is about evaluation of a position. You need
>>>so much system time for SEE, Makemove and unmaking
>>>moves that it seems you simply have *no time* for evaluation!
>>>
>>>If the capturing routine/SEE in qsearch is eating all of your system
>>>time, then my advice is to NOT use a qsearch, but to use an evaluation
>>>that directly estimates what you could possibly lose. Just 1 piece
>>>of course. Remove that from the score then.
>>>
>>>That's a few clocks more and your thing gets a few million nodes a second,
>>>but for sure searches 3 ply deeper.
>>
>>My results here are already well known.  Crafty spends nearly 50% of the total
>>time in Evaluate() and its sub-functions.  SEE, etc are all very small parts.
>>I see Evaluate range from a low of 33% of total search time to a high of just
>>over 55%.  Note that in the profile code you have to look carefully to get
>>all of the individual parts of Evaluate() and not just Evaluate() by itself.
>>
>>
>>
>>>
>>>>On June 04, 2002 at 20:31:41, Robert Hyatt wrote:
>>>>
>>>>>On June 04, 2002 at 18:01:03, Gian-Carlo Pascutto wrote:
>>>>>
>>>>>>On June 04, 2002 at 17:52:47, Dann Corbit wrote:
>>>>>>
>>>>>>>On June 04, 2002 at 16:28:39, Gian-Carlo Pascutto wrote:
>>>>>>>
>>>>>>>>On June 04, 2002 at 16:18:55, Gian-Carlo Pascutto wrote:
>>>>>>>>
>>>>>>>>>Because you are using a processor that is clocked at twice the clock
>>>>>>>>>frequency?  Why compare a 1ghz processor to a (nearly) 2ghz processor
>>>>>>>>>and conclude anything about efficiency there?  Is there anything that
>>>>>>>>>suggests that the alpha is simply more "efficient"?  To justify that
>>>>>>>>>clock frequency disparity?
>>>>>>>>>
>>>>>>>>>A machine twice as fast (clock freq) _should_ perform just as well as
>>>>>>>>>a 64 bit machine at 1/2 the frequency...  Less would suggest that the
>>>>>>>>>32 bit machine simply sucks badly.
>>>>>>>>
>>>>>>>>I don't agree with the validity of a clock-for-clock comparison,
>>>>>>>>but if you want to do it anyway, I'll again point to Vincent's
>>>>>>>>numbers:
>>>>>>>>
>>>>>>>>At the same clockspeed, Crafty only gets 33% faster on the 64-bit
>>>>>>>>machine.
>>>>>>>>
>>>>>>>>When you read this, keep in mind that most applications get _more_
>>>>>>>>than 33% faster on the 64-bit machine.
>>>>>>>
>>>>>>>All the new 64 bit chips in the discussion are pretty much beta stage right
>>>>>>>now.
>>>>>>
>>>>>>Not true for the Alpha.
>>>>>
>>>>>Depends on the alpha being discussed.  DEC had processors beyond the 21264
>>>>>running, although the 21264 was pretty good.  Dann was a bit off on the
>>>>>performance, as Tim Mann was running a 21264 at 600MHz and getting right at
>>>>>1M nodes per second.  McKinley is getting 1.5M at 1000MHz, so the alpha might
>>>>>still have a bit of an advantage, but it is pretty small...
>>>>>
>>>>>McKinley is only available to a select few.  21264s are fairly common.
>>>>>Anything beyond that is not readily available...
>>>>>
>>>>>
>>>>>>
>>>>>>>So, I think that architecturally, it makes good sense to design for a 64 bit
>>>>>>>system right now.
>>>>>>
>>>>>>That makes sense, if the 64 bit design is actually faster than the corresponding
>>>>>>32 bit design (even on 64 bit hardware if you wish).
>>>>>>
>>>>>>The case for bitboards is not clear on that matter. Certainly, if
>>>>>>the speedup over nonbitboards is only 33% they will have a hard time
>>>>>>convincingly beating alternative approaches even on 64 bit hardware.
>>>>>>
>>>>>>--
>>>>>>GCP
>>>>>
>>>>>You are assuming that bitboards are _slower_ than non-bitboard programs on
>>>>>32 bit machines.  I haven't seen this demonstrated yet.  We can always do some
>>>>>sort of a test.  IE since the most common move generator issue is "generate all
>>>>>captures" we can try that with bitboard and non-bitboard approaches to see if
>>>>>one is really much better than the other on 32 bit machines.  I don't think so
>>>>>myself.  I think they are pretty equal due to the multiple pipe issue.
>>>>>
>>>>>But a test could be done to see, since this is the most common thing needed
>>>>>in a chess engine.
>>>>
>>>>That's not a fair test, I think. IMO the most heavily used routines are:
>>>>
>>>>- See()
>>>>- GenCaps()
>>>>- SquareAttacked()
>>>>- Make/Unmake()
>>>>
>>>>You just picked the one at which bitboards are good. In fact it is nearly
>>>>impossible to figure out what's best overall by comparing only parts. What you
>>>>could do though, is generate "profile data" about a search in average middlegame
>>>>positions, and see how many times each of the above functions is being called.
>>>>Then we could turn this into a sort of benchmark:
>>>>
>>>>10.000 * a()
>>>>8.000 * b()
>>>>3.000 * d()
>>>>5.000 * c()
>>>>
>>>>and compare times for bitboards and 0x88 to do this. This would at least tell us
>>>>if bitboards is faster *for Crafty*.
>>>>
>>>>
>>>>Best regards,
>>>>Bas.




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.