Author: Scott Gasch
Date: 12:52:31 02/15/05
Well, this started as a simple response to some points Pallav made but digressed into a random tirade about code speed improvement. I've done a lot of this stuff and have some tips.

There are two kinds of speedups: lossy and lossless. Lossy ones are shortcuts like lazy eval, where you are searching a different tree after the change. These are great for speed but can kill you for strength, so separate the issues of "nodes per second" and "my program is strong" in your brain. Just a few weeks ago I tested a change where I added an eval hash to my engine. The thought was: I call Eval() in the middle of the tree quite frequently (for "razoring" and other ideas), so quite often I am re-evaluating the same position I evaluated not long ago. I thought it would be a big win to keep a small table of the last position signatures, their alpha..beta window, their eval, and some other data my eval returns. Well, I did this and it was great... 15 knps faster. But when I ran the new version against test suites it did not do as well -- in fact markedly worse.

Lossless speed, of course, comes from doing the same thing you do now, only faster. I've profiled (with more than one profiler) and rewritten several functions in x86 asm: a couple because they are slow and I don't like the code my compilers generate for them, others because I can take advantage of x86 opcodes like bsr and bsf (see Crafty's x86.s for an example).

While we're talking about lossless speed, here's another example: max(x,y), min(x,y), min(x,0), max(x,0), abs(x-y) (and probably more) can sometimes be written without a branch (a small C sketch follows below). Note: if your compiler is smart enough to use cmov instructions then you probably don't care. But I am using gcc on FreeBSD and VC.NET on win32, and the latter doesn't do a good job with cmovs. Now, if neither x nor y has its MSB set (so the subtraction cannot overflow), then:

1. (x - y) is negative if y > x and positive if x > y.
2. (int)(x - y) >> 31 is 0xffffffff or 0x00000000.
3. ((int)(x - y) >> 31) & (x - y) is either (x - y) or 0.
4. (((int)(x - y) >> 31) & (x - y)) + y is either x - y + y (== x) or 0 + y (== y), i.e. min(x, y).

Search online or check out the book "Hacker's Delight" for more of these.

The best way to spot these lossless changes is to profile, make a list of your top N functions, and read them. While you read, think: I am spending X% of the execution time in here; what can I do to speed this up? Rewrite those functions several ways and pick the one that is fastest. Disassemble them, etc. When you make a lossless change, run a whole test suite, count the total number of nodes visited and the nps, make the change, and run the same test again. Make sure the node count didn't change (and the nps went up).

Somewhat related: make sure you are using some kind of version control. I can't recommend this enough. Further advice: make ONE CHANGE per version. This is something I learned the hard way. If you just start hacking around in your code and check in a new version with 10 new changes, you'll never know which one screwed you up. And of course, it goes without saying: run some kind of test on the code before you check it in. I run ECM overnight and paste the results in the version checkin tag.
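Here is a minimal sketch of the branchless min/max trick described above. The function names are made up for illustration; it assumes 32-bit int, that x - y does not overflow, and that right-shifting a negative signed int is an arithmetic shift (true on x86 with gcc and MSVC, but not guaranteed by the C standard):

```c
#include <assert.h>

/* Branchless min: mask is all ones when x < y, all zeros otherwise. */
static int min_branchless(int x, int y)
{
    int diff = x - y;
    int mask = diff >> 31;        /* 0xffffffff if x < y, else 0 */
    return (diff & mask) + y;     /* x if x < y, else y          */
}

/* Branchless max: same idea with the mask inverted. */
static int max_branchless(int x, int y)
{
    int diff = x - y;
    int mask = diff >> 31;
    return (diff & ~mask) + y;    /* y if x < y, else x          */
}

/* Branchless abs(x - y). */
static int abs_diff_branchless(int x, int y)
{
    int diff = x - y;
    int mask = diff >> 31;
    return (diff ^ mask) - mask;  /* negate diff when it is negative */
}

int main(void)
{
    assert(min_branchless(3, 7) == 3);
    assert(max_branchless(3, 7) == 7);
    assert(abs_diff_branchless(3, 7) == 4);
    return 0;
}
```

The min(x,0) and max(x,0) cases fall out of the same pattern by substituting y = 0.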
On February 15, 2005 at 12:59:42, Michel Langeveld wrote:

>On February 15, 2005 at 12:26:36, Pallav Nawani wrote:
>
>>Hi,
>>
>>My chess engine, Natwarlal, is very slow, getting 350 KNPS on an Athlon 1700.
>>Crafty, on the other hand, does a hefty 650K. What is the most important thing
>>for high NPS of the engine? There are a few speedup tricks that I know of and am
>>not using, but can anyone quantify the speedup in % provided by these,
>>and/or add some new ideas?
>>
>>1. Pawn hash (Natwarlal uses it, but not efficiently).
>>2. King hash
>>3. Eval cache
>>4. Lazy eval. Natwarlal doesn't use this, but then Natwarlal's eval is very
>>skimpy already!
>>5. Any other trick?
>>
>>Thanks,
>>Pallav
>
>Besides the profiling that Dann mentioned:
>
>compiler stuff:
>* select a fast language (assembler? C? etc.)
>* use a fast compiler (64-bit OS? Intel? MSVC 2005? gcc? etc.)
>* compile with all optimizations on, and test with each option on and off
>* include everything in one big file instead of multiple files

Some ideas in this category:

1. Try different compiler settings -- for example -O2, -O3, -O4 on gcc. Also try "custom" optimization levels like -O2 -fexpensive-optimizations, etc.
2. Make sure you tell the compiler what processor you are building for. Gcc is good at using cmov instructions to avoid small branches, but only if you tell it you are building for a CPU that can handle them. (FWIW: my version of MSVC, which is VC.NET, seems to be rather bad at generating cmovs. Here's hoping the code-gen guys spend some time on the x86 backend for the next release.)
3. Try different link orders. In general I've found that if I keep frequently used functions close together, the engine is 2-3% faster. I once wrote a perl script to permute the order of some .obj files in my build and benchmark the resulting binaries.
4. Know what calling conventions are and when they make sense.
5. Use static and inline to help the compiler know when a function can be eliminated (and inlined).
6. Disassemble your most important functions and see what code you got. Is that function you think is being inlined really inlined? If not, think about a #define or __forceinline. Is your compiler generating good code? If you think you can do better, consider rolling your own.

>chess related:
>* try all items in many ways and use the fastest - so work hard
>* try different algorithms and use the fastest - so work hard
>* use the hash move, and skip generating moves in many positions

1. Try different move generators and pick the fastest.
2. Be as lazy as possible. If you don't need a piece of information yet and there is a chance you might never need it, don't compute it. Compute it only when you are sure you need it.
3. When you read your code, always think: is there another way to do this? In general, favor small, solid code, but always be on the lookout for another way to do something, especially if the other way eliminates branches, reduces computation, etc. An example: I recently replaced a somewhat large conditional in the middle of my mobility loops with code that jumps through a table of function pointers indexed by the piece on the square (see the sketch after this reply). It was a small speed gain despite the overhead of calling the function.

>
>if you want to squeeze out the last extra bit... (and make it unreadable and
>impossible to work on, eventually):
>
>general:
>* keep your code as compact as possible, or test writing some items out
>* make items global instead of passing them to each function: like the board

Note: this will help your speed but will make your life hell if you ever decide to make your engine multithreaded (i.e., parallel search).
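A rough sketch of the function-pointer dispatch idea from point 3 above. The piece codes, board layout, and mobility routines are hypothetical stand-ins (the stubs just return constants where real code would count attacked squares); the point is the jump table replacing a per-square conditional:

```c
#include <stdio.h>

/* Hypothetical piece codes and a minimal board representation. */
enum { EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING, NUM_PIECES };

typedef struct {
    int piece[64];   /* piece code on each square */
} Board;

/* One routine per piece type, all sharing the same signature.
 * Placeholder bodies: a real engine would scan rays or bitboards here. */
static int MobEmpty(const Board *b, int sq)  { (void)b; (void)sq; return 0; }
static int MobPawn(const Board *b, int sq)   { (void)b; (void)sq; return 0; }
static int MobKnight(const Board *b, int sq) { (void)b; (void)sq; return 3; }
static int MobBishop(const Board *b, int sq) { (void)b; (void)sq; return 5; }
static int MobRook(const Board *b, int sq)   { (void)b; (void)sq; return 6; }
static int MobQueen(const Board *b, int sq)  { (void)b; (void)sq; return 11; }
static int MobKing(const Board *b, int sq)   { (void)b; (void)sq; return 1; }

/* The jump table: indexing by piece code replaces a chain of ifs (or a
 * switch) in the middle of the loop with one indirect call. */
static int (*const MobilityFn[NUM_PIECES])(const Board *, int) = {
    MobEmpty, MobPawn, MobKnight, MobBishop, MobRook, MobQueen, MobKing
};

static int TotalMobility(const Board *b)
{
    int sq, total = 0;

    for (sq = 0; sq < 64; sq++)
        total += MobilityFn[b->piece[sq]](b, sq);
    return total;
}

int main(void)
{
    Board b = { { EMPTY } };

    b.piece[0]  = ROOK;
    b.piece[27] = QUEEN;
    printf("mobility = %d\n", TotalMobility(&b));
    return 0;
}
```

Whether this beats a plain switch depends on the compiler and CPU, so as with everything else here: benchmark both.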
>* look at your eval, and keep some items incremental instead of updating each
>item from scratch again and again

I think what he means here is to do things at MakeMove / UnMakeMove so they are just sitting there in the position struct when you get to eval (a rough sketch follows at the end of this post). This is smart for some things. In addition, look at your eval code and find common terms. For example, all over the place in eval I am scaling things based on material, and I do some stuff with pawn bitboards (which ones are stationary, etc.). Move all that code to the front of eval (just after you make your lazy-eval decision, maybe) and do it once. There's no sense in computing the same junk over and over again.

>* make some items const instead of variable
>* test the speed difference of the members in a struct
>* test the speed difference of the type of a variable (int, long, short, char,
>etc.)

If you are not trying to conserve space, use a variable that is the same size as a register on the machine.

>* try to use as few if (b) tests as possible; use lookup tables instead
>* don't make your lookup tables too big ... in order to keep your program size
>as small as possible

Know the cache line size on your machine. I use this information, for example, to multi-probe the hash table: when I touch the first hash table entry I know that the next three are already in the cache, so it's essentially free to check them too (see the sketch below).

>* program important stuff in assembler
>
>more?

Make sure you commit as much stack space as you need. You probably need more stack space than the default linker gives you. Touching a guard page and waiting for the kernel to grow your stack for you is really slow and not something you want to do in a search. Worse yet, if (for some reason) there is no free commit on your machine when you try to grow the stack, your process will simply be terminated. This is unlikely, but make sure you ask for enough stack.

This is a bit overboard (maybe), but be friends with your OS. I lock the memory behind my hash table so that I am sure I won't take a page fault while probing it.

There are a lot of good webpages about optimizing code with general rules: avoid multiplies and divides unless they can be turned into shifts; make your multidimensional table indexes powers of 2 so the compiler can use shifts to compute offsets; remove branches when you can; etc.

Good luck,
Scott
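A minimal sketch of the incremental-update idea mentioned above (the struct layout, piece values, and function names are illustrative assumptions, not from any particular engine; only material is maintained incrementally here):

```c
#include <stdio.h>

enum { WHITE, BLACK };
enum { EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

static const int PieceValue[] = { 0, 100, 300, 300, 500, 900, 0 };

typedef struct {
    int material[2];   /* running material total per side */
} Position;

/* Called from MakeMove when a piece is captured: adjust the running
 * total instead of re-summing the whole board inside Eval(). */
static void MakeCapture(Position *pos, int victim_piece, int victim_side)
{
    pos->material[victim_side] -= PieceValue[victim_piece];
}

/* Called from UnMakeMove: put the value back. */
static void UnmakeCapture(Position *pos, int victim_piece, int victim_side)
{
    pos->material[victim_side] += PieceValue[victim_piece];
}

/* Eval() just reads totals that are already sitting in the struct. */
static int MaterialBalance(const Position *pos, int side)
{
    return pos->material[side] - pos->material[side ^ 1];
}

int main(void)
{
    Position pos = { { 3900, 3900 } };   /* full starting material */

    MakeCapture(&pos, KNIGHT, BLACK);
    printf("white up %d after the capture\n", MaterialBalance(&pos, WHITE));
    UnmakeCapture(&pos, KNIGHT, BLACK);
    printf("balance restored: %d\n", MaterialBalance(&pos, WHITE));
    return 0;
}
```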
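And a rough sketch of the cache-line-aware multi-probe described above, assuming a 16-byte hash entry and a 64-byte cache line so that four entries share one line (the entry layout and names are hypothetical, not Scott's actual table format):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A 16-byte hash entry: four of these fit in one 64-byte cache line. */
typedef struct {
    uint64_t sig;     /* position signature (Zobrist key) */
    int16_t  score;
    int8_t   depth;
    uint8_t  flags;
    uint32_t move;
} HashEntry;

#define ENTRIES_PER_LINE 4   /* 64-byte line / 16-byte entry */

static HashEntry *table;
static size_t     num_buckets;   /* number of 4-entry buckets */

/* Probe all four entries that share the bucket's cache line.  The first
 * touch pulls the whole line into cache, so checking the other three
 * entries is essentially free compared with another random access. */
static HashEntry *ProbeHash(uint64_t sig)
{
    HashEntry *bucket = &table[(sig % num_buckets) * ENTRIES_PER_LINE];
    int i;

    for (i = 0; i < ENTRIES_PER_LINE; i++)
        if (bucket[i].sig == sig)
            return &bucket[i];
    return NULL;   /* miss: caller picks a replacement victim in this bucket */
}

int main(void)
{
    num_buckets = 1u << 20;
    /* In a real engine this allocation should also be aligned to the cache
     * line (posix_memalign / _aligned_malloc) so buckets never straddle lines. */
    table = calloc(num_buckets * ENTRIES_PER_LINE, sizeof(HashEntry));
    if (table == NULL)
        return 1;

    printf("probe result: %p\n", (void *)ProbeHash(0x123456789abcdef0ULL));
    free(table);
    return 0;
}
```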