Author: Scott Gasch
Date: 12:52:31 02/15/05
Well, this started as a simple response to some points Pallav made but digressed into a random tirade about code speed improvement. I've done a lot of this stuff and have some tips.

There are two kinds of speedups: lossy and lossless. Lossy ones are shortcuts like lazy eval, where you are searching a different tree after the change. These are great for speed but can kill you for strength, so separate the issues of "nodes per second" and "my program is strong" in your brain. Just a few weeks ago I tested a change where I added an eval hash to my engine. The thought was: I call Eval() in the middle of the tree quite frequently (for "razoring" and other ideas), so quite often I am re-evaluating the same position I evaluated not long ago. I thought it would be a big win to keep a small table of the last position signatures, their alpha..beta window, their eval, and some other data my eval returns. Well, I did this and it was great... 15 knps faster. But when I ran the new version against test suites it did not do as well -- in fact markedly worse.

Lossless speed, of course, comes from doing the same thing you do now, only faster. I've profiled (with more than one profiler) and rewritten several functions in x86 asm: a couple because they are slow and I don't like the code my compilers generate for them, others because I can take advantage of x86 opcodes like bsr and bsf (see Crafty's x86.s for an example).

While we're talking about lossless speed, here's another example: max(x,y), min(x,y), min(x,0), max(x,0), abs(x-y) (and probably more) can sometimes be written without a branch (a small C sketch follows below). Note: if your compiler is smart enough to use cmov instructions then you probably don't care. But I am using gcc on FreeBSD and VC.NET on win32, and the latter doesn't do a good job with cmovs. Now, if neither x nor y has its MSB set (so the subtraction cannot overflow), then:

1. (x - y) is negative if y > x and positive if x > y.
2. (int)(x - y) >> 31 is 0xffffffff or 0x00000000.
3. ((int)(x - y) >> 31) & (x - y) is either (x - y) or 0.
4. (((int)(x - y) >> 31) & (x - y)) + y is either x - y + y (== x) or 0 + y (== y), i.e. min(x, y).

Search online or check out the book "Hacker's Delight" for more of these.

The best way to spot these lossless changes is to profile, make a list of your top N functions, and read them. While you read, think: I am spending X% of the execution time in here; what can I do to speed this up? Rewrite those functions several ways and pick the one that is fastest. Disassemble them, etc. When you make a lossless change, run a whole test suite, count the total number of nodes visited and the nps, make the change, and run the same test again. Make sure the node count didn't change (and the nps went up).

Somewhat related: make sure you are using some kind of version control. I can't recommend this enough. Further advice: make ONE CHANGE per version. This is something I learned the hard way. If you just start hacking around in your code and check in a new version with 10 new changes, you'll never know which one screwed you up. And of course, it goes without saying: run some kind of test on the code before you check it in. I run ECM overnight and paste the results in the version checkin tag.
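Here is a minimal sketch of the branchless min/max trick described above. The function names are made up for illustration; it assumes 32-bit int, that x - y does not overflow, and that right-shifting a negative signed int is an arithmetic shift (true on x86 with gcc and MSVC, but not guaranteed by the C standard):

```c
#include <assert.h>

/* Branchless min: mask is all ones when x < y, all zeros otherwise. */
static int min_branchless(int x, int y)
{
    int diff = x - y;
    int mask = diff >> 31;        /* 0xffffffff if x < y, else 0 */
    return (diff & mask) + y;     /* x if x < y, else y          */
}

/* Branchless max: same idea with the mask inverted. */
static int max_branchless(int x, int y)
{
    int diff = x - y;
    int mask = diff >> 31;
    return (diff & ~mask) + y;    /* y if x < y, else x          */
}

/* Branchless abs(x - y). */
static int abs_diff_branchless(int x, int y)
{
    int diff = x - y;
    int mask = diff >> 31;
    return (diff ^ mask) - mask;  /* negate diff when it is negative */
}

int main(void)
{
    assert(min_branchless(3, 7) == 3);
    assert(max_branchless(3, 7) == 7);
    assert(abs_diff_branchless(3, 7) == 4);
    return 0;
}
```

The min(x,0) and max(x,0) cases fall out of the same pattern by substituting y = 0.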
On February 15, 2005 at 12:59:42, Michel Langeveld wrote:

>On February 15, 2005 at 12:26:36, Pallav Nawani wrote:
>
>>Hi,
>>
>>My chess engine, Natwarlal, is very slow, getting 350 KNPS on an Athlon 1700.
>>Crafty, on the other hand, does a hefty 650K. What is the most important thing
>>for high NPS of the engine? There are a few speedup tricks that I know of and am
>>not using, but can anyone quantify the speedup in % provided by these,
>>and/or add some new ideas?
>>
>>1. Pawn hash (Natwarlal uses it, but not efficiently).
>>2. King hash
>>3. Eval cache
>>4. Lazy eval. Natwarlal doesn't use this, but then Natwarlal's eval is very
>>skimpy already!
>>5. Any other trick?
>>
>>Thanks,
>>Pallav
>
>Besides the profiling that Dann mentioned:
>
>compiler stuff:
>* select a fast language (assembler? C? etc.)
>* use a fast compiler (64-bit OS? Intel? MSVC 2005? gcc? etc.)
>* compile with all optimizations on, and test with each option on and off
>* include everything in one big file instead of multiple files

Some ideas in this category:

1. Try different compiler settings -- for example -O2, -O3, -O4 on gcc. Also try "custom" optimization levels like -O2 -fexpensive-optimizations, etc.
2. Make sure you tell the compiler what processor you are building for. Gcc is good at using cmov instructions to avoid small branches, but only if you tell it you are building for a CPU that can handle them. (FWIW: my version of MSVC, which is VC.NET, seems to be rather bad at generating cmovs. Here's hoping the code-gen guys spend some time on the x86 backend for the next release.)
3. Try different link orders. In general I've found that if I keep frequently used functions close together, the engine is 2-3% faster. I once wrote a perl script to permute the order of some .obj files in my build and benchmark the resulting binaries.
4. Know what calling conventions are and when they make sense.
5. Use static and inline to help the compiler know when a function can be eliminated (and inlined).
6. Disassemble your most important functions and see what code you got. Is that function you think is being inlined really inlined? If not, think about a #define or __forceinline. Is your compiler generating good code? If you think you can do better, consider rolling your own.

>chess related:
>* try all items in many ways and use the fastest - so work hard
>* try different algorithms and use the fastest - so work hard
>* use the hash move, and skip generating moves in many positions

1. Try different move generators and pick the fastest.
2. Be as lazy as possible. If you don't need a piece of information yet and there is a chance you might never need it, don't compute it. Compute it only when you are sure you need it.
3. When you read your code, always think: is there another way to do this? In general, favor small, solid code, but always be on the lookout for another way to do something, especially if the other way eliminates branches, reduces computation, etc. An example: I recently replaced a somewhat large conditional in the middle of my mobility loops with code that jumps through a table of function pointers indexed by the piece on the square (see the sketch after this reply). It was a small speed gain despite the overhead of calling the function.

>
>if you want to squeeze out the last extra bit... (and make it unreadable and
>impossible to work on, eventually):
>
>general:
>* keep your code as compact as possible, or test writing some items out
>* make items global instead of passing them to each function: like the board

Note: this will help your speed but will make your life hell if you ever decide to make your engine multithreaded (i.e., parallel search).
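A rough sketch of the function-pointer dispatch idea from point 3 above. The piece codes, board layout, and mobility routines are hypothetical stand-ins (the stubs just return constants where real code would count attacked squares); the point is the jump table replacing a per-square conditional:

```c
#include <stdio.h>

/* Hypothetical piece codes and a minimal board representation. */
enum { EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING, NUM_PIECES };

typedef struct {
    int piece[64];   /* piece code on each square */
} Board;

/* One routine per piece type, all sharing the same signature.
 * Placeholder bodies: a real engine would scan rays or bitboards here. */
static int MobEmpty(const Board *b, int sq)  { (void)b; (void)sq; return 0; }
static int MobPawn(const Board *b, int sq)   { (void)b; (void)sq; return 0; }
static int MobKnight(const Board *b, int sq) { (void)b; (void)sq; return 3; }
static int MobBishop(const Board *b, int sq) { (void)b; (void)sq; return 5; }
static int MobRook(const Board *b, int sq)   { (void)b; (void)sq; return 6; }
static int MobQueen(const Board *b, int sq)  { (void)b; (void)sq; return 11; }
static int MobKing(const Board *b, int sq)   { (void)b; (void)sq; return 1; }

/* The jump table: indexing by piece code replaces a chain of ifs (or a
 * switch) in the middle of the loop with one indirect call. */
static int (*const MobilityFn[NUM_PIECES])(const Board *, int) = {
    MobEmpty, MobPawn, MobKnight, MobBishop, MobRook, MobQueen, MobKing
};

static int TotalMobility(const Board *b)
{
    int sq, total = 0;

    for (sq = 0; sq < 64; sq++)
        total += MobilityFn[b->piece[sq]](b, sq);
    return total;
}

int main(void)
{
    Board b = { { EMPTY } };

    b.piece[0]  = ROOK;
    b.piece[27] = QUEEN;
    printf("mobility = %d\n", TotalMobility(&b));
    return 0;
}
```

Whether this beats a plain switch depends on the compiler and CPU, so as with everything else here: benchmark both.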
>* look at your eval, and keep some items incremental instead of updating each
>item from scratch again and again

I think what he means here is to do things at MakeMove / UnMakeMove so they are just sitting there in the position struct when you get to eval (a rough sketch follows at the end of this post). This is smart for some things. In addition, look at your eval code and find common terms. For example, all over the place in eval I am scaling things based on material, and I do some stuff with pawn bitboards (which ones are stationary, etc.). Move all that code to the front of eval (just after you make your lazy-eval decision, maybe) and do it once. There's no sense in computing the same junk over and over again.

>* make some items const instead of variable
>* test the speed difference of the members in a struct
>* test the speed difference of the type of a variable (int, long, short, char,
>etc.)

If you are not trying to conserve space, use a variable that is the same size as a register on the machine.

>* try to use as few if (b) tests as possible; use lookup tables instead
>* don't make your lookup tables too big ... in order to keep your program size
>as small as possible

Know the cache line size on your machine. I use this information, for example, to multi-probe the hash table: when I touch the first hash table entry I know that the next three are already in the cache, so it's essentially free to check them too (see the sketch below).

>* program important stuff in assembler
>
>more?

Make sure you commit as much stack space as you need. You probably need more stack space than the default linker gives you. Touching a guard page and waiting for the kernel to grow your stack for you is really slow and not something you want to do in a search. Worse yet, if (for some reason) there is no free commit on your machine when you try to grow the stack, your process will simply be terminated. This is unlikely, but make sure you ask for enough stack.

This is a bit overboard (maybe), but be friends with your OS. I lock the memory behind my hash table so that I am sure I won't take a page fault while probing it.

There are a lot of good webpages about optimizing code with general rules: avoid multiplies and divides unless they can be turned into shifts; make your multidimensional table indexes powers of 2 so the compiler can use shifts to compute offsets; remove branches when you can; etc.

Good luck,
Scott
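A minimal sketch of the incremental-update idea mentioned above (the struct layout, piece values, and function names are illustrative assumptions, not from any particular engine; only material is maintained incrementally here):

```c
#include <stdio.h>

enum { WHITE, BLACK };
enum { EMPTY, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

static const int PieceValue[] = { 0, 100, 300, 300, 500, 900, 0 };

typedef struct {
    int material[2];   /* running material total per side */
} Position;

/* Called from MakeMove when a piece is captured: adjust the running
 * total instead of re-summing the whole board inside Eval(). */
static void MakeCapture(Position *pos, int victim_piece, int victim_side)
{
    pos->material[victim_side] -= PieceValue[victim_piece];
}

/* Called from UnMakeMove: put the value back. */
static void UnmakeCapture(Position *pos, int victim_piece, int victim_side)
{
    pos->material[victim_side] += PieceValue[victim_piece];
}

/* Eval() just reads totals that are already sitting in the struct. */
static int MaterialBalance(const Position *pos, int side)
{
    return pos->material[side] - pos->material[side ^ 1];
}

int main(void)
{
    Position pos = { { 3900, 3900 } };   /* full starting material */

    MakeCapture(&pos, KNIGHT, BLACK);
    printf("white up %d after the capture\n", MaterialBalance(&pos, WHITE));
    UnmakeCapture(&pos, KNIGHT, BLACK);
    printf("balance restored: %d\n", MaterialBalance(&pos, WHITE));
    return 0;
}
```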
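And a rough sketch of the cache-line-aware multi-probe described above, assuming a 16-byte hash entry and a 64-byte cache line so that four entries share one line (the entry layout and names are hypothetical, not Scott's actual table format):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A 16-byte hash entry: four of these fit in one 64-byte cache line. */
typedef struct {
    uint64_t sig;     /* position signature (Zobrist key) */
    int16_t  score;
    int8_t   depth;
    uint8_t  flags;
    uint32_t move;
} HashEntry;

#define ENTRIES_PER_LINE 4   /* 64-byte line / 16-byte entry */

static HashEntry *table;
static size_t     num_buckets;   /* number of 4-entry buckets */

/* Probe all four entries that share the bucket's cache line.  The first
 * touch pulls the whole line into cache, so checking the other three
 * entries is essentially free compared with another random access. */
static HashEntry *ProbeHash(uint64_t sig)
{
    HashEntry *bucket = &table[(sig % num_buckets) * ENTRIES_PER_LINE];
    int i;

    for (i = 0; i < ENTRIES_PER_LINE; i++)
        if (bucket[i].sig == sig)
            return &bucket[i];
    return NULL;   /* miss: caller picks a replacement victim in this bucket */
}

int main(void)
{
    num_buckets = 1u << 20;
    /* In a real engine this allocation should also be aligned to the cache
     * line (posix_memalign / _aligned_malloc) so buckets never straddle lines. */
    table = calloc(num_buckets * ENTRIES_PER_LINE, sizeof(HashEntry));
    if (table == NULL)
        return 1;

    printf("probe result: %p\n", (void *)ProbeHash(0x123456789abcdef0ULL));
    free(table);
    return 0;
}
```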