Author: Gerd Isenberg
Date: 12:49:36 09/16/04
On September 16, 2004 at 13:53:12, Dann Corbit wrote:

>On September 16, 2004 at 13:18:48, Russell Reagan wrote:
>
>>On September 16, 2004 at 03:12:33, Tony Werten wrote:
>>
>>>Yes. I needed a rewrite anyway, and Borland doesn't seem willing to produce a
>>>64-bit compiler in the near future, which is a big disadvantage since I wanted to
>>>use the Kogge-Stone stuff rather than the 0x88 I used until now. (Actually, it
>>>will be a kind of a mixture.)
>>
>>How has the Kogge-Stone stuff been working for you? I was never able to get it
>>to work efficiently enough (rotated bitboards were at least 2x faster). Of
>>course, I didn't write MMX assembly like Gerd, so obviously it won't be as fast
>>as his approach.
>
>I have found that assembly language can impede the ability of the optimizer. So
>a routine in assembly that will bench twice as fast in a simple test harness
>will not cause any discernable difference in a large program, or even slow it
>down.

That may happen, especially with small inlined MSC assembly with fixed register allocation, where parameters are pushed on the stack even if they are already in a register. GCC inline assembly seems smarter here. That is probably one reason why MS no longer supports inline assembly, only intrinsics under the compiler/optimizer's control.

For Kogge-Stone or dumb7 fill routines it is necessary to use 64-bit register files, at least if you want to process several directions or generators simultaneously. If one doesn't use x87 floating point, the eight 64-bit MMX registers are quite a nice resource for doing such fill stuff on x86-32.

The MMX intrinsics with msc6 are not that smart and suffer a lot from unnecessary loads/stores. Using up to eight 64-bit MMX registers, with one or two GP registers to address source or target structures, MSC inline assembly clearly outperforms the MMX intrinsics.

>So assembly always needs to be benchmarked in the place you intend to use it.

Yes. If you test rotated attack getters with lookup tables versus extensive stall-free register processing, the former is even faster in test loops ;-)
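
For readers who have not seen the two fill styles mentioned above, here is a minimal portable C sketch of a northward occluded fill, once in Kogge-Stone form and once as a plain iterative (dumb7-style) loop. It is not the MMX assembly discussed in this thread. The function names, the `U64` typedef, and the square mapping (bit 0 = a1, bit 63 = h8, north = shift left by 8) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;

/* Assumed mapping: bit 0 = a1, bit 7 = h1, bit 63 = h8, so north is << 8.
   gen = sliders (generators), pro = empty squares (propagators). */

/* Kogge-Stone occluded fill: three parallel-prefix steps (8, 16, 32). */
U64 north_occl_kogge_stone(U64 gen, U64 pro) {
    gen |= pro & (gen << 8);
    pro &= pro << 8;
    gen |= pro & (gen << 16);
    pro &= pro << 16;
    gen |= pro & (gen << 32);
    return gen;
}

/* Iterative fill: one rank per step, seven steps cover the whole file. */
U64 north_occl_dumb(U64 gen, U64 pro) {
    U64 flood = gen;
    for (int i = 0; i < 7; ++i) {
        gen = (gen << 8) & pro;
        flood |= gen;
    }
    return flood;
}

/* Attack set: shift the occluded fill one more rank north. */
U64 north_attacks(U64 sliders, U64 empty) {
    return north_occl_kogge_stone(sliders, empty) << 8;
}

/* Both fills should agree; checked on a rook on a1 with a blocker on a5. */
void fill_self_check(void) {
    U64 rook = 1ULL;                    /* a1 */
    U64 blocker = 1ULL << 32;           /* a5 */
    U64 empty = ~(rook | blocker);
    assert(north_occl_kogge_stone(rook, empty) == north_occl_dumb(rook, empty));
}
```

Each function needs only 64-bit shifts, ANDs and ORs on a handful of values, which is why a 64-bit register file (or the MMX registers on x86-32) suits it. Other directions would use different shift amounts and, for horizontal or diagonal moves, file masks to prevent wraparound; those are left out here.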
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.