Author: Matt Taylor
Date: 01:44:27 12/31/02
Go up one level in this thread
On December 30, 2002 at 18:12:29, Walter Faxon wrote: >On December 24, 2002 at 23:28:27, Matt Taylor wrote: > >>http://65.186.75.165/~mtaylor/bitscan/bitscan.html >> ><snip> >> >>-Matt > > >Hi, Matt. > >Well, here is Yet Another Bit Scanner (yabs), this time based on my 32-bit >folding code. Logically this one shouldn't require use of more than three >general purpose registers, and thus it should be about as fast as anything I've >posted before. Just for fun, I tried to be super-pedantic and thus guide any >reasonable late-Pentium compiler into producing near-optimal code. >Consequently, you should be able to hand-translate this into assembler in about >3 minutes, ignoring register choices and scheduling issues. Not necessarily better to use fewer registers. In the case of this routine, it runs into data dependency problems. From the original Pentium onward, all x86 processors have been able to execute multiple things simultaneously. The Pentium had capacity for 2 ops/cycle. Athlon and Pentium 4 can both execute 3 integer ops/cycle in addition to other ops. When computing, you obviously have to wait a cycle for data to get written back to registers. If your next step depends on the results of a computation you're doing now, it will limit your throughput, which is the major bottleneck of this routine. The mask and boolean test version was nice because I could test, set, and shift/accumulate in the same cycle. I'll see if I can't produce better code having done a bit of poking through optimization manuals since I originally translated that routine. The cmovcc was more of an afterthought for that routine, and VC didn't actually generate that part of its code -- was something I had to do. I'll see if it finds an even better solution. >I tried it both as an inlined routine and as a macro (not given here) which >would directly break out of a bit scanning loop without a surrounding test. > >Trying so hard turned out to be a mistake. Unfortunately, the compilers I use >(a slightly dated gcc and MetaWare's High-C, both well-recommended and run at >their highest optimization settings) have proven to be execrable pupils. I >could not have imagined such awful assembly code. I won't burden you or this >list with examples. > >Maybe your compilers are really much better. But you yourself have complained >about some of the stupid choices they make. I've always been impressed with the Microsoft VC compiler, personally. Other people on this list have expressed similar sentiments. One person even labelled it as the "benchmark compiler." I've always found GCC to generate good code as well. Those are the only two I ever use anymore -- GCC and its many ports for any Unix-related stuff I do and VC for Windows-related stuff. I should clarify my statements from earlier. I had trouble getting VC to generate efficient 64-bit code on IA-32. I still think it's a *good* compiler. The thing there was that I recognized that the masks were identical in the upper portion and lower portion, so I was able to do all the operations in native 32-bit. The VC compiler could have generated inefficient but better code even using the full 64-bit datatype -- the problem was that it decided to branch instead. >My goal was a fast bit-scanner in C. I was willing to be slightly pedantic for >a performance increment. But if I have to write in assembler anyway, with all >of its issues, why bother? > >I'm bummed. > >-- Walter Aye. The compiler should do the optimization. Don't get depressed over it, though. For now, the hack solution is to put a few things in inline assembly. Compilers are always improving -- evidence being that VC 7 can beat me in some (most) things. Eventually x86 will outlive its usefulness (at least, I hope) and newer architectures will allow compilers to be even more efficient. I'll recompile that code. I optimized the pmovmskb MMX routine a little, optimized my version of the 32-bit routine, and am adding a version that creates a bitmask and popcounts using pure integer code. (Basically it recognizes that if you popcount bit-1, you get the index of bit.) <snip that code> -Matt
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.