Author: Matt Taylor
Date: 00:08:27 01/01/03
Go up one level in this thread
On January 01, 2003 at 02:29:52, Dave Gomboc wrote: >On December 31, 2002 at 22:30:20, Matt Taylor wrote: > >>On December 31, 2002 at 20:21:38, Dave Gomboc wrote: >> >>>On December 31, 2002 at 14:16:47, Matt Taylor wrote: >>> >>>>On December 31, 2002 at 04:44:27, Matt Taylor wrote: >>>> >>>>>On December 30, 2002 at 18:12:29, Walter Faxon wrote: >>>>> >>>>>>On December 24, 2002 at 23:28:27, Matt Taylor wrote: >>>>>> >>>>>>>http://65.186.75.165/~mtaylor/bitscan/bitscan.html >>>>>>> >>>>>><snip> >>>>>>> >>>>>>>-Matt >>>>>> >>>>>> >>>>>>Hi, Matt. >>>>>> >>>>>>Well, here is Yet Another Bit Scanner (yabs), this time based on my 32-bit >>>>>>folding code. Logically this one shouldn't require use of more than three >>>>>>general purpose registers, and thus it should be about as fast as anything I've >>>>>>posted before. Just for fun, I tried to be super-pedantic and thus guide any >>>>>>reasonable late-Pentium compiler into producing near-optimal code. >>>>>>Consequently, you should be able to hand-translate this into assembler in about >>>>>>3 minutes, ignoring register choices and scheduling issues. >>>> >>>>Eek. VC chose a branch again. It also did something rather funny: >>>> >>>> ... >>>> mov ecx, DWORD PTR [esi] >>>> test ecx, ecx >>>> sete dl >>>> test ecx, ecx >>>> mov DWORD PTR _index$[esp+8], 0 >>>> push edi >>>> mov BYTE PTR _index$[esp+12], dl >>>> jne SHORT $L73441 >>>> ... >>>> >>>>Note the duplication of the "test ecx, ecx" sequence. I guess it didn't realize >>>>that the flags don't get trashed by setcc. >>>> >>>>Anyway, you might have been too pedantic, if there is such a thing. The VC 7 >>>>compiler with full optimization targetted to Pentium Pro and beyond did not >>>>recognize the cmov possibility here. Since cmov is cheap (1 clock >>>>latency/throughput just like ALU instructions), I'm going to tweak the code a >>>>bit. >>>> >>>>Oh, and FYI: either because of your rigorous attempt to point out optimization >>>>potential or because of the different sequence to mask out the LSB, the 32-bit >>>>version of the routine is now faster by 2 clocks than the bsf instruction on my >>>>Athlon. I'm not really sure how much of a speed-up it is %-wise because I ran >>>>into issues with the timing code. Things were executing in negative time on >>>>Intel (P6/P7) chips when I tried to subtract out the timing bias, so I finally >>>>gave up and left it in. (Amusingly this was after I replaced my timing code with >>>>Intel's timing code.) >>>> >>>>-Matt >>> >>>AFAIK, VC never generates code that can't be run on a 386. What it will do is >>>choose an instruction order that runs fastest on the processor being optimized >>>for. >>> >>>Dave >> >>Probably true, though I thought I had seen it would generate the cmovcc >>instructions. I don't see why they care about a 386, though. It's not like Win32 >>will actually run on a 386... >> >>-Matt > >Last time I tried, it worked... extremely slowly! ;-) > >MSVC _could_ generate P-II+ instructions if it guards them by a test for the >type of processor (and also generates other code for older chips to execute). >Eugene Nalimov might well know (then again, he might be under NDA. ;-) > >Dave Ehh...it's usually not worth generating the code for a newer processor unless you're talking about multiple binaries. Checking is annoying, and if you check too often, the overhead defeats any performance gains you get. It would be trivial to distribute multiple binary copies of your application based on the end user's system. It would also be easy to distribute a single binary and implement a scheme that would conditionally modify the code if the user had an older processor (replacing unsupported instructions with equivalent sequences). -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.