Author: Dave Gomboc
Date: 23:29:52 12/31/02
Go up one level in this thread
On December 31, 2002 at 22:30:20, Matt Taylor wrote: >On December 31, 2002 at 20:21:38, Dave Gomboc wrote: > >>On December 31, 2002 at 14:16:47, Matt Taylor wrote: >> >>>On December 31, 2002 at 04:44:27, Matt Taylor wrote: >>> >>>>On December 30, 2002 at 18:12:29, Walter Faxon wrote: >>>> >>>>>On December 24, 2002 at 23:28:27, Matt Taylor wrote: >>>>> >>>>>>http://65.186.75.165/~mtaylor/bitscan/bitscan.html >>>>>> >>>>><snip> >>>>>> >>>>>>-Matt >>>>> >>>>> >>>>>Hi, Matt. >>>>> >>>>>Well, here is Yet Another Bit Scanner (yabs), this time based on my 32-bit >>>>>folding code. Logically this one shouldn't require use of more than three >>>>>general purpose registers, and thus it should be about as fast as anything I've >>>>>posted before. Just for fun, I tried to be super-pedantic and thus guide any >>>>>reasonable late-Pentium compiler into producing near-optimal code. >>>>>Consequently, you should be able to hand-translate this into assembler in about >>>>>3 minutes, ignoring register choices and scheduling issues. >>> >>>Eek. VC chose a branch again. It also did something rather funny: >>> >>> ... >>> mov ecx, DWORD PTR [esi] >>> test ecx, ecx >>> sete dl >>> test ecx, ecx >>> mov DWORD PTR _index$[esp+8], 0 >>> push edi >>> mov BYTE PTR _index$[esp+12], dl >>> jne SHORT $L73441 >>> ... >>> >>>Note the duplication of the "test ecx, ecx" sequence. I guess it didn't realize >>>that the flags don't get trashed by setcc. >>> >>>Anyway, you might have been too pedantic, if there is such a thing. The VC 7 >>>compiler with full optimization targetted to Pentium Pro and beyond did not >>>recognize the cmov possibility here. Since cmov is cheap (1 clock >>>latency/throughput just like ALU instructions), I'm going to tweak the code a >>>bit. >>> >>>Oh, and FYI: either because of your rigorous attempt to point out optimization >>>potential or because of the different sequence to mask out the LSB, the 32-bit >>>version of the routine is now faster by 2 clocks than the bsf instruction on my >>>Athlon. I'm not really sure how much of a speed-up it is %-wise because I ran >>>into issues with the timing code. Things were executing in negative time on >>>Intel (P6/P7) chips when I tried to subtract out the timing bias, so I finally >>>gave up and left it in. (Amusingly this was after I replaced my timing code with >>>Intel's timing code.) >>> >>>-Matt >> >>AFAIK, VC never generates code that can't be run on a 386. What it will do is >>choose an instruction order that runs fastest on the processor being optimized >>for. >> >>Dave > >Probably true, though I thought I had seen it would generate the cmovcc >instructions. I don't see why they care about a 386, though. It's not like Win32 >will actually run on a 386... > >-Matt Last time I tried, it worked... extremely slowly! ;-) MSVC _could_ generate P-II+ instructions if it guards them by a test for the type of processor (and also generates other code for older chips to execute). Eugene Nalimov might well know (then again, he might be under NDA. ;-) Dave
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.