Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Beating Bitscan to Death -- yet again

Author: Dave Gomboc

Date: 23:29:52 12/31/02

Go up one level in this thread


On December 31, 2002 at 22:30:20, Matt Taylor wrote:

>On December 31, 2002 at 20:21:38, Dave Gomboc wrote:
>
>>On December 31, 2002 at 14:16:47, Matt Taylor wrote:
>>
>>>On December 31, 2002 at 04:44:27, Matt Taylor wrote:
>>>
>>>>On December 30, 2002 at 18:12:29, Walter Faxon wrote:
>>>>
>>>>>On December 24, 2002 at 23:28:27, Matt Taylor wrote:
>>>>>
>>>>>>http://65.186.75.165/~mtaylor/bitscan/bitscan.html
>>>>>>
>>>>><snip>
>>>>>>
>>>>>>-Matt
>>>>>
>>>>>
>>>>>Hi, Matt.
>>>>>
>>>>>Well, here is Yet Another Bit Scanner (yabs), this time based on my 32-bit
>>>>>folding code.  Logically this one shouldn't require use of more than three
>>>>>general purpose registers, and thus it should be about as fast as anything I've
>>>>>posted before.  Just for fun, I tried to be super-pedantic and thus guide any
>>>>>reasonable late-Pentium compiler into producing near-optimal code.
>>>>>Consequently, you should be able to hand-translate this into assembler in about
>>>>>3 minutes, ignoring register choices and scheduling issues.
>>>
>>>Eek. VC chose a branch again. It also did something rather funny:
>>>
>>>	...
>>>	mov	ecx, DWORD PTR [esi]
>>>	test	ecx, ecx
>>>	sete	dl
>>>	test	ecx, ecx
>>>	mov	DWORD PTR _index$[esp+8], 0
>>>	push	edi
>>>	mov	BYTE PTR _index$[esp+12], dl
>>>	jne	SHORT $L73441
>>>	...
>>>
>>>Note the duplication of the "test ecx, ecx" sequence. I guess it didn't realize
>>>that the flags don't get trashed by setcc.
>>>
>>>Anyway, you might have been too pedantic, if there is such a thing. The VC 7
>>>compiler with full optimization targetted to Pentium Pro and beyond did not
>>>recognize the cmov possibility here. Since cmov is cheap (1 clock
>>>latency/throughput just like ALU instructions), I'm going to tweak the code a
>>>bit.
>>>
>>>Oh, and FYI: either because of your rigorous attempt to point out optimization
>>>potential or because of the different sequence to mask out the LSB, the 32-bit
>>>version of the routine is now faster by 2 clocks than the bsf instruction on my
>>>Athlon. I'm not really sure how much of a speed-up it is %-wise because I ran
>>>into issues with the timing code. Things were executing in negative time on
>>>Intel (P6/P7) chips when I tried to subtract out the timing bias, so I finally
>>>gave up and left it in. (Amusingly this was after I replaced my timing code with
>>>Intel's timing code.)
>>>
>>>-Matt
>>
>>AFAIK, VC never generates code that can't be run on a 386.  What it will do is
>>choose an instruction order that runs fastest on the processor being optimized
>>for.
>>
>>Dave
>
>Probably true, though I thought I had seen it would generate the cmovcc
>instructions. I don't see why they care about a 386, though. It's not like Win32
>will actually run on a 386...
>
>-Matt

Last time I tried, it worked... extremely slowly! ;-)

MSVC _could_ generate P-II+ instructions if it guards them by a test for the
type of processor (and also generates other code for older chips to execute).
Eugene Nalimov might well know (then again, he might be under NDA. ;-)

Dave



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.