Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Speaking of low bit...use the CMOVcc instructions!

Author: Wylie Garvin
Date: 21:33:09 01/23/02
On January 23, 2002 at 00:59:58, Eugene Nalimov wrote:

>On January 23, 2002 at 00:22:12, Robert Hyatt wrote:
>
>>On January 22, 2002 at 12:22:15, Leen Ammeraal wrote:
>>
>>>Hi Miguel,
>>>I tested Dann's version, LOWBIT, not your routine.
>>>I forgot to write that I admire
>>>this almost magical algorithm very much and that I would
>>>be very pleased if someone could explain it.
>>
>>
>>The idea is very simple.  Test the end of the 64 bit value that has
>>bits 32-63 with a bit scan instruction (which will give a result of
>>0-31.  Add 32 to this to adjust to 32-63.  Now test the other end.  If
>>the bit scan finds no bits set, it will leave the dest register unchanged
>>which leaves the result of your high-order end in there.  If it does find
>>a bit set (0-31) it will overwrite the destination and the result will be
>>0-31.
>>
>>Works well.  Not quite so easy to do for "hibit" as a branch is going to
>>be necessary...
>
>Unfortunately, Intel's documentation does not say "Unchanged if there is no bit
>set". It says "Undefined". Maybe on the current CPUs it works, but I strongly
>recommend against such tricks.
>
>Eugene
>

As Eugene says, the Intel documentation says "undefined".  There have been
numerous instances in the history of Intel chips where certain behaviour turned
out to be reliable in practice even though Intel said it was "undefined", either
because they envisioned the possible need to change the behaviour in the future,
or simply wanted competitors to produce slightly incompatible chips (a very few
instructions were never documented, even though successive generations of Intel
chips continued to support them).

For the record, all Intel chips and clones since the P6 architecture support
conditional-move instructions CMOVcc, which can be used to avoid the undefined
behaviour here.  I.e. you could do something like:

        mov ecx,dword ptr [bits+4]
        mov ebx,dword ptr [bits]
        bsf ecx,ecx
        add ecx,32
        bsf eax,ebx       ; <-- only if this fails..
        cmovz eax,ecx     ; <-- will this be executed.

I haven't checked to make sure, but from reading Intel docs I believe the cmovz
here costs the same as an ordinary move.  HOWEVER, the dependency chain between
the second cmovz and the bsf means they can't be overlapped.  The C compiler
will not mix other insns with your hand-coded assembly; unless you write the
routine in assembly, can't do much about that.

Anyway, I suggest using this variant, if it doesn't slow things down noticably.
No one wants to run this code on a Pentium or lower anyway, because the bit-scan
instruction BSF was *brutally* slow before the P6, when it became pretty cheap.

>>
>>
>>
>>If your
>>>experiments show that your version is better, please let
>>>me (or rather, us) know. I remember that this assembly
>>>routine was faster than the C(++) routine that I replaced
>>>(a long time ago).
>>>By the way, it is a long time since I wrote assembly
>>>language myself, and I don't understand this routine
>>>either. In particular, I find it miraculous that it
>>>contains neither a loop nor a table-lookup. Any explanation
>>>would be very welcome, preferably in a C equivalent,
>>>if possible, just to show how it works.
>>>
>>>Leen
>>>
>>>
>>>
>>>On January 22, 2002 at 12:01:07, Miguel A. Ballicora wrote:
>>>
>>>>On January 22, 2002 at 11:23:36, Leen Ammeraal wrote:
>>>>
>>>>>I did an experiment with your function lowbit, adapting
>>>>>it (by using a union of both an __int64 and a
>>>>>struct of two ints, in view of 64 bit bitboards)
>>>>>but it was slightly slower (with very little difference)
>>>>>than the following assembly routine which you (Dann) also
>>>>>provided:
>>>>>
>>>>>__forceinline
>>>>>int FirstPiece(BITBOARD bits)
>>>>>{
>>>>>//    Given is: bits != 0
>>>>>__asm {
>>>>>;       64-bit number to move into ECX:EBX, will never be zero!
>>>>>        mov ecx,dword ptr [bits+4]
>>>>>        mov ebx,dword ptr [bits]
>>>>>        bsf eax,ecx
>>>>>        add eax,32
>>>>>        bsf eax,ebx
>>>>>      }
>>>>>}
>>>>>
>>>>>So I think I will stick to this assembly function.
>>>>>Leen
>>>>
>>>>Hi Leen,
>>>>
>>>>I am a bit confused. You answer to Dann but you talk about the function lowbit()
>>>>I posted a function lowbit and Dann posted a macro LOWBIT (by L.Kirby).
>>>>Which one you say you tested and is slightly slower than the assembler version?
>>>>
>>>>If the difference is not high, maybe the C version of (LOWBIT or lowbit or
>>>>whatever) could be optimized and or coded in assembler.
>>>>
>>>>Regards,
>>>>Miguel
>>>>PS: did you test lowbit2?
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.