Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: Speed factors with 32 bit to 64 migration

Author: Gerd Isenberg
Date: 14:21:08 05/02/05
On May 02, 2005 at 15:08:17, Dieter Buerssner wrote:

>On May 02, 2005 at 02:26:42, Gerd Isenberg wrote:
>
>Both routines look very fast on 64 bit. But wouldn't the bitscan upcode (I
>forgot the name for the moment - bsr?) be even faster? No dirty tricks needed
>anymore or branches. Needs just 1 register (when the bb can be destroyed)? I am
>aware, that MSVC will not support inline assembly anymore, but the intrinsic
>should do it at least as efficiently?

Bob may already know - may be due to codesize, lookup and only using one
register. Anyway - 9 cycles vector path for bsf, bsr is a shame ;-)

The "boolean" Intrinsic prototype implies some additional jump - even if always
(not) taken:

unsigned char _BitScanForward64(
unsigned long *Index,
unsigned __int64Mask
);

The 2 operand 32/64 imul is direct path (and not double direct path as the
64/128-bit result equivalents) and only 3/4! cycles latency.
If one can stay with clumsy bitindices ;-) i'll bet the deBruijn mul/shift with
leading neg/and is competitive with following assembly:

  mov    rax, QWORD PTR _bb$[esp]
  mov    rdx, rax
  mov    rcx, DebruijnConstant64  ; or load 64-bit constant from memory
  neg    rax
  and    rax, rdx
  imul   rax, rcx
  shr    rax, 58

or with only two registers but more dependencies:

  mov    rax, QWORD PTR _bb$[esp]
  mov    rdx, rax
  neg    rax
  and    rax, rdx
  mov    rdx, DebruijnConstant64  ; or load 64-bit constant from memory
  imul   rax, rdx
  shr    rax, 58


>
>
>>http://supertech.csail.mit.edu/papers/debruijn.pdf
>
>>int bitScan(BitBoard bb)
>>{
>> bb &= -bb;
>
>1 mov, 1 neg, 1 and
>
>> return lookUp[(bb * deBruijn64) >> (64-6)]; // range is 0..63
>
>without the table lookup: 1 64 bit mul, 1 shift.
>
>>}
>>
>>int bitScan(BitBoard bb)
>>{
>> bb ^= (bb - 1);
>
>1 mov, 1 dec, 1 xor
>
>> unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));
>
>1 mov 1 shift, 1 xor, also one operation for the casts? Or can you just use eax
>from former rax?

Targeting eax clears upper dword of rax. AFAIK there is no panalty using
registers alternately 32- or 64-bit wise. 32-bit instructions are still the
shortest and fastest. As address operand only 64-bit registers. Therefor the
optimization hint with unsigned int as index - zero extension is for nothing,
but 32/64-bit sign extension must be done explicitly with an additional
instruction.

Targeting 16-bit ax leaves the upper word of eax unchanged (but clears high rax
as well - iirc). The processer stalls on reading ah after writing al or vice
versa.

>Wasn't there some small penalty on 32 bit for doing so?

One additional prefix for 16-bit is penalty enough ;-)


>I would cast to unsigned int (or long, when 16 bit compiler compability is
>wanted) - but it won't matter on typical hardware.
>
>> return lookUp[(foldedLSB * 0x78291ACF) >> (32-6)]; // range is 0..63
>
>1 64 bit mul, 1 shift. Mov of the constant to some register not counted (one
>could use a global, as above - or above might be const, then C++ compilers often
>will not store the data anywhere).
>
>>}
>
>1 shift and mov operation more, 32 vs 64 bit mul, other operations about equal.
>Seems almost a dead race?  Both routines will only need two registers in the
>optimal case? (ax and dx).

The folded one needs additional shift right 32 and xor for shorter shr and
shorter and faster imul.

>
>Cheers,
>Dieter

the 32-bit code without lookup:

int bitScan(BitBoard bb)
{
 bb ^= (bb - 1);
 unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));
 return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63
}

?bitScan@@YAH_K@Z PROC                              ; bitScan
  00530	8b 4c 24 04       mov    ecx, DWORD PTR _bb$[esp-4]
  00534	8b 44 24 08       mov    eax, DWORD PTR _bb$[esp]
  00538	56                push   esi
  00539	8b f0             mov    esi, eax
  0053b	8b d1             mov    edx, ecx
  0053d	83 c2 ff          add    edx, -1
  00540	83 d6 ff          adc    esi, -1
  00543	33 c6             xor    eax, esi
  00545	33 ca             xor    ecx, edx
  00547	33 c1             xor    eax, ecx
  00549	69 c0 cf 1a 29 78 imul   eax, 78291acfH
  0054f	c1 e8 1a          shr    eax, 26
  00552	5e                pop    esi
  00553	c3                ret    0


Cheers,
Gerd
Re: Speed factors with 32 bit to 64 migration Gerd Isenberg 04:34:36 05/04/05
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.