Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Speed factors with 32 bit to 64 migration

Author: Dieter Buerssner

Date: 12:08:17 05/02/05

Go up one level in this thread


On May 02, 2005 at 02:26:42, Gerd Isenberg wrote:

Hi Gerd,

>yes, mul has become incredible fast on amd64 - except 16-bit mul no more vector
>path and 3/5 cycles for 32/64-bit multiplication. I wonder whether Matt Taylor'S
>folding trick pays off for AMD64 or whether the pure 64-bit de Bruijn
>multiplication is faster. The folding trick works fine, if the bitboard is kept
>in 32-bit manner in two 32-bit registers. Bitboard inside one 64-bit register
>requires the shift right by 32.

Both routines look very fast on 64 bit. But wouldn't the bitscan upcode (I
forgot the name for the moment - bsr?) be even faster? No dirty tricks needed
anymore or branches. Needs just 1 register (when the bb can be destroyed)? I am
aware, that MSVC will not support inline assembly anymore, but the intrinsic
should do it at least as efficiently?


>http://supertech.csail.mit.edu/papers/debruijn.pdf

>int bitScan(BitBoard bb)
>{
> bb &= -bb;

1 mov, 1 neg, 1 and

> return lookUp[(bb * deBruijn64) >> (64-6)]; // range is 0..63

without the table lookup: 1 64 bit mul, 1 shift.

>}
>
>int bitScan(BitBoard bb)
>{
> bb ^= (bb - 1);

1 mov, 1 dec, 1 xor

> unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));

1 mov 1 shift, 1 xor, also one operation for the casts? Or can you just use eax
from former rax? Wasn't there some small penalty on 32 bit for doing so?
I would cast to unsigned int (or long, when 16 bit compiler compability is
wanted) - but it won't matter on typical hardware.

> return lookUp[(foldedLSB * 0x78291ACF) >> (32-6)]; // range is 0..63

1 64 bit mul, 1 shift. Mov of the constant to some register not counted (one
could use a global, as above - or above might be const, then C++ compilers often
will not store the data anywhere).

>}

1 shift and mov operation more, 32 vs 64 bit mul, other operations about equal.
Seems almost a dead race?  Both routines will only need two registers in the
optimal case? (ax and dx).

Cheers,
Dieter



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.