Author: Gerd Isenberg
Date: 14:21:08 05/02/05
Go up one level in this thread
On May 02, 2005 at 15:08:17, Dieter Buerssner wrote:
>On May 02, 2005 at 02:26:42, Gerd Isenberg wrote:
>
>Both routines look very fast on 64 bit. But wouldn't the bitscan upcode (I
>forgot the name for the moment - bsr?) be even faster? No dirty tricks needed
>anymore or branches. Needs just 1 register (when the bb can be destroyed)? I am
>aware, that MSVC will not support inline assembly anymore, but the intrinsic
>should do it at least as efficiently?
Bob may already know - may be due to codesize, lookup and only using one
register. Anyway - 9 cycles vector path for bsf, bsr is a shame ;-)
The "boolean" Intrinsic prototype implies some additional jump - even if always
(not) taken:
unsigned char _BitScanForward64(
unsigned long *Index,
unsigned __int64Mask
);
The 2 operand 32/64 imul is direct path (and not double direct path as the
64/128-bit result equivalents) and only 3/4! cycles latency.
If one can stay with clumsy bitindices ;-) i'll bet the deBruijn mul/shift with
leading neg/and is competitive with following assembly:
mov rax, QWORD PTR _bb$[esp]
mov rdx, rax
mov rcx, DebruijnConstant64 ; or load 64-bit constant from memory
neg rax
and rax, rdx
imul rax, rcx
shr rax, 58
or with only two registers but more dependencies:
mov rax, QWORD PTR _bb$[esp]
mov rdx, rax
neg rax
and rax, rdx
mov rdx, DebruijnConstant64 ; or load 64-bit constant from memory
imul rax, rdx
shr rax, 58
>
>
>>http://supertech.csail.mit.edu/papers/debruijn.pdf
>
>>int bitScan(BitBoard bb)
>>{
>> bb &= -bb;
>
>1 mov, 1 neg, 1 and
>
>> return lookUp[(bb * deBruijn64) >> (64-6)]; // range is 0..63
>
>without the table lookup: 1 64 bit mul, 1 shift.
>
>>}
>>
>>int bitScan(BitBoard bb)
>>{
>> bb ^= (bb - 1);
>
>1 mov, 1 dec, 1 xor
>
>> unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));
>
>1 mov 1 shift, 1 xor, also one operation for the casts? Or can you just use eax
>from former rax?
Targeting eax clears upper dword of rax. AFAIK there is no panalty using
registers alternately 32- or 64-bit wise. 32-bit instructions are still the
shortest and fastest. As address operand only 64-bit registers. Therefor the
optimization hint with unsigned int as index - zero extension is for nothing,
but 32/64-bit sign extension must be done explicitly with an additional
instruction.
Targeting 16-bit ax leaves the upper word of eax unchanged (but clears high rax
as well - iirc). The processer stalls on reading ah after writing al or vice
versa.
>Wasn't there some small penalty on 32 bit for doing so?
One additional prefix for 16-bit is penalty enough ;-)
>I would cast to unsigned int (or long, when 16 bit compiler compability is
>wanted) - but it won't matter on typical hardware.
>
>> return lookUp[(foldedLSB * 0x78291ACF) >> (32-6)]; // range is 0..63
>
>1 64 bit mul, 1 shift. Mov of the constant to some register not counted (one
>could use a global, as above - or above might be const, then C++ compilers often
>will not store the data anywhere).
>
>>}
>
>1 shift and mov operation more, 32 vs 64 bit mul, other operations about equal.
>Seems almost a dead race? Both routines will only need two registers in the
>optimal case? (ax and dx).
The folded one needs additional shift right 32 and xor for shorter shr and
shorter and faster imul.
>
>Cheers,
>Dieter
the 32-bit code without lookup:
int bitScan(BitBoard bb)
{
bb ^= (bb - 1);
unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));
return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63
}
?bitScan@@YAH_K@Z PROC ; bitScan
00530 8b 4c 24 04 mov ecx, DWORD PTR _bb$[esp-4]
00534 8b 44 24 08 mov eax, DWORD PTR _bb$[esp]
00538 56 push esi
00539 8b f0 mov esi, eax
0053b 8b d1 mov edx, ecx
0053d 83 c2 ff add edx, -1
00540 83 d6 ff adc esi, -1
00543 33 c6 xor eax, esi
00545 33 ca xor ecx, edx
00547 33 c1 xor eax, ecx
00549 69 c0 cf 1a 29 78 imul eax, 78291acfH
0054f c1 e8 1a shr eax, 26
00552 5e pop esi
00553 c3 ret 0
Cheers,
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.