Author: Gerd Isenberg
Date: 23:26:42 05/01/05
Go up one level in this thread
On May 01, 2005 at 13:43:45, Dieter Buerssner wrote:
>On May 01, 2005 at 09:09:40, Gerd Isenberg wrote:
>
>>On May 01, 2005 at 08:14:16, Dieter Buerssner wrote:
>>
>>>On May 01, 2005 at 07:43:35, Gerd Isenberg wrote:
>>>
>>>>On May 01, 2005 at 05:15:57, Dieter Buerssner wrote:
>>>>
>>>>>For real 64 bit multiplications/devisions: How is the cycle count for these
>>> ^ that typo looks horrible.
>>>>>instructions on AMD64? I would fear, that it needs quite a few more cycles, than
>>>>>32 bit counter parts.
>>>>
>> Syntax Encoding Decode Latency
>> First ModRM type
>>>>IMUL mreg16 F7h 11-101-xxx VectorPath 4
>>>>IMUL mreg32/64 F7h 11-101-xxx Double 3/ 5
>>>>MUL mreg16 F7h 11-100-xxx VectorPath 4
>>>>MUL mreg32/64 F7h 11-100-xxx Double 3/ 5
>>>>DIV mreg16/32/64 F7h 11-110-xxx VectorPath 23/39/71
>>>>IDIV mreg16/32/64 F7h 11-111-xxx VectorPath 26/42/74
>>>
>>>Gerd, thanks for positing the numbers. I am not sure, I understand the table.
>>>F7h is the upcode prefix for 64 bit operations?
>>
>>No - the first opcode byte of mul/div with implicit ax/eax/rax dx/edx/rdx
>>operands. 64-bit prefix is extra.
>
>Ok, then it is clear. I was confused about the F7 in every instruction, because
>it was rather clear that some will need a prefix ...
>
>Multiplication looks very fast. I remember the 386 days, when I would know the
>cycles for many upcodes by heart. I forgot much of this - I mean to remember
>that MUL reg32 use ~40 cycles. ADD and ADC used 2 cycles. Those were also the
>days, where one rather reliably could just count cycles. In a multi-precision
>library for multiplication, it was almost enough to count the muls (and for
>division muls and divs). The relation div/mul cycles was much better then, IIRC.
>That was typical for x86 - other processors did not have div (at least no
>remainder). Also, quite surprisingly, shifts and especially rot with carry
>(which was useful for some multiprecision routines) were very slow on x86 then.
>
>Sorry for the excurse from the initial subject of Steven Edwards.
>
>Thanks for your answers,
>Dieter
Hi Dieter,
yes, mul has become incredible fast on amd64 - except 16-bit mul no more vector
path and 3/5 cycles for 32/64-bit multiplication. I wonder whether Matt Taylor'S
folding trick pays off for AMD64 or whether the pure 64-bit de Bruijn
multiplication is faster. The folding trick works fine, if the bitboard is kept
in 32-bit manner in two 32-bit registers. Bitboard inside one 64-bit register
requires the shift right by 32.
http://supertech.csail.mit.edu/papers/debruijn.pdf
int bitScan(BitBoard bb)
{
bb &= -bb;
return lookUp[(bb * deBruijn64) >> (64-6)]; // range is 0..63
}
int bitScan(BitBoard bb)
{
bb ^= (bb - 1);
unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));
return lookUp[(foldedLSB * 0x78291ACF) >> (32-6)]; // range is 0..63
}
Cheers,
Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.