Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Speed factors with 32 bit to 64 migration

Author: Gerd Isenberg

Date: 23:26:42 05/01/05

Go up one level in this thread


On May 01, 2005 at 13:43:45, Dieter Buerssner wrote:

>On May 01, 2005 at 09:09:40, Gerd Isenberg wrote:
>
>>On May 01, 2005 at 08:14:16, Dieter Buerssner wrote:
>>
>>>On May 01, 2005 at 07:43:35, Gerd Isenberg wrote:
>>>
>>>>On May 01, 2005 at 05:15:57, Dieter Buerssner wrote:
>>>>
>>>>>For real 64 bit multiplications/devisions: How is the cycle count for these
>>>                                   ^ that typo looks horrible.
>>>>>instructions on AMD64? I would fear, that it needs quite a few more cycles, than
>>>>>32 bit counter parts.
>>>>
>>  Syntax          Encoding       Decode      Latency
>>                  First ModRM      type
>>>>IMUL mreg16       F7h 11-101-xxx VectorPath  4
>>>>IMUL mreg32/64    F7h 11-101-xxx Double         3/ 5
>>>>MUL  mreg16       F7h 11-100-xxx VectorPath  4
>>>>MUL  mreg32/64    F7h 11-100-xxx Double         3/ 5
>>>>DIV  mreg16/32/64 F7h 11-110-xxx VectorPath 23/39/71
>>>>IDIV mreg16/32/64 F7h 11-111-xxx VectorPath 26/42/74
>>>
>>>Gerd, thanks for positing the numbers. I am not sure, I understand the table.
>>>F7h is the upcode prefix for 64 bit operations?
>>
>>No - the first opcode byte of mul/div with implicit ax/eax/rax dx/edx/rdx
>>operands. 64-bit prefix is extra.
>
>Ok, then it is clear. I was confused about the F7 in every instruction, because
>it was rather clear that some will need a prefix ...
>
>Multiplication looks very fast. I remember the 386 days, when I would know the
>cycles for many upcodes by heart. I forgot much of this - I mean to remember
>that MUL reg32 use ~40 cycles. ADD and ADC used 2 cycles. Those were also the
>days, where one rather reliably could just count cycles. In a multi-precision
>library for multiplication, it was almost enough to count the muls (and for
>division muls and divs). The relation div/mul cycles was much better then, IIRC.
>That was typical for x86 - other processors did not have div (at least no
>remainder). Also, quite surprisingly, shifts and especially rot with carry
>(which was useful for some multiprecision routines) were very slow on x86 then.
>
>Sorry for the excurse from the initial subject of Steven Edwards.
>
>Thanks for your answers,
>Dieter

Hi Dieter,

yes, mul has become incredible fast on amd64 - except 16-bit mul no more vector
path and 3/5 cycles for 32/64-bit multiplication. I wonder whether Matt Taylor'S
folding trick pays off for AMD64 or whether the pure 64-bit de Bruijn
multiplication is faster. The folding trick works fine, if the bitboard is kept
in 32-bit manner in two 32-bit registers. Bitboard inside one 64-bit register
requires the shift right by 32.

http://supertech.csail.mit.edu/papers/debruijn.pdf

int bitScan(BitBoard bb)
{
 bb &= -bb;
 return lookUp[(bb * deBruijn64) >> (64-6)]; // range is 0..63
}

int bitScan(BitBoard bb)
{
 bb ^= (bb - 1);
 unsigned int foldedLSB = ((int) bb) ^ ((int)(bb>>32));
 return lookUp[(foldedLSB * 0x78291ACF) >> (32-6)]; // range is 0..63
}

Cheers,
Gerd



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.