Computer Chess Club Archives

Search

Terms

Messages

Subject: Re: Speed factors with 32 bit to 64 migration

Author: Gerd Isenberg

Date: 04:43:35 05/01/05

On May 01, 2005 at 05:15:57, Dieter Buerssner wrote:

>On April 30, 2005 at 05:19:03, Gerd Isenberg wrote:
>
>>On April 30, 2005 at 00:57:42, Steven Edwards wrote:
>>
>>>Moving a program from a 32 bit system to a 64 bit system does not by itself
>>>guarantee a speed up:
>>>
>>>1. If all pointers move from 32 to 64 bits, there may actually be a slow down
>>>due to possibly increased memory traffic.  I would assume that the 64 bit
>>>programming model will be 32 bit pointers with a combination of 32 bit and 64
>>>bit integers.
>>
>>Similar for x86-32 versus x86-64. Pointers are 64 bit - so saving/restoring
>>pointers to memory requires double space. Due to the implicit zero extension of
>>64-bit registers as operands of 32-bit operations, UInt32 is faster than Int32
>>as index variable for x86-64, because Int32 requires an explicit sign extension
>>to 64 bit.
>>
>>>
>>>2. In general, using 64 bit values instead of two 32 bit values will halve the
>>>register allocation requirement.
>>>
>>>3. All bitboard operations will speed up by a factor of two (at least).  Some
>>>minor low level re-coding may be needed to help with this.
>>
>>hmm... not sure about that for x86-64. A pair of 32-bit instructions may
>>processed in "parallel", while a single 64 bit instruction has higher
>>probability to stall.
>>
>>Also, on x86-64 64-bit operations require an aditionl prefix byte (same for
>>using the new r8-r15 registers). So you have often three opcode bytes versus
>>2*two opcode bytes.
>
>It may also depend, how you define "bitboard operation". If you define it at a
>higher level, for example a "popcount operation", depending on the algorithm
>used, there might be not much difference between 32 and 64 bit. For example the
>lookup table method, where you break the bitboard in (say) 8 bytes. Or even the
>typical
>
>  pc = 0;
>  while (bb)
>  {
>    pc++;
>    bb &= bb - 1;
>  }
>
>If you do this like above, a factor of 2 might result. A very small
>microoptimization for 32 bits is to do the loop on 2 32 bit parts of the
>bitboard, and the operation count is almost the same for 64 bit and 32 bit
>machines. There may be a little bit overhead for extracting the two parts
>(depending on inline situation, compiler, ... this may almost be neglible).
>
>Even the "magic mask and shift" popcount does not need the double operation
>count, because at later "rounds" of the algorithm, you can add 2 32 bit parts
>together and do the remaining algorithm on one 32 bit word.
>
>Even another example: in my engine I use some bitboards, where I will need
>several operations on the "King sqaure bit". One can add one outer if (sq >= 32)
>and duplicate the code, now with all operations on 32 bit words.
>
>About the operations in parallel: maybe it is even worse for the 32 version for
>sub and add. I think add adc pairs don't pair well, because the adc needs the
>carry bit already calculatet from the add. But or and and operations are
>probably much more dominant in bitboard programs, than add and sub.
>

Correct - most often bitboard add/sub occurs while isolating one bits.
You are right with add adc, sub sbb pairs, the carry/borrow dependency makes it
more worse than only register dependency in shld shl and of course even more
worse than independent pairs of bitwise operations.

The implicit independency with a pair of bitbise instructions, while opcode size
still favours 32-bit as well, implies a 64-bit advantage of less than two on
bitwise or,and,xor,not operations. Register allocation requirements are an
important issue here as well - in favour for 64-bit mode.

>>>
>>>4. Bitboard operations that require 64 bit shifts or rotations will speed up by
>>>(roughly) a factor of  four.
>>
>>On x86-64 that may be true for variable shifts. Constant 64-shifts for instance
>>to get pawn attacks, are not that bad with two (dependent) x86-32 instructions.
>
>Variable shifts will need some sort of "if", which might hurt. But that "double
>precicion shift" upcode comes in handy for 64 bit shifts.
>

Yes.

>Rotations cannot be done efficiently in Standard C in either case (but the
>compiler might have extensions to do it). When I tried, compilers could not
>optimize the typical C statements into rot instructions.
>
>>>
>>>5. Arithmetic operations on 64 bit integers will likely see a speed doubling for
>>>add/subtract and a quadrupling for multiply/divide.
>>
>>For 64-bit multiply/divide even more than doubling the speed, for add/sub most
>>likely less than doubling (similar to constant shifts, a bit worse than two
>>independent 32-bitwise operations).
>
>In an engine, one will rarely need multiply/devide (on bitboards). One example
>would be (using modulo instead of div)
>
>  hash_index = hash_key % number_of_entries;
>
>and this will be more than 4 times slower, than a 32 bit modulo. But it is not
>really needed. Just use the % on a 32 bit part of the hash_key (the exact result
>will not be the same, but it will work as good). BTW.: In this example,
>(assuming hash_key is 64 bit, number_of_entries and hash_index 32 bit) the x86
>instruction div, which devides 64 bit by 32 bit is not sufficient, because the
>devision will overflow in general (devide by zero exception results).
>
>For multiplication, I cannot imagine a good example in a chess engine. My other
>trick to calculate the hash index needs 32 * 32 -> 64 bit multiplication.

Yes ;-)

I intend to use 64*64=64bit multiplication for DeBruijn products with
from**2-to**2 bitboards.

>
> hash_index = ((unsigned long)hash_key * (unsigned long long)number_of_entries)
>>> 32;
>
>Compilers are able to generate just a normal "mul" instruction, and the result
>is in edx. Careful (but almost obvious) casting is needed, however, to give the
>compiler a hint, that really 32*32->64 bit multiplication is needed.
>
>For real 64 bit multiplications/devisions: How is the cycle count for these
>instructions on AMD64? I would fear, that it needs quite a few more cycles, than
>32 bit counter parts.

IMUL mreg16       F7h 11-101-xxx VectorPath  4
IMUL mreg32/64    F7h 11-101-xxx Double         3/ 5
MUL  mreg16       F7h 11-100-xxx VectorPath  4
MUL  mreg32/64    F7h 11-100-xxx Double         3/ 5
DIV  mreg16/32/64 F7h 11-110-xxx VectorPath 23/39/71
IDIV mreg16/32/64 F7h 11-111-xxx VectorPath 26/42/74

Cheers,
Gerd

>
>Cheers,
>Dieter

Re: Speed factors with 32 bit to 64 migration Dieter Buerssner 05:14:16 05/01/05
- Re: Speed factors with 32 bit to 64 migration Gerd Isenberg 06:43:52 05/01/05
- Re: Speed factors with 32 bit to 64 migration Gerd Isenberg 06:09:40 05/01/05
  - Re: Speed factors with 32 bit to 64 migration Dieter Buerssner 10:43:45 05/01/05
    - Re: Speed factors with 32 bit to 64 migration Gerd Isenberg 23:26:42 05/01/05
      - Re: Speed factors with 32 bit to 64 migration Dieter Buerssner 12:08:17 05/02/05
        
        Re: Speed factors with 32 bit to 64 migration Gerd Isenberg 14:21:08 05/02/05
        
        Re: Speed factors with 32 bit to 64 migration Gerd Isenberg 04:34:36 05/04/05

This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.