Author: Dieter Buerssner
Date: 02:15:57 05/01/05
On April 30, 2005 at 05:19:03, Gerd Isenberg wrote:
>On April 30, 2005 at 00:57:42, Steven Edwards wrote:
>
>>Moving a program from a 32 bit system to a 64 bit system does not by itself
>>guarantee a speed up:
>>
>>1. If all pointers move from 32 to 64 bits, there may actually be a slow down
>>due to possibly increased memory traffic. I would assume that the 64 bit
>>programming model will be 32 bit pointers with a combination of 32 bit and 64
>>bit integers.
>
>Similar for x86-32 versus x86-64. Pointers are 64 bit - so saving/restoring
>pointers to memory requires double space. Due to the implicit zero extension of
>64-bit registers as operands of 32-bit operations, UInt32 is faster than Int32
>as index variable for x86-64, because Int32 requires an explicit sign extension
>to 64 bit.
>
>>
>>2. In general, using 64 bit values instead of two 32 bit values will halve the
>>register allocation requirement.
>>
>>3. All bitboard operations will speed up by a factor of two (at least). Some
>>minor low level re-coding may be needed to help with this.
>
>hmm... not sure about that for x86-64. A pair of 32-bit instructions may be
>processed in "parallel", while a single 64-bit instruction has a higher
>probability to stall.
>
>Also, on x86-64, 64-bit operations require an additional prefix byte (same for
>using the new r8-r15 registers). So you often have three opcode bytes versus
>2*two opcode bytes.
It may also depend on how you define "bitboard operation". If you define it at a
higher level, for example a "popcount operation", then depending on the
algorithm used there might not be much difference between 32 and 64 bit. For
example the lookup table method, where you break the bitboard into (say) 8
bytes. Or even the typical
pc = 0;
while (bb)
{
    pc++;
    bb &= bb - 1;   /* clears the least significant set bit */
}
If you do it exactly as above, a factor of 2 might result on a 32-bit machine. A
very small micro-optimization for 32 bits is to run the loop on the two 32-bit
halves of the bitboard; then the operation count is almost the same on 64-bit
and 32-bit machines. There may be a little overhead for extracting the two
halves (depending on the inlining situation, compiler, ... this may be almost
negligible).
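A sketch of that split loop (the function name and the use of the C99 fixed-width types are mine, not from the post):

```c
#include <stdint.h>

/* Kernighan-style popcount run on the two 32-bit halves separately,
   so a 32-bit machine never has to emulate 64-bit tests, & and -. */
static unsigned popcount64_split(uint64_t bb)
{
    uint32_t lo = (uint32_t)bb;
    uint32_t hi = (uint32_t)(bb >> 32);
    unsigned pc = 0;

    while (lo) { pc++; lo &= lo - 1; }   /* clear lowest set bit */
    while (hi) { pc++; hi &= hi - 1; }
    return pc;
}
```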
Even the "magic mask and shift" popcount does not need double the operation
count, because in the later "rounds" of the algorithm you can add the two 32-bit
parts together and run the rest of the algorithm on one 32-bit word.
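One way to write that folding (my own sketch of the idea; the per-byte counts after the nibble stage are at most 8 per half, so adding the halves cannot overflow a byte):

```c
#include <stdint.h>

/* Mask-and-shift popcount of a 64-bit bitboard using only 32-bit
   operations: the first rounds run on each half, then the halves are
   added and the final sum is done on a single 32-bit word. */
static unsigned popcount64_via32(uint64_t bb)
{
    uint32_t lo = (uint32_t)bb;
    uint32_t hi = (uint32_t)(bb >> 32);

    lo = lo - ((lo >> 1) & 0x55555555u);              /* 2-bit sums  */
    hi = hi - ((hi >> 1) & 0x55555555u);
    lo = (lo & 0x33333333u) + ((lo >> 2) & 0x33333333u); /* 4-bit sums */
    hi = (hi & 0x33333333u) + ((hi >> 2) & 0x33333333u);
    lo = (lo + (lo >> 4)) & 0x0F0F0F0Fu;              /* byte sums   */
    hi = (hi + (hi >> 4)) & 0x0F0F0F0Fu;

    lo += hi;                     /* fold: each byte is at most 16 */
    return (lo * 0x01010101u) >> 24;   /* sum the four bytes */
}
```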
Yet another example: in my engine I use some bitboards where I need several
operations on the "king square bit". One can add one outer if (sq >= 32) and
duplicate the code, now with all operations on 32-bit words.
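The shape of that trick, in a hypothetical helper (names and the specific test are illustrative, not from the engine described):

```c
#include <stdint.h>

/* Branch once on which half of the board the king square lies in,
   then do the bit test on a 32-bit word only. */
static int king_square_set(uint64_t bb, int king_sq)
{
    if (king_sq >= 32) {
        uint32_t hi = (uint32_t)(bb >> 32);
        return (hi >> (king_sq - 32)) & 1;
    } else {
        uint32_t lo = (uint32_t)bb;
        return (lo >> king_sq) & 1;
    }
}
```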
About the operations in parallel: maybe it is even worse for the 32-bit version
for sub and add. I think add/adc pairs don't pair well, because the adc needs
the carry bit already calculated by the add. But OR and AND operations are
probably much more dominant in bitboard programs than add and sub.
>>
>>4. Bitboard operations that require 64 bit shifts or rotations will speed up by
>>(roughly) a factor of four.
>
>On x86-64 that may be true for variable shifts. Constant 64-shifts for instance
>to get pawn attacks, are not that bad with two (dependent) x86-32 instructions.
Variable shifts will need some sort of "if", which might hurt. But the "double
precision shift" opcode comes in handy for 64-bit shifts.
Rotations cannot be done efficiently in standard C in either case (but the
compiler might have extensions to do it). When I tried, compilers could not
optimize the typical C statements into rot instructions.
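The "typical C statements" are presumably the shift-and-or idiom below; whether a given compiler turns it into a single rol instruction is exactly the question raised above:

```c
#include <stdint.h>

/* Portable rotate-left idiom. The (32 - r) & 31 keeps the shift
   count in range, avoiding undefined behavior when r is 0. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31;
    return (x << r) | (x >> ((32 - r) & 31));
}
```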
>>
>>5. Arithmetic operations on 64 bit integers will likely see a speed doubling for
>>add/subtract and a quadrupling for multiply/divide.
>
>For 64-bit multiply/divide even more than doubling the speed, for add/sub most
>likely less than doubling (similar to constant shifts, a bit worse than two
>independent 32-bitwise operations).
In an engine one will rarely need multiply/divide (on bitboards). One example
would be (using modulo instead of div)
hash_index = hash_key % number_of_entries;
and this will be more than 4 times slower than a 32-bit modulo. But it is not
really needed: just use the % on a 32-bit part of the hash_key (the exact result
will not be the same, but it will work just as well). BTW: in this example
(assuming hash_key is 64 bit, number_of_entries and hash_index 32 bit), the x86
instruction div, which divides 64 bit by 32 bit, is not sufficient, because the
division will overflow in general (raising the same exception as a divide by
zero).
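The 32-bit-part variant looks like this (function name is mine; the post only gives the assignment):

```c
#include <stdint.h>

/* Take the modulo on the low 32 bits of the key only: the division
   stays 32-bit, and the index is distributed just as well. */
static uint32_t hash_to_index_mod(uint64_t hash_key, uint32_t number_of_entries)
{
    return (uint32_t)hash_key % number_of_entries;
}
```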
For multiplication, I cannot imagine a good example in a chess engine. My other
trick to calculate the hash index needs a 32 * 32 -> 64 bit multiplication:
hash_index = ((unsigned long)hash_key * (unsigned long long)number_of_entries) >> 32;
Compilers are able to generate just a normal "mul" instruction, and the result
is in edx. Careful (but almost obvious) casting is needed, however, to give the
compiler a hint that really a 32*32->64 bit multiplication is wanted.
For real 64-bit multiplications/divisions: how is the cycle count for these
instructions on AMD64? I would fear that they need quite a few more cycles than
their 32-bit counterparts.
Cheers,
Dieter
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.