Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: How to reverse a Reverse BitBoard!

Author: Matt Taylor

Date: 04:38:16 01/22/03

Go up one level in this thread


On January 22, 2003 at 07:00:30, Dann Corbit wrote:

>On January 22, 2003 at 06:47:09, Matt Taylor wrote:
>
>>On January 22, 2003 at 06:44:08, Dann Corbit wrote:
>>
>>>On January 22, 2003 at 06:09:22, Dann Corbit wrote:
>>>
>>>>On January 22, 2003 at 05:33:20, Matt Taylor wrote:
>>>>
>>>>>On January 22, 2003 at 04:27:16, Dann Corbit wrote:
>>>>>
>>>>>>On January 22, 2003 at 03:29:05, Matt Taylor wrote:
>>>>>>
>>>>>>>On January 21, 2003 at 17:03:33, Dann Corbit wrote:
>>>>>>>
>>>>>>>>On January 21, 2003 at 15:48:57, Sander de Zoete wrote:
>>>>>>>>
>>>>>>>>>The following instruction I found for the new architecture of Intel Chips
>>>>>>>>>
>>>>>>>>>BSWAP—Byte Swap
>>>>>>>>>
>>>>>>>>>Description
>>>>>>>>>Reverses the byte order of a 32-bit (destination) register:
>>>>>>>>>
>>>>>>>>>Operation
>>>>>>>>>TEMP ¬ DEST
>>>>>>>>>DEST(7..0) ¬ TEMP(31..24)
>>>>>>>>>DEST(15..8) ¬ TEMP(23..16)
>>>>>>>>>DEST(23..16) ¬ TEMP(15..8)
>>>>>>>>>DEST(31..24) ¬ TEMP(7..0)
>>>>>>>>>Flags Affected
>>>>>>>>>None.
>>>>>>>>>Opcode Instruction Description
>>>>>>>>>0F C8+ rd BSWAP r32 Reverses the byte order of a 32-bit register.
>>>>>>>>>
>>>>>>>>>It is only valid for 486 architecture.
>>>>>>>>>
>>>>>>>>>If this instruction can be used, it should be very easy to reverse Reverse
>>>>>>>>>BitBoards into Forward Boards again. Saving a lot of updating in make and unmake
>>>>>>>>>move and also be much faster using the bitboards for generating attack
>>>>>>>>>board for evaluation purposes, incheck checks etc.
>>>>>>>>>
>>>>>>>>>The thing left to do is to reverse the bits per byte.
>>>>>>>>>
>>>>>>>>>I use the following trick, but it doesn't seem to work voor the value 9. Let me
>>>>>>>>>show you what I mean: (and ofcourse, can you help me?)
>>>>>>>>>
>>>>>>>>>In C code, it looks like this:
>>>>>>>>>
>>>>>>>>>typedef unsigned __int64 BITBOARD;
>>>>>>>>>
>>>>>>>>>void ReverseBitsPerByte(BITBOARD bitboard)
>>>>>>>>>{
>>>>>>>>>//  1.  Load the constant, k = 5555555555555555h
>>>>>>>>>BITBOARD k = 0x5555555555555555;
>>>>>>>>>
>>>>>>>>>//  2.  x = [(x shl 1) and k] or [(x and k) shr 1] result is: EFCDAB8967452301
>>>>>>>>>bitboard = ((bitboard<<1) & k) | ((bitboard & k)>>1 );
>>>>>>>>>}
>>>>>>>>>
>>>>>>>>>Initial bitboard:
>>>>>>>>>11111110
>>>>>>>>>11011100
>>>>>>>>>10111010
>>>>>>>>>10011000
>>>>>>>>>01110110
>>>>>>>>>01010100
>>>>>>>>>00110010
>>>>>>>>>00010000
>>>>>>>>>
>>>>>>>>>Result of ReverseBitsPerByte(bitboard):
>>>>>>>>>01111111
>>>>>>>>>00111011
>>>>>>>>>01011101
>>>>>>>>>00011000<--what the ^*&* goes wrong here? Should be 00011001.
>>>>>>>>>01101110
>>>>>>>>>00101010
>>>>>>>>>01001100
>>>>>>>>>00001000
>>>>>>>>>
>>>>>>>>>Thanks for any help or suggestions.
>>>>>>>>
>>>>>>><snip code>
>>>>>>>
>>>>>>>These are all basically the same algorithm. It's going to take a while to
>>>>>>>rearrange everything when you do it bit-by-bit. Sander has the right idea by
>>>>>>>first swapping bytes and then trying to reverse the bits within the bytes (8*2
>>>>>>>ops + the swap as opposed to 64*2 ops).
>>>>>>
>>>>>>The algorithms don't perform identically.
>>>>>>They don't require 64 ops.
>>>>>>Once you get the "Sander" algorithm completed, compare the two (and also the
>>>>>>COBRA algorithm).
>>>>>
>>>>>Yes, breverse5 is a divide-and-conquer algorithm. The others all iterate 64
>>>>>times. Assuming 2 ops/iteration was a gross underestimate as they use heavy
>>>>>masking and shifting.
>>>>>
>>>>>2 bswaps (and probably a mov, too) will reverse the bytes; afterward, the bits
>>>>>can be reversed as well.
>>>>>
>>>>>Not sure when or if I can have a look at COBRA and think about the algorithm --
>>>>>I've got more than enough mandatory work right now.
>>>>
>>>>The COBRA algorithm is designed for big bit sequences.  I don't know how well it
>>>>adapts to 64 bits.  I have only just downloaded the paper.
>>>>
>>>>This algorithm:
>>>>
>>>>typedef unsigned long ling Bitboard;
>>>>
>>>>Bitboard        breverse5(Bitboard n)
>>>>{
>>>>    int             i = 64;
>>>>    Bitboard        m = -1;
>>>>    while (i /= 2)
>>>>        m ^= m << i, n = n >> i & m | (n & m) << i;
>>>>    return n;
>>>>}
>>>>
>>>>iterates 6 times.  I suspect it will be faster than something table driven.
>>>>Perhaps also faster than 8 mask and shift operations.
>>>>
>>>>I'll be surprised if anything can beat it by 20%.
>>>
>>>Here is the generated assembly language.  It would look a lot prettier on a 64
>>>bit chip, for sure.
>><snip>
>>
>>Yes. Note:
>>0006a e8 fc ff ff ff   call __allshl                          ;C:\tmp\brev.cpp
>>                       ^ Yuck.
>>
>>This is why parallel 32-bit w/bswap is much faster.
>>
>>If you want, I can time your algorithm. It will be quite nasty (because of
>>inefficient 64-bit shifts), I assure you. Each call is 8 clocks or more.
>
>This will be the rough alternative using the suggested algorithm:
>
>unsigned int
>reverse32(register unsigned int x)
>{
>        register unsigned int y = 0x55555555;
>        x = (((x >> 1) & y) | ((x & y) << 1));
>        y = 0x33333333;
>        x = (((x >> 2) & y) | ((x & y) << 2));
>        y = 0x0f0f0f0f;
>        x = (((x >> 4) & y) | ((x & y) << 4));
>        y = 0x00ff00ff;
>        x = (((x >> 8) & y) | ((x & y) << 8));
>        return((x >> 16) | (x << 16));
>}
>
>typedef union tag_faker {
>   unsigned int a[2];
>   unsigned _int64 b;
>} faker;
>
>unsigned __int64
>reverse64(faker x)
>{
>   unsigned int a = reverse32(x.a[0]);
>   x.a[0] = reverse32(x.a[1]);
>   x.a[1] = a;
>   return x.b;
>}
>
>When we get chips with real 64 bit CPUs, the speed will pick up a ton.

The above is basically what I coded. I was slightly wrong about bswap;
nonetheless, it is still able to do 2 iterations in 2 instructions and 1 cycle.

I am not so sure that a 64-bit machine word will really change the efficiency of
these sorts of algorithms. One of the advantages of 32-bitness here is your
reverse64 which uses two reverse32 calls. It is possible to schedule most of the
32-bit stuff to get the second one almost free. The swap itself is cheaper than
the equivalent swap x.a[1] and x.a[0]

Lots of stalls here, runs in 18 cycles (former iterations were faster):
uint64 my_rev(uint64 n)
{
	_asm
	{
		; Input edx:eax
		; Output edx:eax
		bswap	eax
		bswap	edx

		mov	ecx, eax
		mov	eax, edx
		mov	esi, 0x0F0F0F0F

		mov	edx, ecx
		shr	ecx, 4
		and	edx, esi
		and	ecx, esi
		shl	edx, 4
		or	edx, ecx

		mov	ebx, eax
		shr	eax, 4
		and	ebx, esi
		and	eax, esi
		shl	ebx, 4
		or	eax, ebx

		mov	esi, 0x33333333

		mov	ecx, edx
		shr	ecx, 2
		and	edx, esi
		and	ecx, esi
		lea	edx, [edx*4+ecx]

		mov	ebx, eax
		shr	eax, 2
		and	ebx, esi
		and	eax, esi
		lea	eax, [ebx*4+eax]

		mov	esi, 0x55555555

		mov	ecx, edx
		shr	ecx, 1
		and	edx, esi
		and	ecx, esi
		lea	edx, [edx*2+ecx]

		mov	ebx, eax
		shr	eax, 1
		and	ebx, esi
		and	eax, esi
		lea	eax, [ebx*2+eax]

		mov	DWORD PTR [n+4], eax
		mov	DWORD PTR [n], edx
	}

	return n;
}

Lots of weirdness, as usual. Instructions with no dependencies aren't executing
in the same cycle. I posted known-good, sub-optimal code because I don't have
time to meticulously scan this and fix it. This executed in 20 clocks; I've had
as low as 17, and I'm pretty sure it could be faster.

I may take another crack at this later.

-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.