Author: Matt Taylor
Date: 04:38:16 01/22/03
Go up one level in this thread
On January 22, 2003 at 07:00:30, Dann Corbit wrote: >On January 22, 2003 at 06:47:09, Matt Taylor wrote: > >>On January 22, 2003 at 06:44:08, Dann Corbit wrote: >> >>>On January 22, 2003 at 06:09:22, Dann Corbit wrote: >>> >>>>On January 22, 2003 at 05:33:20, Matt Taylor wrote: >>>> >>>>>On January 22, 2003 at 04:27:16, Dann Corbit wrote: >>>>> >>>>>>On January 22, 2003 at 03:29:05, Matt Taylor wrote: >>>>>> >>>>>>>On January 21, 2003 at 17:03:33, Dann Corbit wrote: >>>>>>> >>>>>>>>On January 21, 2003 at 15:48:57, Sander de Zoete wrote: >>>>>>>> >>>>>>>>>The following instruction I found for the new architecture of Intel Chips >>>>>>>>> >>>>>>>>>BSWAP—Byte Swap >>>>>>>>> >>>>>>>>>Description >>>>>>>>>Reverses the byte order of a 32-bit (destination) register: >>>>>>>>> >>>>>>>>>Operation >>>>>>>>>TEMP ¬ DEST >>>>>>>>>DEST(7..0) ¬ TEMP(31..24) >>>>>>>>>DEST(15..8) ¬ TEMP(23..16) >>>>>>>>>DEST(23..16) ¬ TEMP(15..8) >>>>>>>>>DEST(31..24) ¬ TEMP(7..0) >>>>>>>>>Flags Affected >>>>>>>>>None. >>>>>>>>>Opcode Instruction Description >>>>>>>>>0F C8+ rd BSWAP r32 Reverses the byte order of a 32-bit register. >>>>>>>>> >>>>>>>>>It is only valid for 486 architecture. >>>>>>>>> >>>>>>>>>If this instruction can be used, it should be very easy to reverse Reverse >>>>>>>>>BitBoards into Forward Boards again. Saving a lot of updating in make and unmake >>>>>>>>>move and also be much faster using the bitboards for generating attack >>>>>>>>>board for evaluation purposes, incheck checks etc. >>>>>>>>> >>>>>>>>>The thing left to do is to reverse the bits per byte. >>>>>>>>> >>>>>>>>>I use the following trick, but it doesn't seem to work voor the value 9. Let me >>>>>>>>>show you what I mean: (and ofcourse, can you help me?) >>>>>>>>> >>>>>>>>>In C code, it looks like this: >>>>>>>>> >>>>>>>>>typedef unsigned __int64 BITBOARD; >>>>>>>>> >>>>>>>>>void ReverseBitsPerByte(BITBOARD bitboard) >>>>>>>>>{ >>>>>>>>>// 1. Load the constant, k = 5555555555555555h >>>>>>>>>BITBOARD k = 0x5555555555555555; >>>>>>>>> >>>>>>>>>// 2. x = [(x shl 1) and k] or [(x and k) shr 1] result is: EFCDAB8967452301 >>>>>>>>>bitboard = ((bitboard<<1) & k) | ((bitboard & k)>>1 ); >>>>>>>>>} >>>>>>>>> >>>>>>>>>Initial bitboard: >>>>>>>>>11111110 >>>>>>>>>11011100 >>>>>>>>>10111010 >>>>>>>>>10011000 >>>>>>>>>01110110 >>>>>>>>>01010100 >>>>>>>>>00110010 >>>>>>>>>00010000 >>>>>>>>> >>>>>>>>>Result of ReverseBitsPerByte(bitboard): >>>>>>>>>01111111 >>>>>>>>>00111011 >>>>>>>>>01011101 >>>>>>>>>00011000<--what the ^*&* goes wrong here? Should be 00011001. >>>>>>>>>01101110 >>>>>>>>>00101010 >>>>>>>>>01001100 >>>>>>>>>00001000 >>>>>>>>> >>>>>>>>>Thanks for any help or suggestions. >>>>>>>> >>>>>>><snip code> >>>>>>> >>>>>>>These are all basically the same algorithm. It's going to take a while to >>>>>>>rearrange everything when you do it bit-by-bit. Sander has the right idea by >>>>>>>first swapping bytes and then trying to reverse the bits within the bytes (8*2 >>>>>>>ops + the swap as opposed to 64*2 ops). >>>>>> >>>>>>The algorithms don't perform identically. >>>>>>They don't require 64 ops. >>>>>>Once you get the "Sander" algorithm completed, compare the two (and also the >>>>>>COBRA algorithm). >>>>> >>>>>Yes, breverse5 is a divide-and-conquer algorithm. The others all iterate 64 >>>>>times. Assuming 2 ops/iteration was a gross underestimate as they use heavy >>>>>masking and shifting. >>>>> >>>>>2 bswaps (and probably a mov, too) will reverse the bytes; afterward, the bits >>>>>can be reversed as well. >>>>> >>>>>Not sure when or if I can have a look at COBRA and think about the algorithm -- >>>>>I've got more than enough mandatory work right now. >>>> >>>>The COBRA algorithm is designed for big bit sequences. I don't know how well it >>>>adapts to 64 bits. I have only just downloaded the paper. >>>> >>>>This algorithm: >>>> >>>>typedef unsigned long ling Bitboard; >>>> >>>>Bitboard breverse5(Bitboard n) >>>>{ >>>> int i = 64; >>>> Bitboard m = -1; >>>> while (i /= 2) >>>> m ^= m << i, n = n >> i & m | (n & m) << i; >>>> return n; >>>>} >>>> >>>>iterates 6 times. I suspect it will be faster than something table driven. >>>>Perhaps also faster than 8 mask and shift operations. >>>> >>>>I'll be surprised if anything can beat it by 20%. >>> >>>Here is the generated assembly language. It would look a lot prettier on a 64 >>>bit chip, for sure. >><snip> >> >>Yes. Note: >>0006a e8 fc ff ff ff call __allshl ;C:\tmp\brev.cpp >> ^ Yuck. >> >>This is why parallel 32-bit w/bswap is much faster. >> >>If you want, I can time your algorithm. It will be quite nasty (because of >>inefficient 64-bit shifts), I assure you. Each call is 8 clocks or more. > >This will be the rough alternative using the suggested algorithm: > >unsigned int >reverse32(register unsigned int x) >{ > register unsigned int y = 0x55555555; > x = (((x >> 1) & y) | ((x & y) << 1)); > y = 0x33333333; > x = (((x >> 2) & y) | ((x & y) << 2)); > y = 0x0f0f0f0f; > x = (((x >> 4) & y) | ((x & y) << 4)); > y = 0x00ff00ff; > x = (((x >> 8) & y) | ((x & y) << 8)); > return((x >> 16) | (x << 16)); >} > >typedef union tag_faker { > unsigned int a[2]; > unsigned _int64 b; >} faker; > >unsigned __int64 >reverse64(faker x) >{ > unsigned int a = reverse32(x.a[0]); > x.a[0] = reverse32(x.a[1]); > x.a[1] = a; > return x.b; >} > >When we get chips with real 64 bit CPUs, the speed will pick up a ton. The above is basically what I coded. I was slightly wrong about bswap; nonetheless, it is still able to do 2 iterations in 2 instructions and 1 cycle. I am not so sure that a 64-bit machine word will really change the efficiency of these sorts of algorithms. One of the advantages of 32-bitness here is your reverse64 which uses two reverse32 calls. It is possible to schedule most of the 32-bit stuff to get the second one almost free. The swap itself is cheaper than the equivalent swap x.a[1] and x.a[0] Lots of stalls here, runs in 18 cycles (former iterations were faster): uint64 my_rev(uint64 n) { _asm { ; Input edx:eax ; Output edx:eax bswap eax bswap edx mov ecx, eax mov eax, edx mov esi, 0x0F0F0F0F mov edx, ecx shr ecx, 4 and edx, esi and ecx, esi shl edx, 4 or edx, ecx mov ebx, eax shr eax, 4 and ebx, esi and eax, esi shl ebx, 4 or eax, ebx mov esi, 0x33333333 mov ecx, edx shr ecx, 2 and edx, esi and ecx, esi lea edx, [edx*4+ecx] mov ebx, eax shr eax, 2 and ebx, esi and eax, esi lea eax, [ebx*4+eax] mov esi, 0x55555555 mov ecx, edx shr ecx, 1 and edx, esi and ecx, esi lea edx, [edx*2+ecx] mov ebx, eax shr eax, 1 and ebx, esi and eax, esi lea eax, [ebx*2+eax] mov DWORD PTR [n+4], eax mov DWORD PTR [n], edx } return n; } Lots of weirdness, as usual. Instructions with no dependencies aren't executing in the same cycle. I posted known-good, sub-optimal code because I don't have time to meticulously scan this and fix it. This executed in 20 clocks; I've had as low as 17, and I'm pretty sure it could be faster. I may take another crack at this later. -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.