Computer Chess Club Archives



Subject: Re: Reverse Bitboards

Author: Matt Taylor

Date: 23:24:42 01/14/03



On January 14, 2003 at 14:54:13, Gerd Isenberg wrote:

>On January 14, 2003 at 12:52:58, Matt Taylor wrote:
>
>>On January 13, 2003 at 20:17:36, Russell Reagan wrote:
>>
>>>On January 13, 2003 at 18:30:05, Matt Taylor wrote:
>>>
>>>>I think the real bottleneck would be the misjudgement of the speed of MMX. It is
>>>>not as fast to respond as the integer units, though it maintains similar
>>>>throughput. Using MMX for 64-bit arithmetic is not worthwhile as the same
>>>>operations are available from the integer unit with lower setup costs. The only
>>>>advantages include a minor gain in parallelism in hand-tweaked code and
>>>>additional register space.
>>>
>>>Apparently if you use MMX correctly, it will be significantly faster than the
>>>corresponding routine written in C (if it relies on 64-bit operations). The
>>>primary example that comes to mind is that Gerd uses MMX in IsiChess to do
>>>64-bit operations in the KoggeStone algorithms. He said it gave him a small
>>>speed increase. Compare that with the same routines written in C, and the C
>>>routines will be significantly slower. I know this because I wrote a program
>>>using those routines in C and it got about 70 knps (compare with Crafty
>>>300-500knps), and all it did was alpha-beta, material + mobility eval, and
>>>nothing else. I tried several bitboard implementations, and the common factor in
>>>the slow ones was the C KoggeStone attack generation. Gerd didn't experience
>>>such a significant speed hit when he used his MMX routines. So it does appear
>>>that there is a misjudgement of the speed of using MMX, but I'm not sure whether
>>>it is an underestimation or overestimation.
>>
>>MMX is probably faster than straight C in some cases, but if you write the
>>64-bit stuff in assembly using the main integer instructions, it will almost
>>always be faster.
>
>Hi Matt,
>
>No - not on 32 bit hardware, or do you already mention Athlon64?
>Fill-Stuff like KoggeStone and dumb7fill is one of these cases. It gains a lot
>from available 64-bit or better 128-bit registers and doing parallel fills in
>several directions.
>
>One may use MMX- or Hammer's GP-registers for propagator computation (based on
>empty squares) and SSE2 for a doubled generator, e.g. for black/white or to get
>two disjoint attack sets for sliders simultaneously.

No, I assume 32-bit. You can emulate most 64-bit ops very quickly:

bb + x
    add    eax, x.low
    adc    edx, x.high

bb - x
    sub    eax, x.low
    sbb    edx, x.high

bb >>= 1
    shr    edx, 1
    rcr    eax, 1

bb <<= 1
    add    eax, eax
    adc    edx, edx

bb >>= x            ; for constant 0 < x < 32
    mov    ecx, edx
    shr    eax, x
    shl    ecx, 32 - x
    shr    edx, x
    or     eax, ecx

bb <<= x            ; for constant 0 < x < 32
    mov    ecx, eax
    shl    edx, x
    shr    ecx, 32 - x
    shl    eax, x
    or     edx, ecx

-bb
    not    edx
    neg    eax
    sbb    edx, -1

This brings up an interesting point, however. The integer/MMX units can -also-
work in parallel.

>>The latency of an ALU instruction
>>(bitwise/arithmetic/conditional) is 1, and it has been ever since the 486. The
>>latency for similar arithmetic MMX instructions on my Athlon is 2 clocks, and on
>>a Pentium 4 it is 2 or worse. On the same processors, you can do 64-bit
>>operations usually in 1 clock.
>
>Yes, but my current getQueenAttacks, based on KoggeStone, dumb7fill and x^x-2
>has an effective instruction latency of 0.5 cycles (up to four MMX-instruction
>per time!).

My biggest worry was the latency, but it seems you got around that. I'll have to
go back and check my data, but I only saw Athlon peaking with a throughput of 2
arithmetic MMX instructions/cycle. It has the same throughput on SSE, meaning
that SSE is twice as fast assuming you can fill the upper 8 bytes and avoid the
latency.

By the way, movd is an awful instruction. This was the other thought I had when
I said MMX wasn't so hot. You're going to eat 10 cycles for the 2 movd
instructions at the end of your function. I encountered similar issues when
working on the bit scanners. The movd means integer is always going to beat
relatively short MMX routines.

>>The only advantage to MMX is the extra registers you now have access to, but in
>>my experiences code rarely saturates more than one of the 3 instruction sets
>>(integer, FP, vector). Furthermore, movement of data between MMX registers and
>>integers is horrifically slow, and if you mix with floating-point, you have to
>>execute another slow instruction -- emms.
>>
>
>Yes, but chess engines don't use floating point instructions so much, I guess.
>The vector-path movd is really slow; I hope this gets better with the Hammer.
>Currently I use a final movq via aligned memory.

They don't, but you never know when a library function decides to use
floating-point for something. Maybe I'm just paranoid, but I always assume it.

>>I think greater performance can be achieved in hand-tweaked, purely-integer
>>assembly. Unfortunately I do not have time right now to prove that theory, but
>>if I ever get a chance, I will be sure to post some code.
>>
>>-Matt
>
>Ok, try this one. My first, not optimal MMX-approach:

I'll try to have a look at it this weekend. I'll see about tweaking your MMX
routine, too.

MMX does hold one card though that seems to be mocking me. MMX has 8 64-bit
registers. Integer has 8 32-bit (or 4 64-bit) registers. Hmm...I really hate
doing this because it's -ugly-, but I think I'm going to use the stack pointer
as an additional GPR. I will need it.

-Matt




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.