Author: Matt Taylor
Date: 23:24:42 01/14/03
On January 14, 2003 at 14:54:13, Gerd Isenberg wrote:
>On January 14, 2003 at 12:52:58, Matt Taylor wrote:
>
>>On January 13, 2003 at 20:17:36, Russell Reagan wrote:
>>
>>>On January 13, 2003 at 18:30:05, Matt Taylor wrote:
>>>
>>>>I think the real bottleneck would be the misjudgement of the speed of MMX. It is
>>>>not as fast to respond as the integer units, though it maintains similar
>>>>throughput. Using MMX for 64-bit arithmetic is not worthwhile as the same
>>>>operations are available from the integer unit with lower setup costs. The only
>>>>advantages include a minor gain in parallelism in hand-tweaked code and
>>>>additional register space.
>>>
>>>Apparently if you use MMX correctly, it will be significantly faster than the
>>>corresponding routine written in C (if it relies on 64-bit operations). The
>>>primary example that comes to mind is that Gerd uses MMX in IsiChess to do
>>>64-bit operations in the KoggeStone algorithms. He said it gave him a small
>>>speed increase. Compare that with the same routines written in C, and the C
>>>routines will be significantly slower. I know this because I wrote a program
>>>using those routines in C and it got about 70 knps (compare with Crafty
>>>300-500 knps), and all it did was alpha-beta, material + mobility eval, and
>>>nothing else. I tried several bitboard implementations, and the common factor in
>>>the slow ones was the C KoggeStone attack generation. Gerd didn't experience
>>>such a significant speed hit when he used his MMX routines. So it does appear
>>>that there is a misjudgement of the speed of using MMX, but I'm not sure whether
>>>it is an underestimation or overestimation.
>>
>>MMX is probably faster than straight C in some cases, but if you write the
>>64-bit stuff in assembly using the main integer instructions, it will almost
>>always be faster.
>
>Hi Matt,
>
>No - not on 32 bit hardware, or do you already mention Athlon64?
>Fill-Stuff like KoggeStone and dumb7fill is one of these cases. It gains a lot
>from available 64-bit or better 128-bit registers and doing parallel fills in
>several directions.
>
>One may use MMX or Hammer's GP registers for propagator computation (based on
>empty squares) and SSE2 for a doubled generator, e.g. for black/white or to get
>two disjoint attack sets for sliders simultaneously.
No, I assume 32-bit. You can emulate most 64-bit ops very quickly
(bb held in edx:eax, high:low; the variable shifts assume 1 <= x <= 31):

bb + x:
    add eax, x.low
    adc edx, x.high

bb - x:
    sub eax, x.low
    sbb edx, x.high

bb >>= 1:
    shr edx, 1
    rcr eax, 1

bb <<= 1:
    add eax, eax
    adc edx, edx

bb >>= x:
    mov ecx, edx
    shr eax, x
    shl ecx, 32 - x
    shr edx, x
    or  eax, ecx

bb <<= x:
    mov ecx, eax
    shl edx, x
    shr ecx, 32 - x
    shl eax, x
    or  edx, ecx

-bb:
    not edx
    neg eax
    sbb edx, -1
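For anyone following along in C rather than assembly, the same half-register emulation can be sketched portably. This is a minimal illustration of the idea (the BB32 struct and function names are mine, not from the thread), and the variable shifts, like the register sequences above, are only valid for counts 1 through 31:

```c
#include <stdint.h>

/* 64-bit bitboard held as two 32-bit halves, mirroring the edx:eax pair. */
typedef struct { uint32_t lo, hi; } BB32;

/* bb >> x for 1 <= x <= 31: shift both halves, then carry the bits
   falling out of the high half into the top of the low half. */
BB32 shr64(BB32 bb, unsigned x) {
    BB32 r;
    r.lo = (bb.lo >> x) | (bb.hi << (32 - x));
    r.hi = bb.hi >> x;
    return r;
}

/* bb << x for 1 <= x <= 31, the mirror image of shr64. */
BB32 shl64(BB32 bb, unsigned x) {
    BB32 r;
    r.hi = (bb.hi << x) | (bb.lo >> (32 - x));
    r.lo = bb.lo << x;
    return r;
}

/* -bb: two's-complement negate across both halves, matching the
   not/neg/sbb sequence (high half gets +1 only if the low half was 0). */
BB32 neg64(BB32 bb) {
    BB32 r;
    r.lo = 0u - bb.lo;
    r.hi = ~bb.hi + (bb.lo == 0 ? 1u : 0u);
    return r;
}
```

A decent 32-bit compiler will often lower exactly these expressions back to the shr/shl/or and not/neg/sbb sequences shown above.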
This brings up an interesting point, however. The integer/MMX units can -also-
work in parallel.
>>The latency of an ALU instruction
>>(bitwise/arithmetic/conditional) is 1, and it has been ever since the 486. The
>>latency for similar arithmetic MMX instructions on my Athlon is 2 clocks, and on
>>a Pentium 4 it is 2 or worse. On the same processors, you can do 64-bit
>>operations usually in 1 clock.
>
>Yes, but my current getQueenAttacks, based on KoggeStone, dumb7fill and x^x-2
>has an effective instruction latency of 0.5 cycles (up to four MMX-instruction
>per time!).
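For anyone who hasn't seen it, the KoggeStone/dumb7fill technique being discussed can be sketched in C for a single direction. This is a minimal south-fill illustration (function names are mine, not IsiChess code); a full slider generator repeats the same pattern for all eight directions:

```c
#include <stdint.h>

typedef uint64_t U64;

/* Kogge-Stone occluded fill, south direction: propagates each
   generator bit down its file through empty squares in three
   doubling steps (1, 2, then 4 ranks at a time). */
U64 southOccludedFill(U64 gen, U64 empty) {
    gen   |= empty & (gen >> 8);
    empty &= empty >> 8;
    gen   |= empty & (gen >> 16);
    empty &= empty >> 16;
    gen   |= empty & (gen >> 32);
    return gen;
}

/* South attacks: the fill shifted one more rank down, so the first
   blocker square is included and the slider's own square is not. */
U64 southAttacks(U64 sliders, U64 empty) {
    return southOccludedFill(sliders, empty) >> 8;
}
```

South shifts cannot wrap across files, so no edge mask is needed here; the east/west and diagonal directions additionally AND with a file mask each step.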
My biggest worry was the latency, but it seems you got around that. I'll have to
go back and check my data, but I only saw the Athlon peaking at a throughput of 2
arithmetic MMX instructions/cycle. It has the same throughput on SSE, meaning
SSE is twice as fast assuming you can fill the upper 8 bytes and avoid the
latency.
By the way, movd is an awful instruction. This was the other thought I had when
I said MMX wasn't so hot. You're going to eat 10 cycles for the 2 movd
instructions at the end of your function. I encountered similar issues when
working on the bit scanners. The movd means integer is always going to beat
relatively short MMX routines.
>>The only advantage to MMX is the extra registers you now have access to, but in
>>my experiences code rarely saturates more than one of the 3 instruction sets
>>(integer, FP, vector). Furthermore, movement of data between MMX registers and
>>integers is horrifically slow, and if you mix with floating-point, you have to
>>execute another slow instruction -- emms.
>>
>
>Yes, but chess engines don't use floating point instructions so much, I guess.
>The vector-path movd is really slow; I hope this gets better with Hammer.
>Currently I use a final movq via aligned memory.
They don't, but you never know when a library function decides to use
floating-point for something. Maybe I'm just paranoid, but I always assume it.
>>I think greater performance can be achieved in hand-tweaked, purely-integer
>>assembly. Unfortunately I do not have time right now to prove that theory, but
>>if I ever get a chance, I will be sure to post some code.
>>
>>-Matt
>
>Ok, try this one. My first, not optimal MMX-approach:
I'll try to have a look at it this weekend. I'll see about tweaking your MMX
routine, too.
MMX does hold one card though that seems to be mocking me. MMX has 8 64-bit
registers. Integer has 8 32-bit (or 4 64-bit) registers. Hmm...I really hate
doing this because it's -ugly-, but I think I'm going to use the stack pointer
as an additional GPR. I will need it.
-Matt
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.