Author: Matt Taylor
Date: 23:24:42 01/14/03
On January 14, 2003 at 14:54:13, Gerd Isenberg wrote:

>On January 14, 2003 at 12:52:58, Matt Taylor wrote:
>
>>On January 13, 2003 at 20:17:36, Russell Reagan wrote:
>>
>>>On January 13, 2003 at 18:30:05, Matt Taylor wrote:
>>>
>>>>I think the real bottleneck would be the misjudgement of the speed of MMX.
>>>>It is not as fast to respond as the integer units, though it maintains
>>>>similar throughput. Using MMX for 64-bit arithmetic is not worthwhile, as
>>>>the same operations are available from the integer unit with lower setup
>>>>costs. The only advantages are a minor gain in parallelism in hand-tweaked
>>>>code and additional register space.
>>>
>>>Apparently if you use MMX correctly, it will be significantly faster than
>>>the corresponding routine written in C (if it relies on 64-bit operations).
>>>The primary example that comes to mind is that Gerd uses MMX in IsiChess to
>>>do 64-bit operations in the KoggeStone algorithms. He said it gave him a
>>>small speed increase. Compare that with the same routines written in C, and
>>>the C routines will be significantly slower. I know this because I wrote a
>>>program using those routines in C and it got about 70 knps (compare with
>>>Crafty at 300-500 knps), and all it did was alpha-beta, material + mobility
>>>eval, and nothing else. I tried several bitboard implementations, and the
>>>common factor in the slow ones was the C KoggeStone attack generation. Gerd
>>>didn't experience such a significant speed hit when he used his MMX
>>>routines. So it does appear that there is a misjudgement of the speed of
>>>using MMX, but I'm not sure whether it is an underestimation or an
>>>overestimation.
>>
>>MMX is probably faster than straight C in some cases, but if you write the
>>64-bit stuff in assembly using the main integer instructions, it will almost
>>always be faster.
>
>Hi Matt,
>
>No - not on 32-bit hardware, or do you already mean Athlon64?
>Fill stuff like KoggeStone and dumb7fill is one of these cases.
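For readers who haven't seen it, the KoggeStone fill Russell and Gerd are discussing can be sketched in portable C roughly like this (the standard occluded-fill idiom for one direction; function names are mine, not IsiChess's actual code):

```c
#include <stdint.h>

typedef uint64_t U64;

/* Kogge-Stone occluded fill northward: flood the slider set `gen`
 * through the empty squares `pro` (the propagator), doubling the
 * shift distance each step -- 3 steps instead of 7 single shifts. */
U64 northOccludedFill(U64 gen, U64 pro)
{
    gen |= pro & (gen <<  8);
    pro &=       (pro <<  8);
    gen |= pro & (gen << 16);
    pro &=       (pro << 16);
    gen |= pro & (gen << 32);
    return gen;
}

/* North attacks: shift the fill one more rank so the first blocker
 * (a capturable piece) is included in the attack set. */
U64 northAttacks(U64 sliders, U64 empty)
{
    return northOccludedFill(sliders, empty) << 8;
}
```

On an empty board, a rook on a1 (bit 0) yields the whole a-file above it; the same pattern repeats for the other seven directions, which is where the parallel MMX/SSE2 versions pay off.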
>It gains a lot from available 64-bit or, better, 128-bit registers and from
>doing parallel fills in several directions.
>
>One may use MMX or Hammer's GP registers for propagator computation (based on
>empty squares) and SSE2 for a doubled generator, e.g. for black/white or to
>get two disjoint attack sets for sliders simultaneously.

No, I assume 32-bit. You can emulate most 64-bit ops very quickly:

bb + x:
    add eax, x.low
    adc edx, x.high

bb - x:
    sub eax, x.low
    sbb edx, x.high

bb >>= 1:
    shr edx, 1
    rcr eax, 1

bb <<= 1:
    add eax, eax
    adc edx, edx

bb >>= x:
    mov ecx, edx
    shr eax, x
    shl ecx, 32 - x
    shr edx, x
    or  eax, ecx

bb <<= x:
    mov ecx, eax
    shl edx, x
    shr ecx, 32 - x
    shl eax, x
    or  edx, ecx

-bb:
    not edx
    neg eax
    sbb edx, -1

This brings up an interesting point, however. The integer/MMX units can -also-
work in parallel.

>>The latency of an ALU instruction (bitwise/arithmetic/conditional) is 1, and
>>it has been ever since the 486. The latency for similar arithmetic MMX
>>instructions on my Athlon is 2 clocks, and on a Pentium 4 it is 2 or worse.
>>On the same processors, you can usually do 64-bit operations in 1 clock.
>
>Yes, but my current getQueenAttacks, based on KoggeStone, dumb7fill and
>x^x-2, has an effective instruction latency of 0.5 cycles (up to four MMX
>instructions at a time!).

My biggest worry was the latency, but it seems you got around that. I'll have
to go back and check my data, but I only saw the Athlon peaking at a
throughput of 2 arithmetic MMX instructions/cycle. It has the same throughput
on SSE, meaning that SSE = twice as fast, assuming you can fill the upper 8
bytes and avoid latency.

By the way, movd is an awful instruction. This was the other thought I had
when I said MMX wasn't so hot. You're going to eat 10 cycles for the 2 movd
instructions at the end of your function. I encountered similar issues when
working on the bit scanners. The movd cost means integer is always going to
beat relatively short MMX routines.
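For anyone following along in C rather than assembly, the register-pair idiom in the listing above can be sketched portably like this (a hypothetical pair64 type, purely illustrative; in the real code the halves live in eax/edx and the carries come for free from the flags):

```c
#include <stdint.h>

/* A 64-bit bitboard split across two 32-bit words, mirroring the
 * IA-32 register pair: lo plays eax, hi plays edx. */
typedef struct { uint32_t lo, hi; } pair64;

/* bb + x == add/adc: the carry out of the low word feeds the high word. */
pair64 add64(pair64 a, pair64 b)
{
    pair64 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry flag */
    return r;
}

/* bb <<= x for 0 < x < 32, the mov/shl/shr/or sequence: bits shifted
 * out of the low word are or-ed into the high word. */
pair64 shl64(pair64 a, unsigned x)
{
    pair64 r;
    r.hi = (a.hi << x) | (a.lo >> (32 - x));
    r.lo =  a.lo << x;
    return r;
}

/* -bb == not/neg/sbb: two's-complement negate with the borrow
 * propagated into the complemented high word. */
pair64 neg64(pair64 a)
{
    pair64 r;
    r.lo = 0u - a.lo;
    r.hi = ~a.hi + (a.lo == 0);           /* +1 only when no borrow */
    return r;
}
```

The compiler-visible version makes the cost explicit: each 64-bit op is two or three 32-bit ops plus carry plumbing, which is exactly why 32-bit C bitboard code falls behind both hand-written integer assembly and MMX.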
>>The only advantage to MMX is the extra registers you now have access to, but
>>in my experience code rarely saturates more than one of the 3 instruction
>>sets (integer, FP, vector). Furthermore, movement of data between MMX
>>registers and integer registers is horrifically slow, and if you mix with
>>floating-point, you have to execute another slow instruction -- emms.
>
>Yes, but chess engines don't use floating-point instructions so much, I
>guess. The vector-path movd is really slow; I hope this becomes better with
>Hammer. Currently I use a final movq via aligned memory.

They don't, but you never know when a library function decides to use
floating-point for something. Maybe I'm just paranoid, but I always assume it.

>>I think greater performance can be achieved in hand-tweaked, purely-integer
>>assembly. Unfortunately I do not have time right now to prove that theory,
>>but if I ever get a chance, I will be sure to post some code.
>>
>>-Matt
>
>Ok, try this one. My first, not optimal MMX-approach:

I'll try to have a look at it this weekend. I'll see about tweaking your MMX
routine, too.

MMX does hold one card, though, that seems to be mocking me. MMX has 8 64-bit
registers. Integer has 8 32-bit (or 4 64-bit) registers. Hmm... I really hate
doing this because it's -ugly-, but I think I'm going to use the stack pointer
as an additional GPR. I will need it.

-Matt
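Incidentally, the "x^x-2" Gerd mentions above is the subtraction trick for positive-ray slider attacks: with the occupancy masked to one line and a single slider bit r, the borrow from o - 2r flips exactly the squares from just above the slider up to and including the first blocker. A rough portable sketch (names are mine, not Gerd's):

```c
#include <stdint.h>

typedef uint64_t U64;

/* Attacks along one ray toward higher bits via o ^ (o - 2r):
 * subtracting 2*slider borrows through the empty squares above the
 * slider, flipping every bit up to and including the first blocker;
 * xor-ing with the original occupancy recovers exactly those bits.
 * occ is masked to the line (rank/file/diagonal) before and after. */
U64 positiveRayAttacks(U64 occ, U64 lineMask, U64 slider)
{
    U64 o = occ & lineMask;
    return (o ^ (o - 2 * slider)) & lineMask;
}
```

On rank 1 with a slider on c1 and a blocker on f1 this yields d1, e1, f1; with no blocker the subtraction wraps and the ray runs to the board edge. The negative rays need a separate pass (e.g. via bit reversal or the KoggeStone fills above), which is part of why Gerd combines it with dumb7fill.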
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.