Author: Gerd Isenberg
Date: 16:36:31 09/04/03
On September 04, 2003 at 19:04:03, Dezhi Zhao wrote:

>On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:
>
>>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>>
>>>Yes, SSE could beat the regular by a small margin.
>>
>>Great - but try it in a real chess program.
>>
>>In these loop tests some internal unrolling may occur, with "renamed"
>>register sets. I guess the loop body including the called SSE routine fits
>>in the P4's trace cache, and two (or more) bodies are executed
>>simultaneously.
>>
>>With the gp-register approach this is only partially possible, because the
>>registers are changed several times inside the body.
>
>I only did an easy timing here, as you see. You could do accurate timing
>with the processor counters, cpuid and other instructions to isolate other
>factors. However, I do not think accurate timing would change the results.
>
>I think this test simulates a real chess program quite well. Only
>non-captures are tested here.

No - you don't have such tight hash-update loops in a chess program; there
are other instructions around.

>In a real program, you have to handle captures, which need more xor
>operations and cause more register pressure. Therefore, SSE will help even
>more in a real one.

Yes, if you do even more than only the hash update, it becomes interesting
in the context of the copy-make thread here, possibly using all 128 bits.
That's the reason I will try to use them for Kogge-Stone fill attack
generation on Opteron, doing SIMD with two sets in parallel.

>
>>
>>Anyway, not that bad for the P4, considering only 64 bits of 128 are used
>>per XMM register. What about MMX on the P4, and what about the SSE2
>>integer instructions movdqa, movdqu, movd and pxor? OK, one byte more
>>opcode, but shorter latencies (at least on Opteron for movdqa and pxor).
>>I do not have P4 SSE-instruction latencies, but on Opteron an unaligned
>>move is a killer because it is a vector-path instruction - movups as well
>>as movdqu.
>
>I'm not interested in MMX because the emms overhead is quite large.
Just tried SSE2. Do you use floats or doubles? If not, ignore emms. Simply
use movd for all 8-byte aligned moves, and pxor. I'm interested in whether
MMX performs faster due to 64-bit rather than 128-bit instructions on the P4.

>Here are the results of 4 runs:
>
>#1
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 23s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
>
>#2
>old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 23s
>
>#3
>old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
>
>#4
>old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
>
>As you see, I need a better gauge to tell the difference between SSE and
>SSE2.
>
>__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
>{
>    __asm
>    {
>        movzx  eax, cl                 // from
>        movzx  edx, ch                 // to
>        shr    ecx, 10                 // type * 64
>        and    ecx, ~63                // mask off square bits
>        movdqa xmm2, [old_key]         // old_key, 128 bits
>
>        add    eax, ecx                // type/from index
>        add    edx, ecx                // type/to index
>
>        movdqu xmm0, type_rnd[eax*8]   // from key, 128 bits
>        movdqu xmm1, type_rnd[edx*8]   // to key, 128 bits
>        pxor   xmm0, xmm2
>        pxor   xmm0, xmm1
>
>        movdqa [new_key], xmm0         // store 128 bits
>        ret
>    }
>}
>
>>
>>I'm interested in the assembler output of your C routine - a bit strange
>>that it performs so "badly".
>
>Not too bad for my VC6 with SP5 :) You may notice that I give the compiler
>some hints so that it does not generate too lousy instructions. Perhaps the
>new VC compiler could do better. It uses one more register and has other
>problems too. That is why I hand-wrote an asm version as a baseline. Here
>is the compiler output (options: max speed, PPro):

Thanks, I see - sometimes an additional register gains more parallelism ;-)
That seems not to be the case here.

>_TEXT SEGMENT
>?update_key_non_capture@@YIXH@Z PROC NEAR  ; update_key_non_capture, COMDAT
>; _move$ = ecx
>; Line 28
>    xor   edx, edx
>    mov   dl, ch
>    mov   eax, ecx
>    sar   eax, 16                ; 00000010H
>    movzx ecx, cl
>    shl   eax, 6
>    add   edx, eax
>    add   ecx, eax
>    mov   eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
>    mov   edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
>    push  esi
>    xor   eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
>    mov   esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
>    xor   eax, DWORD PTR ?old_key@@3_KA
>    mov   ecx, DWORD PTR ?old_key@@3_KA+4
>    xor   edx, esi
>    xor   edx, ecx
>    mov   DWORD PTR ?new_key@@3_KA, eax
>    mov   DWORD PTR ?new_key@@3_KA+4, edx
>    pop   esi
>; Line 29
>    ret   0
>?update_key_non_capture@@YIXH@Z ENDP       ; update_key_non_capture
>_TEXT ENDS
>
>
>>
>>Regards,
>>Gerd
>>
>>
>
><snip>
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.