Author: Dezhi Zhao
Date: 16:04:03 09/04/03
On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:

>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>
>>Yes, SSE could beat the regular by a small margin.
>
>Great - but try it in a real chess program.
>
>In these loop tests some internal unrolling may occur, with "renamed" register
>sets. I guess the loop-body including the called SSE-routine fits in P4's trace
>cache, and two (or more) bodies are executed simultaneously.
>
>With the gp-register approach this is only partially possible, due to registers
>are changed several times inside the body.

I only did a rough timing here, as you can see. You could get an accurate timing with the processor counters, using cpuid and other instructions to isolate other factors. However, I do not think a more accurate timing would change the results.

I think this test simulates a real chess program quite well. Only noncaptures are tested here. In a real program, you have to handle captures, which need more xor operations and create more register pressure. Therefore, SSE will help more in a real one.

>
>Anyway not that bad for P4, considering only 64-bits of 128 used per
>XMM-register. What about MMX on P4 and what about SSE2-integer instructions,
>movdqa, movdqu, movd and pxor? Ok one byte more opcode, but shorter latencies
>(at least on opteron for movdqa and pxor). I do not have P4 SSE-instruction
>latencies, but for opteron, mov unaligned is a killer due to vector path
>instruction, movups as well as movdqu.

I'm not interested in MMX because the emms overhead is quite large. I just tried SSE2.
Here are the results of 4 runs:

#1
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 23s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s

#2
old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 23s

#3
old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s

#4
old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s

As you can see, I need a better gauge to tell the difference between SSE and SSE2.

__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
{
	__asm
	{
		movzx	eax, cl			// from
		movzx	edx, ch			// to
		shr	ecx, 10			// type * 64
		and	ecx, ~63		// mask off
		movdqa	xmm2, [old_key]		// old_key 128
		add	eax, ecx		// type from index
		add	edx, ecx		// type to index
		movdqu	xmm0, type_rnd[eax*8]	// from 128
		movdqu	xmm1, type_rnd[edx*8]	// to 128
		pxor	xmm0, xmm2
		pxor	xmm0, xmm1
		movdqa	[new_key], xmm0		// store 128 (key is in the low 64 bits)
		ret
	}
}

>
>I'm interested in the assembler output of your C-routine - a bit strange that it
>performs so "badly".

Not too bad for my VC6 with sp5:) You may notice that I give the compiler some hints so that it does not generate too lousy instructions. Perhaps the new VC compiler could do better. It uses one more register and has other problems too. That is why I hand-wrote an asm version as a baseline.
Here is the compiler output (options: max speed, PPro):

_TEXT	SEGMENT
?update_key_non_capture@@YIXH@Z PROC NEAR	; update_key_non_capture, COMDAT
; _move$ = ecx
; Line 28
	xor	edx, edx
	mov	dl, ch
	mov	eax, ecx
	sar	eax, 16			; 00000010H
	movzx	ecx, cl
	shl	eax, 6
	add	edx, eax
	add	ecx, eax
	mov	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
	mov	edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
	push	esi
	xor	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
	mov	esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
	xor	eax, DWORD PTR ?old_key@@3_KA
	mov	ecx, DWORD PTR ?old_key@@3_KA+4
	xor	edx, esi
	xor	edx, ecx
	mov	DWORD PTR ?new_key@@3_KA, eax
	mov	DWORD PTR ?new_key@@3_KA+4, edx
	pop	esi
; Line 29
	ret	0
?update_key_non_capture@@YIXH@Z ENDP	; update_key_non_capture
_TEXT	ENDS

>
>Regards,
>Gerd
>
>

<snip>
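For anyone reading the listing without the C++ source at hand, here is a rough portable sketch of what the routine computes. It is a reconstruction from the compiler output, not the original code. The move layout (from in bits 0-7, to in bits 8-15, piece type from bit 16 up) and the table shape type_rnd[type][64] are inferred from the assembly and the mangled names. The number of types (kTypes) and the table filler init_type_rnd are purely hypothetical, since the post does not show how the table is generated.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one row of 64 square keys per piece type; kTypes is a made-up size.
constexpr int kTypes = 16;
std::uint64_t type_rnd[kTypes][64];
std::uint64_t old_key = 0;
std::uint64_t new_key = 0;

// Hypothetical filler (splitmix64 steps), used only so the sketch has data to work on.
void init_type_rnd(std::uint64_t seed) {
    std::uint64_t state = seed;
    for (int t = 0; t < kTypes; ++t) {
        for (int sq = 0; sq < 64; ++sq) {
            std::uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
            type_rnd[t][sq] = z ^ (z >> 31);
        }
    }
}

// Non-capture update: xor out the piece on "from", xor it in on "to".
// Assumed move layout: from = bits 0-7, to = bits 8-15, type = bits 16 and up.
void update_key_non_capture(int move) {
    const unsigned from = static_cast<unsigned>(move) & 0xFFu;
    const unsigned to = (static_cast<unsigned>(move) >> 8) & 0xFFu;
    const unsigned type = static_cast<unsigned>(move) >> 16;
    assert(from < 64 && to < 64 && type < static_cast<unsigned>(kTypes));
    new_key = old_key ^ type_rnd[type][from] ^ type_rnd[type][to];
}
```

Since xor is its own inverse, feeding new_key back in as old_key and applying the same move again gives back the starting key, which is a handy sanity check when comparing the c++, asm, sse and sse2 versions against each other.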
Last modified: Thu, 15 Apr 21 08:11:13 -0700