Author: Dezhi Zhao
Date: 16:04:03 09/04/03
On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:
>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>
>>Yes, SSE could beat the regular by a small margin.
>
>Great - but try it in a real chess program.
>
>In these loop tests some internal unrolling may occur, with "renamed" register
>sets. I guess the loop-body including the called SSE-routine fits in P4's trace
>cache, and two (or more) bodies are executed simultaneously.
>
>With the gp-register approach this is only partially possible, because registers
>are changed several times inside the body.
I only did a simple timing here, as you can see. You could do an accurate timing with
the processor counters, cpuid and other instructions to isolate other
factors. However, I do not think accurate timing would change the results.
I think this test simulates a real chess program quite well. Only noncaptures are
tested here. In a real program, you have to handle captures, which need more xor
operations and put more pressure on the registers. Therefore, SSE will help more
in a real one.
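
To make the "more xor operations" point concrete, here is a minimal scalar sketch of an incremental hash-key update for a noncapture and for a capture. It is not the code that was timed: the table name `zobrist`, its 16 x 64 size, the xorshift filler and the function names are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical table: one 64-bit random number per (piece type, square).
// The 16 x 64 shape is an assumption, not taken from the post.
static std::uint64_t zobrist[16][64];

// Fill the table with a simple xorshift64 generator (any good PRNG would do).
void init_zobrist(std::uint64_t seed = 0x9E3779B97F4A7C15ULL) {
    std::uint64_t s = seed;
    for (int t = 0; t < 16; ++t) {
        for (int sq = 0; sq < 64; ++sq) {
            s ^= s << 13;
            s ^= s >> 7;
            s ^= s << 17;
            zobrist[t][sq] = s;
        }
    }
}

// Noncapture: take the piece off 'from' and put it on 'to' (two table XORs).
std::uint64_t key_after_quiet(std::uint64_t key, int type, int from, int to) {
    assert(type >= 0 && type < 16);
    assert(from >= 0 && from < 64 && to >= 0 && to < 64);
    return key ^ zobrist[type][from] ^ zobrist[type][to];
}

// Capture: the same two XORs plus a third one that removes the captured piece.
std::uint64_t key_after_capture(std::uint64_t key, int type, int victim,
                                int from, int to) {
    assert(victim >= 0 && victim < 16);
    return key_after_quiet(key, type, from, to) ^ zobrist[victim][to];
}
```

The capture path needs one more table load and one more xor per 64-bit key, which on a 32-bit target means two more 32-bit loads and xors in the general-purpose version.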
>
>Anyway not that bad for P4, considering only 64-bits of 128 used per
>XMM-register. What about MMX on P4 and what about SSE2-integer instructions,
>movdqa, movdqu, movd and pxor? Ok one byte more opcode, but shorter latencies
>(at least on opteron for movdqa and pxor). I do not have P4 SSE-instruction
>latencies, but for opteron, mov unaligned is a killer due to vector path
>instruction, movups as well as movdqu.
I'm not interested in MMX because the emms overhead is quite large. I just tried SSE2.
Here are the results of 4 runs:
#1
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 23s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
#2
old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 23s
#3
old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
#4
old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
As you see, I need a better gauge to tell the difference between SSE and SSE2.
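
Since the numbers above are whole seconds, a 22s versus 23s difference is within the resolution of the clock. One possible finer gauge is sketched below. It uses the portable `std::chrono::steady_clock` instead of the processor counters and cpuid, and it keeps the fastest of several runs. The names `best_ns_per_call`, `xor_workload` and `sink` are made up for this sketch, and the workload is only a placeholder for the routine under test.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Volatile sink so the compiler cannot optimise the timed work away.
static volatile std::uint64_t sink = 0;

// Placeholder workload: one multiply and one xor per call.
void xor_workload(std::uint64_t i) {
    sink = sink ^ (i * 0x9E3779B97F4A7C15ULL);
}

// Calls fn 'iterations' times, repeats that 'runs' times, and returns the
// fastest run expressed in nanoseconds per call.
template <typename Fn>
double best_ns_per_call(Fn fn, std::uint64_t iterations, int runs) {
    assert(iterations > 0 && runs > 0);
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        for (std::uint64_t i = 0; i < iterations; ++i) {
            fn(i);
        }
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count()
                    / static_cast<double>(iterations);
        if (best < 0.0 || ns < best) {
            best = ns;
        }
    }
    return best;
}
```

Taking the minimum rather than the average is a design choice that tries to discard runs disturbed by other processes. It does not remove the loop overhead itself, so it is best used to compare two routines under the same harness.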
__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
{
    __asm
    {
        movzx   eax, cl                 // from
        movzx   edx, ch                 // to
        shr     ecx, 10                 // type * 64, plus leftover bits of 'to'
        and     ecx, ~63                // mask off the low 6 bits
        movdqa  xmm2, [old_key]         // old_key, 128-bit aligned load
        add     eax, ecx                // type from index
        add     edx, ecx                // type to index
        movdqu  xmm0, type_rnd[eax*8]   // from, 128-bit unaligned load
        movdqu  xmm1, type_rnd[edx*8]   // to, 128-bit unaligned load
        pxor    xmm0, xmm2
        pxor    xmm0, xmm1
        movdqa  [new_key], xmm0         // 128-bit store, key is in the low 64 bits
        ret
    }
}
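
Note that the routine above moves 128 bits with every movdqa/movdqu, including the final store to new_key, although only the low 64 bits are the key. As a rough sketch of an alternative, the same update can be written with SSE2 intrinsics that touch exactly 64 bits (`_mm_loadl_epi64` and `_mm_storel_epi64`, which map to movq). This was not timed in the runs above. The table `rnd_table`, its size, `fill_table` and the function names are hypothetical, and there is a scalar fallback for targets without SSE2.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical flat stand-in for the post's type_rnd table (16 types x 64 squares).
static const int kTableSize = 16 * 64;
static std::uint64_t rnd_table[kTableSize];

// Fill the table with arbitrary distinct values, only for demonstration.
void fill_table() {
    for (int i = 0; i < kTableSize; ++i) {
        rnd_table[i] = 0x9E3779B97F4A7C15ULL * static_cast<std::uint64_t>(i + 1);
    }
}

// Reference version using plain 64-bit integer xors.
std::uint64_t update_key_scalar(std::uint64_t old_key, int from_idx, int to_idx) {
    assert(from_idx >= 0 && from_idx < kTableSize);
    assert(to_idx >= 0 && to_idx < kTableSize);
    return old_key ^ rnd_table[from_idx] ^ rnd_table[to_idx];
}

// SSE2 version: three 64-bit loads, two pxor, one 64-bit store.
std::uint64_t update_key_sse2(std::uint64_t old_key, int from_idx, int to_idx) {
    assert(from_idx >= 0 && from_idx < kTableSize);
    assert(to_idx >= 0 && to_idx < kTableSize);
#if SKETCH_HAVE_SSE2
    // _mm_loadl_epi64 reads 8 bytes and zeroes the upper half of the register.
    __m128i k = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&old_key));
    __m128i a = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&rnd_table[from_idx]));
    __m128i b = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&rnd_table[to_idx]));
    k = _mm_xor_si128(_mm_xor_si128(k, a), b);
    std::uint64_t out;
    // _mm_storel_epi64 writes only the low 8 bytes.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&out), k);
    return out;
#else
    return update_key_scalar(old_key, from_idx, to_idx);
#endif
}
```

The 64-bit loads have no alignment requirement and never read past the last table entry, which a 128-bit load from the final element would do.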
>
>I'm interested in the assembler output of your C-routine - a bit strange that it
>performs so "badly".
Not too bad for my VC6 with SP5 :) You may notice that I give the compiler some
hints so that it does not generate too lousy instructions. Perhaps the new VC
compiler could do better. It uses one more register and has other problems too.
That is why I hand-wrote an asm version as a baseline. Here is the compiler
output (options: max speed, PPro):
_TEXT SEGMENT
?update_key_non_capture@@YIXH@Z PROC NEAR ; update_key_non_capture, COMDAT
; _move$ = ecx
; Line 28
xor edx, edx
mov dl, ch
mov eax, ecx
sar eax, 16 ; 00000010H
movzx ecx, cl
shl eax, 6
add edx, eax
add ecx, eax
mov eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
mov edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
push esi
xor eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
mov esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
xor eax, DWORD PTR ?old_key@@3_KA
mov ecx, DWORD PTR ?old_key@@3_KA+4
xor edx, esi
xor edx, ecx
mov DWORD PTR ?new_key@@3_KA, eax
mov DWORD PTR ?new_key@@3_KA+4, edx
pop esi
; Line 29
ret 0
?update_key_non_capture@@YIXH@Z ENDP ; update_key_non_capture
_TEXT ENDS
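
For readers comparing this listing with the hand-written routines, the index arithmetic can be sketched as below. The move layout (bits 0-7 from, bits 8-15 to, bits 16 and up piece type) is inferred from the listings and is an assumption, as are all the names. The compiler computes the type offset as (move >> 16) << 6, while the hand-written versions use (move >> 10) & ~63; for nonnegative moves the two give the same value.

```cpp
#include <cassert>
#include <cstdint>

// Assumed move layout, inferred from the listings (not confirmed by the author).
struct DecodedMove {
    std::uint32_t from;  // bits 0-7
    std::uint32_t to;    // bits 8-15
    std::uint32_t type;  // bits 16 and up
};

DecodedMove decode_move(std::uint32_t move) {
    DecodedMove d;
    d.from = move & 0xFFu;
    d.to = (move >> 8) & 0xFFu;
    d.type = move >> 16;
    return d;
}

// Compiler listing: shift the type down, then multiply by 64 (sar 16, shl 6).
std::uint32_t type_offset_compiler(std::uint32_t move) {
    return (move >> 16) << 6;
}

// Hand-written asm: one shift, then clear the six leftover bits (shr 10, and ~63).
std::uint32_t type_offset_handwritten(std::uint32_t move) {
    return (move >> 10) & ~static_cast<std::uint32_t>(63);
}

// Element index of the 'from' entry in a table laid out as [type][64].
std::uint32_t from_index(std::uint32_t move) {
    DecodedMove d = decode_move(move);
    assert(d.from < 64 && d.to < 64);
    return type_offset_compiler(move) + d.from;
}
```

For example, a move with type 3, from square 12 and to square 28 gives a type offset of 192 either way, so the 'from' entry is element 204 and sits at byte offset 204 * 8, matching the [index*8] addressing in both listings.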
>
>Regards,
>Gerd
>
>
<snip>
Last modified: Thu, 15 Apr 21 08:11:13 -0700