Author: Gerd Isenberg
Date: 16:36:31 09/04/03
Go up one level in this thread
On September 04, 2003 at 19:04:03, Dezhi Zhao wrote:
>On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:
>
>>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>>
>>>Yes, SSE could beat the regular by a small margin.
>>
>>Great - but try it in a real chess program.
>>
>>In these loop tests some internal unrolling may occur, with "renamed" register
>>sets. I guess the loop body, including the called SSE routine, fits in the P4's
>>trace cache, and two (or more) bodies are executed simultaneously.
>>
>>With the gp-register approach this is only partially possible, because the
>>registers are changed several times inside the body.
>
>I only did an easy timing here as you see. You can do an accurate timing with
>the processor counters and cpuid and other instructions to isolate other
>factors. However I do not think accurate timing could change the results.
>
>I think this test simulates a real chess program quite well. Only noncaptures
>are tested here.
No - you don't have such tight hash-update loops in a chess program; there are
other instructions in between.
>In a real program, you have to handle captures that need more xor
>operations and more register pressure. Therefore, SSE will help more in a real
>one.
Yes, if you do even more than only the hash update, it becomes interesting in
the context of the copy-make thread here, possibly using all 128 bits.
That's the reason I will try to use them for Kogge-Stone fill attack generation
on the Opteron, doing SIMD with two sets in parallel.
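For reference, a minimal scalar sketch of such a Kogge-Stone occluded fill (south direction) - the SIMD idea is to run two of these bitboard sets side by side in one 128-bit register. Names are mine for illustration, not from any particular program:

```c
#include <stdint.h>

/* Kogge-Stone occluded fill, south direction: slides a set of
 * generator pieces (e.g. rooks) down the board until blocked.
 * gen   - bitboard of sliding pieces
 * empty - bitboard of empty squares
 */
uint64_t south_occluded_fill(uint64_t gen, uint64_t empty)
{
    gen   |= empty & (gen >>  8);   /* propagate 1 square  */
    empty &= empty >> 8;
    gen   |= empty & (gen >> 16);   /* propagate 2 squares */
    empty &= empty >> 16;
    gen   |= empty & (gen >> 32);   /* propagate 4 squares */
    return gen;
}
```

With square 0 = a1, a rook on e4 (bit 28) on an otherwise empty board fills e4-e1 (0x10101010); a blocker on e2 stops the fill at e3.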
>
>>
>>Anyway, not that bad for the P4, considering only 64 of the 128 bits per
>>XMM-register are used. What about MMX on the P4, and what about SSE2 integer
>>instructions - movdqa, movdqu, movd and pxor? Ok, one byte more opcode, but
>>shorter latencies (at least on the Opteron for movdqa and pxor). I do not have
>>P4 SSE-instruction latencies, but on the Opteron the unaligned moves are
>>killers because they are vector-path instructions, movups as well as movdqu.
>
>I'm not interested in MMX because the emms overhead is quite large. Just tried SSE2.
Do you use floats or doubles? If not, you can ignore emms.
Simply use movq for all 8-byte aligned moves, and pxor.
I'm interested in whether MMX performs faster on the P4 due to 64-bit rather
than 128-bit instructions.
>Here is the results of 4 runs:
>
>#1
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 23s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
>
>#2
>old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 23s
>
>#3
>old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
>
>#4
>old_key = 18be678400294823, new_key = 4512153f17260b03, c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03, asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03, sse2 = 22s
>
>As you see, I need a better gauge to tell the difference between SSE and SSE2.
>
>__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
>{
> __asm
> {
> movzx eax, cl // from
> movzx edx, ch // to
> shr ecx, 10 // type * 64
> and ecx, ~63 // mask off
> movdqa xmm2, [old_key] // old_key 128
>
> add eax, ecx // type from index
> add edx, ecx // type to index
>
> movdqu xmm0, type_rnd[eax*8] // from 128
> movdqu xmm1, type_rnd[edx*8] // to 128
> pxor xmm0, xmm2
> pxor xmm0, xmm1
>
> movdqa [new_key], xmm0 // store 128
> ret
> }
>}
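For comparison, a plain C scalar equivalent of the routine above - the move decoding follows the asm (from in the low byte, to in the next byte, piece type from bit 16 up); the table size of 16*64 entries is my assumption, adjust to your own type_rnd layout:

```c
#include <stdint.h>

/* layout assumed: one 64-bit Zobrist entry per (piece type, square) */
uint64_t type_rnd[16 * 64];
uint64_t old_key, new_key;

void update_key_non_capture(int move)
{
    int from = move & 0xff;          /* movzx eax, cl */
    int to   = (move >> 8) & 0xff;   /* movzx edx, ch */
    int type = (move >> 10) & ~63;   /* type * 64     */
    new_key  = old_key ^ type_rnd[type + from] ^ type_rnd[type + to];
}
```

The SSE2 version does the same xor on 128 bits at once; only the low 64 bits of new_key form the actual hash key.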
>
>>
>>I'm interested in the assembler output of your C routine - a bit strange that
>>it performs so "badly".
>
>Not too bad for my VC6 with sp5 :) You may notice that I give the compiler some
>hints so that it does not generate too lousy instructions. Perhaps the new VC
>compiler could do better. It uses one more register and has other problems too.
>That is why I hand-wrote an asm version as a baseline. Here is the compiler
>output (options: max speed, PPro):
>
Thanks, I see - sometimes an additional register gains more parallelism ;-)
That seems not to be the case here.
>_TEXT SEGMENT
>?update_key_non_capture@@YIXH@Z PROC NEAR ; update_key_non_capture, COMDAT
>; _move$ = ecx
>; Line 28
> xor edx, edx
> mov dl, ch
> mov eax, ecx
> sar eax, 16 ; 00000010H
> movzx ecx, cl
> shl eax, 6
> add edx, eax
> add ecx, eax
> mov eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
> mov edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
> push esi
> xor eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
> mov esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
> xor eax, DWORD PTR ?old_key@@3_KA
> mov ecx, DWORD PTR ?old_key@@3_KA+4
> xor edx, esi
> xor edx, ecx
> mov DWORD PTR ?new_key@@3_KA, eax
> mov DWORD PTR ?new_key@@3_KA+4, edx
> pop esi
>; Line 29
> ret 0
>?update_key_non_capture@@YIXH@Z ENDP ; update_key_non_capture
>_TEXT ENDS
>
>
>>
>>Regards,
>>Gerd
>>
>>
>
><snip>
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.