Computer Chess Club Archives


Subject: Re: Here is some test data

Author: Gerd Isenberg

Date: 16:36:31 09/04/03



On September 04, 2003 at 19:04:03, Dezhi Zhao wrote:

>On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:
>
>>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>>
>>>Yes, SSE could beat the regular by a small margin.
>>
>>Great - but try it in a real chess program.
>>
>>In these loop tests some internal unrolling may occur, with "renamed" register
>>sets. I guess the loop body, including the called SSE routine, fits in the P4's
>>trace cache, and two (or more) bodies are executed simultaneously.
>>
>>With the gp-register approach this is only partially possible, because registers
>>are changed several times inside the body.
>
>I only did an easy timing here as you see. You can do an accurate timing with
>the processor counters and cpuid and other instructions to isolate other
>factors. However I do not think accurate timing could change the results.
>
>I think this test simulates a real chess program quite well. Only noncaptures are
>tested here.

No - you don't have such heavy hash-update loops in a chess program; there are
other instructions around each update.

>In a real program, you have to handle captures, which need more xor
>operations and create more register pressure. Therefore, SSE will help more in a
>real one.

Yes, if you do more than just the hash update, it becomes interesting in the
context of the copy-make thread here, possibly using all 128 bits.

That's the reason I will try to use them for Kogge-Stone fill attack generation
on Opteron, doing SIMD with two sets in parallel.
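
For readers unfamiliar with the technique, here is a minimal scalar sketch of a Kogge-Stone occluded fill, written in portable C++ rather than SIMD. It assumes a bitboard layout where bit 0 is a1 and a shift left by 8 moves one rank north; the names `north_occluded`, `north_occluded2` and `TwoSets` are hypothetical and only illustrate how two independent sets could share one 128-bit register, one set per 64-bit lane.

```cpp
#include <cstdint>

// Hypothetical pair of independent 64-bit sets, standing in for the two
// 64-bit lanes of one 128-bit register.
struct TwoSets { uint64_t a, b; };

// Kogge-Stone occluded fill to the north.
// gen: sliders (generators); pro: empty squares (propagators).
// Returns gen plus every empty square reachable northwards from it.
// Assumes bit 0 = a1 and "north" = shift left by 8.
uint64_t north_occluded(uint64_t gen, uint64_t pro) {
    gen |= pro & (gen << 8);    // extend by 1 rank
    pro &= pro << 8;
    gen |= pro & (gen << 16);   // extend by 2 more ranks
    pro &= pro << 16;
    gen |= pro & (gen << 32);   // extend by 4 more ranks
    return gen;
}

// The same fill applied to two sets at once; a SIMD version would run both
// lanes with a single instruction stream.
TwoSets north_occluded2(TwoSets gen, TwoSets pro) {
    return { north_occluded(gen.a, pro.a), north_occluded(gen.b, pro.b) };
}
```

Three shift/and/or steps cover all seven ranks, and every step is branch-free, which is what makes the fill a candidate for running two sets side by side.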


>
>>
>>Anyway, not that bad for P4, considering only 64 bits of 128 are used per
>>XMM register. What about MMX on P4, and what about the SSE2 integer instructions
>>movdqa, movdqu, movd and pxor? OK, one more opcode byte, but shorter latencies
>>(at least on Opteron for movdqa and pxor). I do not have P4 SSE instruction
>>latencies, but on Opteron an unaligned move is a killer because it is a
>>vector-path instruction, movups as well as movdqu.
>
>I'm not interested in MMX because emms overhead is quite large. Just tried SSE2.

Do you use floats or doubles? If not, ignore emms.
Simply use movq for all 8-byte aligned moves, and pxor.
I'm interested in whether MMX performs faster on P4 due to 64-bit rather than
128-bit instructions.
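
As an illustration of the "64-bit moves plus pxor" idea, here is a small sketch that uses SSE2 intrinsics instead of MMX (a deliberate substitution, so no emms is involved). It touches only the low 64 bits of each XMM register, so it needs no 16-byte alignment and never reads past an 8-byte table entry. The function name `xor3_64` is hypothetical, and a scalar fallback is included on the assumption that the code may be built for a target without SSE2.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Returns *key ^ *a ^ *b, moving only 64 bits per load and store.
inline uint64_t xor3_64(const uint64_t* key, const uint64_t* a, const uint64_t* b) {
#if HAVE_SSE2
    // _mm_loadl_epi64 loads 64 bits into the low half and zeroes the upper half.
    __m128i k = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(key));
    __m128i x = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(a));
    __m128i y = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    k = _mm_xor_si128(k, x);   // pxor
    k = _mm_xor_si128(k, y);   // pxor
    uint64_t out;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&out), k);  // 64-bit store
    return out;
#else
    return *key ^ *a ^ *b;     // scalar fallback when SSE2 is unavailable
#endif
}
```

Whether this beats the movdqa/movdqu variant on a given CPU is exactly the kind of question the timings in this thread are meant to answer; the sketch makes no claim either way.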

>Here is the results of 4 runs:
>
>#1
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 23s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s
>
>#2
>old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 23s
>
>#3
>old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s
>
>#4
>old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s
>
>As you see, I need a better gauge to tell the difference between SSE and SSE2.
>
>__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
>{
>	__asm
>	{
>		movzx	eax, cl         // from
>		movzx	edx, ch		// to
>		shr	ecx, 10		// type * 64
>		and	ecx, ~63	// mask off
>		movdqa	xmm2, [old_key]	// old_key 128
>
>		add	eax, ecx	// type from index
>		add	edx, ecx	// type to index
>
>		movdqu	xmm0, type_rnd[eax*8]	// from 128
>		movdqu	xmm1, type_rnd[edx*8]	// to 128
>		pxor	xmm0, xmm2
>		pxor	xmm0, xmm1
>
>		movdqa	[new_key], xmm0		// store 128
>		ret
>	}
>}
>
>>
>>I'm interested in the assembler output of your C routine - a bit strange that it
>>performs so "badly".
>
>Not too bad for my VC6 with SP5 :) You may notice that I give the compiler some
>hints so that it does not generate too lousy instructions. Perhaps the new VC
>compiler could do better. It uses one more register and has other problems too.
>That is why I hand-wrote an asm version as a baseline. Here is the compiler
>output (options: max speed, PPro):
>


Thanks, I see - sometimes an additional register gains more parallelism ;-)
That does not seem to be the case here.
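
For orientation, the compiler listing quoted below corresponds roughly to the following C++ routine. This is a reconstruction from the assembly, not the original source: the move encoding (from-square in bits 0-7, to-square in bits 8-15, piece type from bit 16 upwards) is read off the shifts and masks, the 64-entry row size comes from the mangled `type_rnd` symbol, and the row count `kTypes` is a pure assumption.

```cpp
#include <cstdint>

constexpr int kTypes = 16;        // assumption: number of piece types is not given in the thread

uint64_t type_rnd[kTypes][64];    // random 64-bit numbers per piece type and square
uint64_t old_key;                 // hash key before the move
uint64_t new_key;                 // hash key after the move

// Hash update for a non-capturing move: xor out the piece on the
// from-square and xor it in on the to-square.
void update_key_non_capture(int move) {
    unsigned from = static_cast<unsigned>(move) & 0xff;         // movzx ecx, cl
    unsigned to   = (static_cast<unsigned>(move) >> 8) & 0xff;  // mov dl, ch
    int type      = move >> 16;                                 // sar eax, 16
    new_key = old_key ^ type_rnd[type][from] ^ type_rnd[type][to];
}
```

On a 32-bit target each 64-bit xor splits into two 32-bit halves, which is why the listing needs four data registers (eax, edx, ecx, esi) for what is a single expression in the source.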


>_TEXT	SEGMENT
>?update_key_non_capture@@YIXH@Z PROC NEAR		; update_key_non_capture, COMDAT
>; _move$ = ecx
>; Line 28
>	xor	edx, edx
>	mov	dl, ch
>	mov	eax, ecx
>	sar	eax, 16					; 00000010H
>	movzx	ecx, cl
>	shl	eax, 6
>	add	edx, eax
>	add	ecx, eax
>	mov	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
>	mov	edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
>	push	esi
>	xor	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
>	mov	esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
>	xor	eax, DWORD PTR ?old_key@@3_KA
>	mov	ecx, DWORD PTR ?old_key@@3_KA+4
>	xor	edx, esi
>	xor	edx, ecx
>	mov	DWORD PTR ?new_key@@3_KA, eax
>	mov	DWORD PTR ?new_key@@3_KA+4, edx
>	pop	esi
>; Line 29
>	ret	0
>?update_key_non_capture@@YIXH@Z ENDP			; update_key_non_capture
>_TEXT	ENDS
>
>
>>
>>Regards,
>>Gerd
>>
>>
>
><snip>




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.