Computer Chess Club Archives


Subject: Re: Here is some test data

Author: Dezhi Zhao

Date: 16:04:03 09/04/03



On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:

>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>
>>Yes, SSE could beat the regular by a small margin.
>
>Great - but try it in a real chess program.
>
>In these loop tests some internal unrolling may occur, with "renamed" register
>sets. I guess the loop body, including the called SSE routine, fits in the P4's
>trace cache, and two (or more) bodies are executed simultaneously.
>
>With the gp-register approach this is only partially possible, because the
>registers are changed several times inside the body.

As you can see, I only did a simple timing here. You could do a more accurate
timing with the processor counters, cpuid and other instructions to isolate
other factors. However, I do not think more accurate timing would change the
results.
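
For reference, here is a minimal sketch of what timing with the processor counter and cpuid could look like. It assumes GCC or Clang on x86 (VC6 inline asm syntax differs) and falls back to clock() elsewhere. The names read_ticks, min_ticks and sample_work are made up for illustration and are not from the test program in this thread. Taking the minimum over several runs is one common way to filter out interrupts and other noise.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption: GCC or Clang on x86. Elsewhere we fall back to clock(). */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

/* Read a tick count. On x86, cpuid runs first so that earlier
   instructions have completed before rdtsc samples the counter. */
static uint64_t read_ticks(void)
{
#if HAVE_TSC
    uint32_t a, b, c, d, lo, hi;
    __asm__ __volatile__("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "0"(0), "2"(0)
                         : "memory");
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    (void)a; (void)b; (void)c; (void)d;   /* cpuid results are not needed */
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();
#endif
}

typedef void (*bench_fn)(void);

/* Call fn `runs` times and return the smallest tick difference seen. */
static uint64_t min_ticks(bench_fn fn, int runs)
{
    uint64_t best = UINT64_MAX;
    assert(runs > 0);
    for (int r = 0; r < runs; r++) {
        uint64_t t0 = read_ticks();
        fn();
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Hypothetical stand-in for the routine under test. */
static int calls;
static volatile uint64_t sink;

static void sample_work(void)
{
    calls++;
    for (uint64_t i = 0; i < 1000; i++)
        sink ^= i * 0x9e3779b97f4a7c15ULL;
}
```

To compare two routines, call min_ticks on each with the same number of runs and compare the two minima. The tick counts also include the cost of cpuid itself, so they only mean something relative to each other.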

I think this test simulates a real chess program quite well. Only noncaptures
are tested here. In a real program, you have to handle captures, which need more
xor operations and create more register pressure. Therefore, SSE will help more
in a real one.
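
To make the noncapture/capture difference concrete, here is a rough sketch in plain C. The move layout (from in bits 0-7, to in bits 8-15, piece type from bit 16 up) and the type_rnd[type][64] table follow the code posted in this thread. Everything else is an assumption for illustration: NUM_TYPES, the xorshift fill, make_move, and the captured_type argument of the capture routine.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TYPES 12            /* assumption: 6 piece kinds x 2 colours */

/* Random numbers indexed as [piece type][square]. */
static uint64_t type_rnd[NUM_TYPES][64];

/* Fill the table with a xorshift64 generator (illustrative only). */
static void init_type_rnd(uint64_t seed)
{
    uint64_t x = seed ? seed : 1;       /* state must be non-zero */
    for (int t = 0; t < NUM_TYPES; t++)
        for (int sq = 0; sq < 64; sq++) {
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            type_rnd[t][sq] = x;
        }
}

/* Pack a move: from = bits 0-7, to = bits 8-15, type = bits 16 and up. */
static int make_move(int type, int from, int to)
{
    assert(type >= 0 && type < NUM_TYPES);
    assert(from >= 0 && from < 64 && to >= 0 && to < 64);
    return (type << 16) | (to << 8) | from;
}

/* Noncapture: two XORs, one to lift the piece and one to drop it. */
static uint64_t update_key_non_capture(uint64_t key, int move)
{
    int from = move & 0xff;
    int to   = (move >> 8) & 0xff;
    int type = move >> 16;
    return key ^ type_rnd[type][from] ^ type_rnd[type][to];
}

/* Capture: a third XOR removes the captured piece from the target
   square. captured_type is a hypothetical extra argument. */
static uint64_t update_key_capture(uint64_t key, int move, int captured_type)
{
    int to = (move >> 8) & 0xff;
    assert(captured_type >= 0 && captured_type < NUM_TYPES);
    return update_key_non_capture(key, move) ^ type_rnd[captured_type][to];
}
```

Because XOR is its own inverse, applying the reversed noncapture move to the new key gives back the old key, which is a handy sanity check for any hand-written asm version.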

>
>Anyway not that bad for P4, considering only 64-bits of 128 used per
>XMM-register. What about MMX on P4 and what about SSE2-integer instructions,
>movdqa, movdqu, movd and pxor? Ok one byte more opcode, but shorter latencies
>(at least on opteron for movdqa and pxor). I do not have P4 SSE-instruction
>latencies, but for opteron, mov unaligned is a killer due to vector path
>instruction, movups as well as movdqu.

I'm not interested in MMX because the emms overhead is quite large. I just tried
SSE2. Here are the results of 4 runs:

#1
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 23s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s

#2
old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 23s

#3
old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s

#4
old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s

As you can see, I need a better gauge to tell the difference between SSE and SSE2.

__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
{
	__asm
	{
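		movzx	eax, cl		// from square (bits 0-7 of move)
		movzx	edx, ch		// to square (bits 8-15 of move)
		shr	ecx, 10		// type * 64, plus leftover bits of "to"
		and	ecx, ~63	// mask off the leftover bits
		movdqa	xmm2, [old_key]	// 128-bit load, old_key must be 16-byte aligned

		add	eax, ecx	// index of type_rnd[type][from]
		add	edx, ecx	// index of type_rnd[type][to]

		movdqu	xmm0, type_rnd[eax*8]	// unaligned 128-bit load at the "from" entry
		movdqu	xmm1, type_rnd[edx*8]	// unaligned 128-bit load at the "to" entry
		pxor	xmm0, xmm2
		pxor	xmm0, xmm1

		movdqa	[new_key], xmm0		// 128-bit store, only the low 64 bits are the key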
		ret
	}
}
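
One variant that might be worth timing is to move only 64 bits with movq instead of 128 bits with movdqa/movdqu. The sketch below uses the SSE2 intrinsics _mm_loadl_epi64 and _mm_storel_epi64 for that, with a plain C fallback when SSE2 is not available. It is not the routine measured above and has not been benchmarked. The function name, the flat type_rnd layout and passing the key by value are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

#define NUM_TYPES 12                 /* assumption: 6 piece kinds x 2 colours */

/* Flat table, entry for (type, square) lives at [type * 64 + square]. */
static uint64_t type_rnd[NUM_TYPES * 64];

/* Hypothetical 64-bit variant of the noncapture key update. */
static uint64_t update_key_non_capture_64(uint64_t old_key, int move)
{
    int from = move & 0xff;
    int to   = (move >> 8) & 0xff;
    int base = (move >> 10) & ~63;   /* type * 64, same trick as the asm */

    assert(base >= 0 && base < NUM_TYPES * 64);
#if HAVE_SSE2
    /* Each of these moves 64 bits; the upper half of the register is zero. */
    __m128i k = _mm_loadl_epi64((const __m128i *)&old_key);
    __m128i f = _mm_loadl_epi64((const __m128i *)&type_rnd[base + from]);
    __m128i t = _mm_loadl_epi64((const __m128i *)&type_rnd[base + to]);
    uint64_t new_key;

    k = _mm_xor_si128(_mm_xor_si128(k, f), t);
    _mm_storel_epi64((__m128i *)&new_key, k);
    return new_key;
#else
    return old_key ^ type_rnd[base + from] ^ type_rnd[base + to];
#endif
}
```

The 64-bit loads have no alignment requirement beyond what a uint64_t already has, and they never touch memory past the table entry or the key. Whether that is faster or slower than the 128-bit version on a P4 would have to be measured.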

>
>I'm interested in the assembler output of your C-routine - a bit strange that it
>performs so "badly".

Not too bad for my VC6 with sp5:) You may notice that I give the compiler some
hints so that it does not generate too lousy instructions. Perhaps the new VC
compiler could do better. The generated code uses one more register and has
other problems too. That is why I hand-wrote an asm version as a baseline. Here
is the compiler output (options: max speed, PPro):

_TEXT	SEGMENT
?update_key_non_capture@@YIXH@Z PROC NEAR		; update_key_non_capture, COMDAT
; _move$ = ecx
; Line 28
	xor	edx, edx
	mov	dl, ch
	mov	eax, ecx
	sar	eax, 16					; 00000010H
	movzx	ecx, cl
	shl	eax, 6
	add	edx, eax
	add	ecx, eax
	mov	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
	mov	edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
	push	esi
	xor	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
	mov	esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
	xor	eax, DWORD PTR ?old_key@@3_KA
	mov	ecx, DWORD PTR ?old_key@@3_KA+4
	xor	edx, esi
	xor	edx, ecx
	mov	DWORD PTR ?new_key@@3_KA, eax
	mov	DWORD PTR ?new_key@@3_KA+4, edx
	pop	esi
; Line 29
	ret	0
?update_key_non_capture@@YIXH@Z ENDP			; update_key_non_capture
_TEXT	ENDS


>
>Regards,
>Gerd
>
>

<snip>



