Computer Chess Club Archives


Subject: Re: Here is some test data

Author: Dezhi Zhao

Date: 17:07:09 09/04/03


On September 04, 2003 at 19:36:31, Gerd Isenberg wrote:

>On September 04, 2003 at 19:04:03, Dezhi Zhao wrote:
>
>>On September 04, 2003 at 17:25:38, Gerd Isenberg wrote:
>>
>>>On September 04, 2003 at 15:08:37, Dezhi Zhao wrote:
>>>
>>>>Yes, SSE could beat the regular by a small margin.
>>>
>>>Great - but try it in a real chess program.
>>>
>>>In these loop tests some internal unrolling may occur, with "renamed" register
>>>sets. I guess the loop-body including the called SSE-routine fits in P4's trace
>>>cache, and two (or more) bodies are executed simultaneously.
>>>
>>>With the gp-register approach this is only partially possible, because registers
>>>are changed several times inside the body.
>>
>>I only did a simple timing here, as you see. You can do an accurate timing with
>>the processor counters, cpuid and other instructions to isolate other
>>factors. However, I do not think accurate timing would change the results.
>>
>>I think this test simulates a real chess program quite well. Only non-captures
>>are tested here.
>
>No - you don't have such heavy hash-update loops in a chess program; there are
>other instructions around them.

The loops here only serve as an amplifier, so that you know how much time
you spend on a single iteration, which is what matters in a real chess program.
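
To make the amplifier idea concrete, here is a minimal C++ sketch. It is not the
test program from this thread. The table size of 16 piece types, the xorshift
filler in init_tables and the helper ns_per_update are assumptions made for the
example; the move layout (from in the low byte, to in the next byte, type from
bit 16 up) is read off the assembly quoted below.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-ins for the globals used in the thread.
static std::uint64_t type_rnd[16][64];   // assumed: 16 piece types x 64 squares
static std::uint64_t old_key, new_key;

// Plain C++ non-capture update: xor out the from-square, xor in the to-square.
static void update_key_non_capture(int move) {
    const int from = move & 0xFF;          // bits 0-7
    const int to   = (move >> 8) & 0xFF;   // bits 8-15
    const int type = move >> 16;           // bits 16 and up
    assert(from < 64 && to < 64 && type >= 0 && type < 16);
    new_key = old_key ^ type_rnd[type][from] ^ type_rnd[type][to];
}

// Fill the table with a simple xorshift sequence (illustrative only).
static void init_tables() {
    std::uint64_t s = 0x9E3779B97F4A7C15ULL;
    for (auto& row : type_rnd)
        for (auto& v : row) {
            s ^= s << 13; s ^= s >> 7; s ^= s << 17;
            v = s;
        }
    old_key = s;
}

// Run `iterations` updates and return the average cost of one call in ns.
static double ns_per_update(long iterations, int move) {
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        update_key_non_capture(move);
        old_key = new_key;   // each iteration depends on the previous one
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling init_tables() and then ns_per_update(100000000L, move) gives the average
time of one update. The loop count only scales the total up to something a
coarse clock can measure; the per-call figure is the one to compare.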

I know that you are concerned about the trace cache. However, all these functions
are so small that all of them would fit in the trace cache easily.

I think Dann Corbit (sorry I'm not sure I spell his name correctly) posted an
accurate timing function here years ago. Maybe we can use it if somebody can
find it.

>
>>In a real program, you have to handle captures, which need more xor
>>operations and create more register pressure. Therefore, SSE will help more in
>>a real one.
>
>Yes, if you do even more than only hashupdate, it becomes interesting in the
>context of the copymake thread here, possibly using all 128-bits.
>
>That's the reason I will try to use them for Kogge-Stone fill attack generation
>on Opteron, doing SIMD with two sets in parallel.
>
>
>>
>>>
>>>Anyway not that bad for P4, considering only 64-bits of 128 used per
>>>XMM-register. What about MMX on P4 and what about SSE2-integer instructions,
>>>movdqa, movdqu, movd and pxor? Ok one byte more opcode, but shorter latencies
>>>(at least on opteron for movdqa and pxor). I do not have P4 SSE-instruction
>>>latencies, but for opteron, mov unaligned is a killer due to vector path
>>>instruction, movups as well as movdqu.
>>
>>I'm not interested in MMX because the emms overhead is quite large. Just tried SSE2.
>
>Do you use floats or doubles? If not, ignore emms.
>Use simply movd for all 8-byte aligned moves, and pxor.
>I'm interested in whether MMX performs faster due to 64-bit rather than 128-bit
>instructions on P4.

I don't use floats, but I'm not sure whether MFC uses them. I'm going to write an
MMX function tomorrow.

>
>>Here are the results of 4 runs:
>>
>>#1
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 23s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s
>>
>>#2
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 23s
>>
>>#3
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s
>>
>>#4
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  c++ = 30s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  asm = 26s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse = 22s
>>old_key = 18be678400294823, new_key = 4512153f17260b03,  sse2 = 22s
>>
>>As you see, I need a better gauge to tell the difference between SSE and SSE2.
>>
>>__declspec(naked) void __fastcall update_key_non_capture_sse2(int move)
>>{
>>	__asm
>>	{
>>		movzx	eax, cl         // from
>>		movzx	edx, ch		// to
>>		shr	ecx, 10		// type * 64
>>		and	ecx, ~63	// mask off
>>		movdqa	xmm2, [old_key]	// old_key 128
>>
>>		add	eax, ecx	// type from index
>>		add	edx, ecx	// type to index
>>
>>		movdqu	xmm0, type_rnd[eax*8]	// from 128
>>		movdqu	xmm1, type_rnd[edx*8]	// to 128
>>		pxor	xmm0, xmm2
>>		pxor	xmm0, xmm1
>>
>>		movdqa	[new_key], xmm0		// store 128 (key is in the low 64 bits)
>>		ret
>>	}
>>}
>>
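
For readers who prefer intrinsics to inline assembly, here is a hedged sketch of
the same non-capture update. It is not the routine timed above: it uses 64-bit
loads and stores (_mm_loadl_epi64 / _mm_storel_epi64) instead of the 128-bit
movdqa/movdqu, so only the 8 bytes of each key are touched. The table size of 16
types and the HAVE_SSE2 macro are assumptions made for the example, and the code
falls back to plain C++ where SSE2 is not available.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 integer intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Hypothetical stand-ins for the globals used in the thread.
static std::uint64_t type_rnd[16][64];   // assumed: 16 piece types x 64 squares
static std::uint64_t old_key, new_key;

static void update_key_non_capture_sse2(int move) {
    const int from = move & 0xFF;          // bits 0-7
    const int to   = (move >> 8) & 0xFF;   // bits 8-15
    const int type = move >> 16;           // bits 16 and up
    assert(from < 64 && to < 64 && type >= 0 && type < 16);
#if HAVE_SSE2
    // Load the low 64 bits of each operand; the upper halves are zeroed.
    __m128i k = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&old_key));
    const __m128i f =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&type_rnd[type][from]));
    const __m128i t =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&type_rnd[type][to]));
    k = _mm_xor_si128(_mm_xor_si128(k, f), t);   // two pxor, as in the asm
    // Store only the low 64 bits, so nothing after new_key is overwritten.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&new_key), k);
#else
    // Scalar fallback with the same result.
    new_key = old_key ^ type_rnd[type][from] ^ type_rnd[type][to];
#endif
}
```

Whether the 64-bit forms are faster or slower than the 128-bit ones on a given
CPU would have to be measured; the sketch only shows the equivalent operation.
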
>>>
>>>I'm interested in the assembler output of your C-routine - a bit strange that it
>>>performs so "badly".
>>
>>Not too bad for my VC6 with sp5:) You may notice that I give the compiler some
>>hints so that it does not generate too lousy instructions. Perhaps the new VC
>>compiler could do better. It uses one more register and has other problems too.
>>That is why I hand-wrote an asm version as a baseline. Here is the compiler
>>output (options: max speed, PPro):
>>
>
>
>Thanks, I see - sometimes an additional register gains more parallelism ;-)
>Seems not to be the case here.

If you use more registers than necessary, the CPU has to do less register
renaming work. Another obvious downside is more push and pop operations.

>
>
>>_TEXT	SEGMENT
>>?update_key_non_capture@@YIXH@Z PROC NEAR		; update_key_non_capture, COMDAT
>>; _move$ = ecx
>>; Line 28
>>	xor	edx, edx
>>	mov	dl, ch
>>	mov	eax, ecx
>>	sar	eax, 16					; 00000010H
>>	movzx	ecx, cl
>>	shl	eax, 6
>>	add	edx, eax
>>	add	ecx, eax
>>	mov	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8]
>>	mov	edx, DWORD PTR ?type_rnd@@3PAY0EA@_KA[edx*8+4]
>>	push	esi
>>	xor	eax, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8]
>>	mov	esi, DWORD PTR ?type_rnd@@3PAY0EA@_KA[ecx*8+4]
>>	xor	eax, DWORD PTR ?old_key@@3_KA
>>	mov	ecx, DWORD PTR ?old_key@@3_KA+4
>>	xor	edx, esi
>>	xor	edx, ecx
>>	mov	DWORD PTR ?new_key@@3_KA, eax
>>	mov	DWORD PTR ?new_key@@3_KA+4, edx
>>	pop	esi
>>; Line 29
>>	ret	0
>>?update_key_non_capture@@YIXH@Z ENDP			; update_key_non_capture
>>_TEXT	ENDS
>>
>>
>>>
>>>Regards,
>>>Gerd
>>>
>>>
>>
>><snip>
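
On the capture case mentioned above: a hedged sketch of why it costs more xor
operations. The capture encoding of the test program is not shown in this
thread, so the explicit parameters, the function names and the 16-type table
are assumptions for illustration. The scheme itself is the usual one: the
captured piece is also xored out on the target square.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the random table used in the thread.
static std::uint64_t type_rnd[16][64];   // assumed: 16 piece types x 64 squares

// Fill the table with a simple xorshift sequence (illustrative only).
static void fill_type_rnd() {
    std::uint64_t s = 0x2545F4914F6CDD1DULL;
    for (auto& row : type_rnd)
        for (auto& v : row) {
            s ^= s << 13; s ^= s >> 7; s ^= s << 17;
            v = s;
        }
}

// Non-capture: two table lookups, two xors.
static std::uint64_t key_after_move(std::uint64_t key, int type, int from, int to) {
    assert(type >= 0 && type < 16 && from >= 0 && from < 64 && to >= 0 && to < 64);
    return key ^ type_rnd[type][from] ^ type_rnd[type][to];
}

// Capture: one more lookup and xor to remove the captured piece from `to`.
static std::uint64_t key_after_capture(std::uint64_t key, int type, int from,
                                       int to, int captured_type) {
    assert(captured_type >= 0 && captured_type < 16);
    return key_after_move(key, type, from, to) ^ type_rnd[captured_type][to];
}
```

Because xor is its own inverse, applying key_after_capture a second time with
the same arguments restores the original key, which is handy when unmaking a
move.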




Last modified: Thu, 15 Apr 21 08:11:13 -0700
