Computer Chess Club Archives

Subject: Re: Help! Visual C++ intrinsics! (Nalimov, are you around here somewhere

Author: Gerd Isenberg
Date: 09:05:57 10/02/03
On October 02, 2003 at 09:27:04, Anthony Cozzie wrote:

>typedef __int64 __declspec(align(8)) bb;
>bb test(bb  *a, bb *b);
>
>//stupid solution generation and their stupid precompiled headers
>int _tmain(int argc, _TCHAR* argv[])
>{
>	int t;
>	bb a, b, c;
>
>	a = 0x53243347832;
>	b = 0xFFA43873482;
>
>	scanf("%d", &t);
>
>	a = a | t;
>
>	c = test(&a, &b);
>
>	printf("%I64d\n", (__int64)c);
>
>	return 0;
>}
>
>bb test(bb  *a, bb *b)
>{
>00401080  push        ebp
>00401081  mov         ebp,esp
>00401083  and         esp,0FFFFFFF8h
>00401086  sub         esp,18h
>	__m64 _a, _b;
>
>	_a.m64_u64 = *a;
>00401089  mov         ecx,dword ptr [ebp+8]
>0040108C  mov         eax,dword ptr [___security_cookie (40A040h)]
>00401091  mov         edx,dword ptr [ecx+4]
>00401094  mov         dword ptr [esp+14h],eax
>00401098  mov         eax,dword ptr [ecx]
>0040109A  mov         dword ptr [esp+8],eax
>	_b.m64_u64 = *b;
>0040109E  mov         eax,dword ptr [b]
>004010A1  mov         dword ptr [esp+0Ch],edx
>004010A5  mov         edx,dword ptr [eax]
>004010A7  mov         eax,dword ptr [eax+4]
>004010AA  mov         dword ptr [esp],edx
>004010AD  mov         dword ptr [esp+4],eax
>
>	_a = _m_pxor(_a, _b);
>004010B1  movq        mm0,mmword ptr [esp]
>004010B5  movq        mm1,mmword ptr [esp+8]
>004010BA  pxor        mm1,mm0
>
>	*a = _a.m64_u64;
>004010BD  movq        mmword ptr [ecx],mm1
>
>	return *a;
>004010C0  mov         eax,dword ptr [ecx]
>004010C2  mov         edx,dword ptr [ecx+4]

Oops, that is worse too. OK, it is better than two movds, but in general I think
it is not a good idea with MMX or SSE2 intrinsics to return bitboards via edx:eax
or rax. The same probably applies to passing bitboards by value on the stack.
Better to use some class/struct pointer and load/store the 8-byte (16-byte)
aligned data directly via movq (movdqa).
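
A minimal sketch of the struct-pointer idea, assuming SSE2 is available and a compiler with C++11 `alignas` (with the Visual C++ of this thread, `__declspec(align(16))` would play that role). `BitboardPair` and `xor_in_place` are hypothetical names made up for illustration, not code from either poster:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical container: two bitboards packed into one 16-byte-aligned unit.
struct alignas(16) BitboardPair {
    std::uint64_t bb[2];
};

// XOR *src into *dst in place. Both operands are passed by pointer and moved
// with aligned 128-bit loads/stores (movdqa), so no bitboard is returned
// through edx:eax or rax.
inline void xor_in_place(BitboardPair* dst, const BitboardPair* src) {
    // Precondition for the aligned load/store intrinsics.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(dst));
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(src));
    _mm_store_si128(reinterpret_cast<__m128i*>(dst), _mm_xor_si128(a, b));
}
```

Reading `dst->bb[0]` back as a scalar immediately after the 128-bit store would be a store/load size mismatch again, so a design along these lines only pays off if the follow-up work also stays in vector registers.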

See the AMD Athlon Processor x86 Code Optimization Guide,
Chapter 5, Cache and Memory Optimizations:

Top!
Avoid Memory Size Mismatches

Avoid memory size mismatches when different instructions
operate on the same data. When an instruction stores and
another instruction reloads the same data, keep their operands
aligned and keep the loads/stores of each operand the same size.
The following code examples result in a
store-to-load-forwarding (STLF) stall:

Example (avoid):
MOVQ [foo],MM0
...
MOV EAX,[foo]
MOV EDX,[foo+4]

Example (preferred):
MOVD [foo],MM0
PSWAPD MM0,MM0
MOVD [foo+4],MM0
PSWAPD MM0,MM0
...
MOV EAX,[foo]
MOV EDX,[foo+4]

Example (preferred if the contents of MM0 is no longer needed):
MOVD [foo],MM0
PUNPCKHDQ MM0,MM0
MOVD [foo+4],MM0
...
MOV EAX,[foo]
MOV EDX,[foo+4]

Example (preferred if the stores and loads are close together, option 1):
MOVD EAX,MM0
PSWAPD MM0,MM0
MOVD EDX,MM0
PSWAPD MM0,MM0

Example (preferred if the stores and loads are close together, option 2):
MOVD EAX,MM0
PUNPCKHDQ MM0,MM0
MOVD EDX,MM0

>}
>004010C5  mov         ecx,dword ptr [esp+14h]
>004010C9  call        __security_check_cookie (40114Bh)
>004010CE  mov         esp,ebp
>004010D0  pop         ebp
>004010D1  ret
>
>So here we have a function that is given a known 8-byte aligned integer, and it
>*still* has to go through the registers.  From a quick glance at the code, it
>looks like the pointers are passed in eax/ecx.  This whole piece of code could
>be compressed down to
>
>movq      mm0, mmword ptr [ecx]
>...
>
>I wish I could find a "load" intrinsic for 64 bits, but there doesn't seem to be
>one.

Yes, there are only some "set" functions, nothing similar to _mm_load_si128 or
_mm_store_si128. One more reason to use SSE2 only for AMD64.
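
For comparison, a hedged sketch of what the SSE2 side looks like: `<emmintrin.h>` in current compilers declares `_mm_loadl_epi64` and `_mm_storel_epi64`, which move 64 bits between memory and the low half of an XMM register. Whether this suits a given engine is an open question; `xor_bb` is a hypothetical helper, and this is an SSE2 alternative, not the missing `__m64` load that was asked for:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// XOR *b into *a, touching memory only through 64-bit vector loads/stores.
// The upper 64 bits of each XMM register are zeroed by the load and ignored
// by the store.
inline void xor_bb(std::uint64_t* a, const std::uint64_t* b) {
    assert(a != nullptr && b != nullptr);

    __m128i va = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b));
    _mm_storel_epi64(reinterpret_cast<__m128i*>(a), _mm_xor_si128(va, vb));
}
```

The function returns nothing on purpose: the result stays in memory at `*a`, so the caller decides when, and at what width, to read it back.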

Gerd


>
>anthony
Last modified: Thu, 15 Apr 21 08:11:13 -0700