Author: Gerd Isenberg
Date: 09:05:57 10/02/03
On October 02, 2003 at 09:27:04, Anthony Cozzie wrote:
>typedef __int64 __declspec(align(8)) bb;
>bb test(bb *a, bb *b);
>
>//stupid solution generation and their stupid precompiled headers
>int _tmain(int argc, _TCHAR* argv[])
>{
> int t;
> bb a, b, c;
>
> a = 0x53243347832;
> b = 0xFFA43873482;
>
> scanf("%d", &t);
>
> a = a | t;
>
> c = test(&a, &b);
>
> printf("%I64d\n", (__int64)c);
>
> return 0;
>}
>
>bb test(bb *a, bb *b)
>{
>00401080 push ebp
>00401081 mov ebp,esp
>00401083 and esp,0FFFFFFF8h
>00401086 sub esp,18h
> __m64 _a, _b;
>
> _a.m64_u64 = *a;
>00401089 mov ecx,dword ptr [ebp+8]
>0040108C mov eax,dword ptr [___security_cookie (40A040h)]
>00401091 mov edx,dword ptr [ecx+4]
>00401094 mov dword ptr [esp+14h],eax
>00401098 mov eax,dword ptr [ecx]
>0040109A mov dword ptr [esp+8],eax
> _b.m64_u64 = *b;
>0040109E mov eax,dword ptr [b]
>004010A1 mov dword ptr [esp+0Ch],edx
>004010A5 mov edx,dword ptr [eax]
>004010A7 mov eax,dword ptr [eax+4]
>004010AA mov dword ptr [esp],edx
>004010AD mov dword ptr [esp+4],eax
>
> _a = _m_pxor(_a, _b);
>004010B1 movq mm0,mmword ptr [esp]
>004010B5 movq mm1,mmword ptr [esp+8]
>004010BA pxor mm1,mm0
>
> *a = _a.m64_u64;
>004010BD movq mmword ptr [ecx],mm1
>
> return *a;
>004010C0 mov eax,dword ptr [ecx]
>004010C2 mov edx,dword ptr [ecx+4]
Oops, that is worse too. OK, it is better than two movds, but in general I think
it is not a good idea, with MMX or SSE2 intrinsics, to return bitboards via
edx:eax or rax. The same probably goes for passing bitboards on the stack. It is
better to pass a class/struct pointer and load/store the 8 (16) byte aligned
data directly via movq (movdqa).
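A minimal sketch of that idea, assuming an SSE2 target. The names bb_pair and bb_pair_xor are hypothetical and only for illustration. The union with an __m128i member forces 16-byte alignment, so the aligned load/store intrinsics (movdqa) are legal, and the result is written back through the pointer instead of being returned in edx:eax or rax.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical type: two bitboards sharing one 16-byte aligned slot.
   The __m128i member gives the union its 16-byte alignment. */
typedef union {
    uint64_t bb[2];
    __m128i  v;
} bb_pair;

/* XOR *src into *dst in place. Both operands are loaded and stored
   at full width, so no 64-bit value travels through a register pair. */
static void bb_pair_xor(bb_pair *dst, const bb_pair *src)
{
    __m128i a = _mm_load_si128(&dst->v);        /* aligned 16-byte load  */
    __m128i b = _mm_load_si128(&src->v);
    _mm_store_si128(&dst->v, _mm_xor_si128(a, b)); /* aligned 16-byte store */
}
```

Whether the compiler really emits a single movdqa per load and store depends on the compiler and its settings, so the disassembly is still worth checking.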
See the AMD Athlon(TM) Processor x86 Code Optimization Guide,
Chapter 5, "Cache and Memory Optimizations":
Avoid Memory Size Mismatches
Avoid memory size mismatches when different instructions
operate on the same data. When an instruction stores and
another instruction reloads the same data, keep their operands
aligned and keep the loads/stores of each operand the same size.
The following code examples result in a
store-to-load-forwarding (STLF) stall:
Example (avoid):
MOVQ [foo],MM0
...
MOV EAX,[foo]
MOV EDX,[foo+4]
Example (preferred):
MOVD [foo],MM0
PSWAPD MM0,MM0
MOVD [foo+4],MM0
PSWAPD MM0,MM0
...
MOV EAX,[foo]
MOV EDX,[foo+4]
Example (preferred if the contents of MM0 is no longer needed):
MOVD [foo],MM0
PUNPCKHDQ MM0,MM0
MOVD [foo+4],MM0
...
MOV EAX,[foo]
MOV EDX,[foo+4]
Example (preferred if the stores and loads are close together, option 1):
MOVD EAX,MM0
PSWAPD MM0,MM0
MOVD EDX,MM0
PSWAPD MM0,MM0
Example (preferred if the stores and loads are close together, option 2):
MOVD EAX,MM0
PUNPCKHDQ MM0,MM0
MOVD EDX,MM0
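The register-to-register variants above can be approximated with intrinsics. This is only a sketch, assuming SSE2 rather than MMX/3DNow!, and the helper names bb_lo32 and bb_hi32 are made up for illustration. It uses a 64-bit right shift where the guide uses PSWAPD or PUNPCKHDQ, so it is a swapped-in technique with the same goal: get both 32-bit halves without a wide store followed by two narrow loads.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of the 64-bit value in the low
   half of v, moved register to register (movd r32, xmm). */
static uint32_t bb_lo32(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}

/* Hypothetical helper: high 32 bits of that same 64-bit value.
   Shift the quadword right by 32 (psrlq), then movd as above. */
static uint32_t bb_hi32(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(_mm_srli_epi64(v, 32));
}
```

The compiler is free to schedule this differently, so it does not guarantee the exact instruction sequence from the guide.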
>}
>004010C5 mov ecx,dword ptr [esp+14h]
>004010C9 call __security_check_cookie (40114Bh)
>004010CE mov esp,ebp
>004010D0 pop ebp
>004010D1 ret
>
>So here we have a function that is given a known 8-byte aligned integer, and it
>*still* has to go through the registers. From a quick glance at the code, it
>looks like the pointers are passed in eax/ecx. This whole piece of code could
>be compressed down to
>
>movq mm0, mmword ptr [ecx]
>...
>
>I wish I could find a "load" intrinsic for 64 bits, but there doesn't seem to be
>one.
Yes, there are only some "set" functions, nothing like _mm_load_si128 or
_mm_store_si128. One more reason to use SSE2 only for AMD64.
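For comparison, a small sketch of what the SSE2 side offers for a single 64-bit bitboard: _mm_loadl_epi64 and _mm_storel_epi64 move 8 bytes between memory and the low half of an xmm register (a movq) and do not require 16-byte alignment. The function name bb_xor_inplace is hypothetical, and this assumes an SSE2-capable target; it is not a replacement for the missing MMX load.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: XOR the bitboard at *b into *a, in place,
   using 64-bit wide loads and a 64-bit wide store. */
static void bb_xor_inplace(uint64_t *a, const uint64_t *b)
{
    __m128i x = _mm_loadl_epi64((const __m128i *)a);  /* 8-byte load  */
    __m128i y = _mm_loadl_epi64((const __m128i *)b);
    _mm_storel_epi64((__m128i *)a, _mm_xor_si128(x, y)); /* 8-byte store */
}
```

Because the load and the store have the same size as the data, a later 64-bit reload of *a should not hit the size-mismatch case described in the AMD guide.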
Gerd
>
>anthony
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.