Author: Eugene Nalimov
Date: 13:16:40 11/24/05
On November 24, 2005 at 14:41:16, Gerd Isenberg wrote:
>...
>>A sunken itanium ship with a price of like 8000 dollar a chip, for a chip not
>>yet there, but only if you buy a 1000 of them, otherwise it'll be more like
>>20000 dollar a chip, is not a good compare. Despite intel giving masses of those
>>chips for free to SGI, SGI despite all that still has been removed from the New
>>York Stock Exchange. It's no longer there.
>>
>>x64 is more interesting.
>
>Agreed.
You can buy a dual-CPU Itanium2 system with SCSI disk drives and several gigabytes
of RAM for less than $5k, so a single Itanium2 CPU probably costs less than $8k
even in smaller quantities. But I understand that for most needs AMD64/EM64T is
the more cost-effective solution.
>...
>
>Hehe, yes - with zillions of branches you will pollute your btb and other branch
>prediction resources.
>
>What about some cmovs enabled by pogo - if profiler claims too many
>mispredictions for jxx.
>
>If you have a lot of code in your eval and elsewhere, where you simply do some
>conditional add or sub - and the condition is rather random and therefore likely
>to produce some mispredictions with these very short bodies - you are free to use
>the boolean*int multiplication on C level. For less/greater compares you are
>able to convert the compare expression to (arithmeticExpression < 0) and to do an
>arithmetic right shift of arithmeticExpression by (sizeof(int)*8-1), to get
>a {-1,0}-range - a mask that applies the boolean multiplication by bitwise and.
>
>From the initial conditional statement ...
>
>if ( distance1 < distance2 )
> eval += bonus;
>
>... via < 0 ...
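Oops, here is a problem. The transformation is not legal for some input values:
distance1 - distance2 can overflow, and then its sign no longer matches the
result of the comparison. The programmer can do it if the range of values is
known not to cause overflow during subtraction. The compiler (usually) cannot be
sure, so we cannot do this transformation.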
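A quick sketch of the failing case, assuming 32-bit int. The function names are made up for illustration, and the subtraction is done in unsigned arithmetic to model what the sub instruction does, since signed overflow itself is undefined in C:

```c
#include <stdint.h>

/* Reference: the original comparison. */
static int less_cmp(int32_t a, int32_t b) {
    return a < b;
}

/* Transformed test (a - b < 0). The subtraction is performed on unsigned
   values so that it wraps modulo 2^32, as a 32-bit sub instruction does;
   the sign bit of the wrapped difference is then inspected. */
static int less_sub(int32_t a, int32_t b) {
    uint32_t d = (uint32_t)a - (uint32_t)b;
    return (d >> 31) != 0;
}
```

For small values both agree. For a = INT32_MIN and b = 1 the comparison is true, but the wrapped difference is positive, so the transformed test says false.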
Thanks,
Eugene
>if ( distance1 - distance2 < 0 )
> eval += bonus;
>
>... to the final sub(lea)-sar-and expression:
>
>eval += ( (distance1 - distance2) >> 31 ) & bonus;
>
>That's really cheap, if P4 doesn't care.
>It requires only one extra register - while cmov requires two.
>If one argument is a constant, while the other is scaled by 2,4,8 it is
>still one lea instruction to produce the difference in a register.
>(Otherwise I prefer mov-add/sub because lea is two cycles on AMD64).
>
>Of course it is weird to sacrifice readability to avoid branches!
>So for better readability and maintainability i suggest some macros or inliners:
>
>__forceinline
>int ifAlessBthenCelse0(int a, int b, int c) { return ((a-b)>>31) & c; }
>
>eval += ifAlessBthenCelse0(distance1, distance2, bonus);-)
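A possible variant of the helper above that does not depend on the subtraction staying in range. This is only a sketch: it swaps the sub-sar sequence for a negated comparison result, the name if_less_then_else0 is hypothetical, and whether the compiler emits setcc/neg, cmov, or a branch for it is up to the compiler:

```c
#include <stdint.h>

/* Returns c if a < b, else 0, with no branch in the source.
   (a < b) is 0 or 1, so its negation is 0 or -1 (all bits set),
   which is the {-1,0} mask; no subtraction of a and b is involved,
   so there is no overflow case. */
static inline int32_t if_less_then_else0(int32_t a, int32_t b, int32_t c) {
    int32_t mask = -(int32_t)(a < b);
    return mask & c;
}
```

Usage would be the same as before, e.g. eval += if_less_then_else0(distance1, distance2, bonus);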
>
>Take care not to accidentally violate the US patent by Sun!
>The "Apparatus for directing a parallel processing computing device to form an
>absolute value of a signed value":
>
>abs(x) ::= ( x ^ (x>>31) ) - (x>>31)
>
>The {-1,0}-mask is used here to build the one's complement if negative, while
>subtract minus one does the two's complement.
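The mask formula can be written out in C roughly as follows. This is a sketch under two assumptions: int is 32 bits, and right-shifting a negative signed value is an arithmetic shift (implementation-defined in C, though common compilers do it). The name abs_mask is hypothetical, and the result is returned as unsigned so that INT32_MIN does not overflow:

```c
#include <stdint.h>

/* Absolute value via the {-1,0} sign mask.
   m is 0 for x >= 0 and 0xFFFFFFFF for x < 0 (assuming arithmetic shift).
   x ^ m is x or its one's complement; subtracting m (i.e. adding 1 when
   x is negative) completes the two's complement. The xor and sub are
   done on unsigned values, so they wrap instead of overflowing. */
static inline uint32_t abs_mask(int32_t x) {
    uint32_t m = (uint32_t)(x >> 31);
    return ((uint32_t)x ^ m) - m;
}
```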
>
>http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150
>
>Maybe Intel's weird shift implementation in P4 was the revenge ;-)
>
>What about forcing cdq for x86 in C by casting to long long or __int64 and then
>taking the high dword where the sign extension occurred as the mask? Does that
>violate the patent?
>
>int abs(int x) {
> int m = (int)( (unsigned __int64)((__int64)x) >> 32 );
> return (x ^ m) - m;
>}
>
>and the generated assembly:
>
>?abs@@YIHH@Z PROC NEAR ; abs, COMDAT
>; _x$ = ecx
> 00000 8b c1 mov eax, ecx
> 00002 99 cdq
> 00003 8b c2 mov eax, edx
> 00005 33 c1 xor eax, ecx
> 00007 2b c2 sub eax, edx
> 00009 c3 ret 0
>?abs@@YIHH@Z ENDP ; abs
>
>Or what about this?
>
>int abs(int x) {return (x ^ -(x < 0)) + (x < 0);}
>
>?abs@@YIHH@Z PROC NEAR ; abs, COMDAT
>; _x$ = ecx
> 00000 33 d2 xor edx, edx
> 00002 85 c9 test ecx, ecx
> 00004 0f 9c c2 setl dl
> 00007 8b c2 mov eax, edx
> 00009 f7 d8 neg eax
> 0000b 33 c1 xor eax, ecx
> 0000d 03 c2 add eax, edx
> 0000f c3 ret 0
>?abs@@YIHH@Z ENDP ; abs
>
>Since I don't own a compiler producing cmovs - I unfortunately only played with
>cmov in inline assembly for max, min, abs (abs is rarely used in my
>program), with the drawback of passing arguments already in registers via the
>stack - as you may imagine, it was considerably slower.
>
>As you may know - conditional write is also a nice trick to avoid branches -
>especially with amd64 write combining. For example, a conditional index
>increment for a target array. Agner Fog's hint otoh with rep movs (memcpy?) and
>ecx one or zero doesn't look so promising for amd64.
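The conditional index increment might look something like this. It is only a sketch of the idea; the function keep_below and its parameters are invented for illustration and are not from anyone's engine:

```c
#include <stddef.h>

/* Copies the elements of src that are below limit into dst and returns
   how many were kept. Every element is stored unconditionally; only the
   advance of the write index depends on the comparison, so the loop body
   has no data-dependent branch in the source. A rejected element is
   simply overwritten by the next store. dst must have room for n
   elements. */
static size_t keep_below(const int *src, size_t n, int limit, int *dst) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        dst[count] = src[i];          /* unconditional store */
        count += (src[i] < limit);    /* index advances by 0 or 1 */
    }
    return count;
}
```

Note that the slots of dst at and beyond the returned count may hold leftover values from rejected elements.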
>
>Cheers,
>Gerd
>
>
>>
>>Of course mainly at AMD, as it's easier to get a quad opteron for a tournament,
>>than it is to get a quad Xeon. I fear that'll be the case in 2007 too.
>>
>>However, those opteron chips are there now and a compiler generating fast x64
>>code is not there. Simply because there is only a microsoft compiler and
>>microsofts nickname here is wintel.
>>
>>As this conditional move is fast for the Israeli processor line and Xeon group
>>will release such a pentium-m dual core xeon doing 4 instructions a cycle end of
>>2006, not to confuse with the dual core p4 xeon that releases start of 2006 on
>>paper; does this mean by 2007 'suddenly' the microsoft compiler can do
>>conditional moves at x64 by 2007 somewhere?
>>
>>Or is the same problem there at pentium-m with their medium sized L1 cache (only
>>32KB data) and probably inferior L2 cache compared to AMD.
>>
>>Of course from multiprocessing viewpoint, sharing that L2 cache is a bad thing
>>for pentium-m. dual core opteron isn't doing that of course. So scaling at
>>opteron should be much better for crafty, diep, zappa and the baron. Basically
>>the dual core intel we should see as a single core with improved hyperthreading.
>>Perhaps even scales 70%+.
>>
>>However the raw speed of a single core xeon should be way faster, so total speed
>>at the cpu should be significantly faster than dual core opteron.
>>
>>Vincent
>>
>>>Thanks,
>>>Eugene
>>>
>>>>>
>>>>>So I assume intel EM64T became slower and as a result of that it was abandoned?
>>>>>
>>>>>Vincent
>>>>>
>>>>>>I suspect there are several reasons for this:
>>>>>>* branch predictors are good, and the majority of branches can be correctly
>>>>>>predicted
>>>>>>* CMOV is a long instruction; a short branch is shorter, so a program with
>>>>>>fewer CMOVs fits better into cache
>>>>>>* there is no 8-bit form of CMOV
>>>>>>* CMOV has no "CMOV reg, immediate" form; if you need it you first have to load
>>>>>>the immediate into a register, thus executing more instructions and increasing
>>>>>>register pressure -- a serious problem on x86
>>>>>>* for an invalid address "CMOV reg, memory" will give you an access violation
>>>>>>even if the condition is false.
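To illustrate the last point at the C level: a guarded load such as p ? *p : fallback cannot simply be turned into an unconditional load followed by a select, because the load itself may fault. One way around it, sketched here with a hypothetical helper load_or, is to do the select on the address and then load from whichever address was chosen:

```c
#include <stddef.h>

/* Yields *p when p is non-null, otherwise fallback.
   The conditional picks between two addresses that are both valid to
   read, so the single load that follows can never touch bad memory.
   Whether the address select becomes a cmov or a branch is the
   compiler's decision. */
static int load_or(const int *p, int fallback) {
    const int *q = p ? p : &fallback;
    return *q;
}
```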
>>>>>>
>>>>>>Thanks,
>>>>>>Eugene
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.