Computer Chess Club Archives


Subject: Re: some quotes on switch and indirect branches

Author: Vincent Diepeveen

Date: 07:26:10 11/25/05


On November 24, 2005 at 16:16:40, Eugene Nalimov wrote:

>On November 24, 2005 at 14:41:16, Gerd Isenberg wrote:
>
>>...
>>>A sunken itanium ship with a price of like 8000 dollar a chip, for a chip not
>>>yet there, but only if you buy a 1000 of them, otherwise it'll be more like
>>>20000 dollar a chip, is not a good comparison. Despite intel giving masses of those
>>>chips for free to SGI, SGI despite all that still has been removed from the New
>>>York Stock Exchange. It's no longer there.
>>>
>>>x64 is more interesting.
>>
>>Agreed.
>
>You can buy a dual-CPU Itanium2 system with SCSI disk drives and several gigabytes
>of RAM for less than $5k, so a single Itanium2 CPU probably costs less than $8k
>even in smaller quantities. But I understand that for most needs AMD64/EM64T is
>the more cost-effective solution.

That'll be some outdated 0.8GHz system or so. An X2 is 2.4GHz, under $1000 and 2
times faster than that.

Excuse me, what is interesting to toy with right *now* is a quad montecito 2.4GHz,
as promised for years already by intel.

A quad montecito dual core 2.0GHz will do too.
But release it at the end of 2006 for a price of, i bet, around $100k?
Who needs it then?

In order to get diep to run fast on itanium, it needs 24 hours of pgo. Without
that pgo with intel c++, itanium2 is slow like a turtle from an ipc viewpoint.
It's similar to why i'm not using visual c++ 2005 by default. By default it's
25% slower than net2003, as i have no time for pgo each time i change something tiny.

Vincent

>>...
>>
>>Hehe, yes - with zillions of branches you will pollute your btb and other branch
>>prediction resources.
>>
>>What about some cmovs enabled by pogo - if profiler claims too many
>>mispredictions for jxx.
>>
>>If you have a lot of code in your eval and elsewhere, where you simply do some
>>conditional add or sub - and the condition is rather random and therefore likely
>>to produce some mispredictions with these very short bodies - you are free to use
>>the boolean*int multiplication on C level. For less/greater compares you are
>>able to convert the compare expression to (arithmeticExpression < 0) and to do a
>>shift arithmetic right of arithmeticExpression by (sizeof(int)*8-1), to get
>>a {-1,0}-range - a mask to apply the boolean multiplication by bitwise and.
>>
>>From the initial conditional statement ...
>>
>>if ( distance1 < distance2 )
>>   eval += bonus;
>>
>>... via < 0 ...
>
>Oops, here is a problem. The transformation is not legal for some input values.
>The programmer can do it if the range of values is known not to cause overflow
>during subtraction. The compiler (usually) cannot be sure, so we cannot do this
>transformation.
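
The overflow case is easy to show with a small sketch. The function names below are made up for illustration. The first function is the sar-and form, with the subtraction done in unsigned arithmetic so the wraparound is well defined; the second widens the difference to 64 bits so it cannot overflow. Both assume two's complement and an arithmetic right shift on negative values, which is what mainstream compilers provide but which C leaves implementation-defined.

```c
#include <stdint.h>

/* Mask form from the thread. The subtraction wraps like a 32-bit register
   would; converting back to int32_t and shifting a negative value right are
   implementation-defined, assumed here to behave as on common x86 compilers. */
static int32_t bonus_if_less_mask(int32_t a, int32_t b, int32_t bonus)
{
    int32_t diff = (int32_t)((uint32_t)a - (uint32_t)b);
    return (diff >> 31) & bonus;   /* mask is -1 if diff < 0, else 0 */
}

/* Same idea with the difference widened to 64 bits, so it cannot overflow. */
static int32_t bonus_if_less_wide(int32_t a, int32_t b, int32_t bonus)
{
    int64_t diff = (int64_t)a - (int64_t)b;
    return (int32_t)(diff >> 63) & bonus;
}
```

For small values such as distances on a chess board both agree with the plain compare. With a = INT32_MIN and b = 1 the 32-bit difference wraps to a positive number, so the mask form drops the bonus although a < b, while the widened form still adds it.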

In code where i see a CMOV might be useful i'm always doing things like this

 .... {
   int fastbranchok=color[sq_d4];

   if( fastbranchok )
     s += 3;
}

GCC recognizes that.
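
A self-contained version of that pattern, as a sketch: the board array, the square index and the function name are hypothetical stand-ins, not code from diep. Whether a given compiler turns the conditional add into a cmov depends on its version, target and flags, so it is worth checking the output of something like gcc -O2 -S.

```c
/* Hypothetical stand-ins for the board data used in the snippet above. */
enum { sq_d4 = 27 };
static int color[64];

/* Load the condition into a local first, then do the tiny conditional add. */
static int score_d4(int s)
{
    int fastbranchok = color[sq_d4];

    if (fastbranchok)
        s += 3;
    return s;
}
```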

>Thanks,
>Eugene
>
>>if ( distance1 - distance2 < 0 )
>>   eval += bonus;
>>
>>... to the final sub(lea)-sar-and expression:
>>
>>eval += ( (distance1 - distance2) >> 31 ) & bonus;
>>
>>That's really cheap, if you don't care about P4.
>>It requires only one extra register - while cmov requires two.
>>If one argument is a constant, while the other is scaled by 2,4,8 it is
>>still one lea instruction to produce the difference in a register.
>>(Otherwise i prefer mov-add/sub because lea is two cycles on AMD64).
>>
>>Of course it is weird to sacrifice readability to avoid branches!
>>So for better readability and maintainability i suggest some macros or inliners:
>>
>>__forceinline
>>int ifAlessBthenCelse0(int a, int b, int c) { return ((a-b)>>31) & c; }
>>
>>eval += ifAlessBthenCelse0(distance1, distance2, bonus);-)
>>
>>Take care not to accidentally violate the U.S. patent by Sun!
>>The "Apparatus for directing a parallel processing computing device to form an
>>absolute value of a signed value":
>>
>>abs(x) ::= ( x ^ (x>>31) ) - (x>>31)
>>
>>The {-1,0}-mask is used here to build the one's complement if negative, while
>>subtract minus one does the two's complement.
>>
>>http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150
>>
>>Maybe intel's weird shift implementation in P4 was the revenge ;-)
>>
>>What about forcing cdq for x86 in C by casting to long long or __int64 and then
>>taking the high dword where the sign extension occurred as the mask? Does that
>>violate the patent?
>>
>>int abs(int x) {
>>  int m = (int)( (unsigned __int64)((__int64)x) >> 32 );
>>  return (x ^ m) - m;
>>}
>>
>>and the generated assembly:
>>
>>?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
>>; _x$ = ecx
>>  00000	8b c1		 mov	 eax, ecx
>>  00002	99		 cdq
>>  00003	8b c2		 mov	 eax, edx
>>  00005	33 c1		 xor	 eax, ecx
>>  00007	2b c2		 sub	 eax, edx
>>  00009	c3		 ret	 0
>>?abs@@YIHH@Z ENDP			; abs
>>
>>Or what about this?
>>
>>int abs(int x) {return (x ^ -(x < 0)) + (x < 0);}
>>
>>?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
>>; _x$ = ecx
>>  00000	33 d2		 xor	 edx, edx
>>  00002	85 c9		 test	 ecx, ecx
>>  00004	0f 9c c2	 setl	 dl
>>  00007	8b c2		 mov	 eax, edx
>>  00009	f7 d8		 neg	 eax
>>  0000b	33 c1		 xor	 eax, ecx
>>  0000d	03 c2		 add	 eax, edx
>>  0000f	c3		 ret	 0
>>?abs@@YIHH@Z ENDP			; abs
>>
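The three branchless abs variants quoted here can be written side by side in portable C types and checked against each other. This is only a sketch: the function names are invented, the __int64 casts are replaced by the stdint.h types, and an arithmetic right shift on negative values is assumed. INT32_MIN is left out on purpose, since its absolute value is not representable and the final subtraction or addition would overflow.

```c
#include <stdint.h>

/* Shift-mask form: m is -1 for negative x, 0 otherwise. */
static int32_t abs_sar(int32_t x)
{
    int32_t m = x >> 31;
    return (x ^ m) - m;
}

/* Sign-extend to 64 bits and take the high dword as the mask. */
static int32_t abs_high_dword(int32_t x)
{
    int32_t m = (int32_t)((uint64_t)(int64_t)x >> 32);
    return (x ^ m) - m;
}

/* Comparison form: (x < 0) is 1 or 0, and its negation is the mask. */
static int32_t abs_cmp(int32_t x)
{
    return (x ^ -(x < 0)) + (x < 0);
}
```

All three flip the bits of a negative x with the mask and then add one, which is the two's complement negation described above.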
>>Since i don't own a compiler producing cmovs - i unfortunately only played with
>>cmov in inline assembly for max, min, abs (abs is rarely used in my
>>program), with the drawback of passing arguments already in registers via the stack
>>- as you may imagine, it was considerably slower.
>>
>>As you may know - conditional write is also a nice trick to avoid branches -
>>especially with amd64 write combining: a conditional index increment for a target
>>array. Agner Fog's hint otoh with rep movs (memcpy?) and ecx one or zero doesn't
>>look so promising for amd64.
>>
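The conditional index increment can be sketched like this. The function name and parameters are invented for illustration. The store into the target array always happens; only the index advance depends on the compare, so there is no branch on the data itself. The target array needs room for n elements, because a rejected value is still written to the next free slot before being overwritten.

```c
#include <stddef.h>

/* Copy every element below `limit` to `out` and return how many were kept.
   `out` must have room for n elements. */
static size_t keep_below(const int *in, size_t n, int limit, int *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        out[count] = in[i];           /* unconditional write */
        count += (in[i] < limit);     /* advance by 1 or 0 */
    }
    return count;
}
```

The same shape fits a move generator or a capture list, where the filter condition is close to random and a compare-and-jump would mispredict often.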
>>Cheers,
>>Gerd
>>
>>
>>>
>>>Of course mainly at AMD, as it's easier to get a quad opteron for a tournament,
>>>than it is to get a quad Xeon. I fear that'll be the case in 2007 too.
>>>
>>>However, those opteron chips are there now and a compiler generating fast x64
>>>code is not there. Simply because there is only a microsoft compiler and
>>>microsofts nickname here is wintel.
>>>
>>>As this conditional move is fast for the Israeli processor line, and the Xeon group
>>>will release such a pentium-m dual core xeon doing 4 instructions a cycle at the end of
>>>2006 (not to be confused with the dual core p4 xeon that is released at the start of
>>>2006 on paper), does this mean that somewhere in 2007 the microsoft compiler can
>>>'suddenly' do conditional moves on x64?
>>>
>>>Or is the same problem there at pentium-m with their medium sized L1 cache (only
>>>32KB data) and probably inferior L2 cache compared to AMD?
>>>
>>>Of course from multiprocessing viewpoint, sharing that L2 cache is a bad thing
>>>for pentium-m. dual core opteron isn't doing that of course. So scaling at
>>>opteron should be much better for crafty, diep, zappa and the baron. Basically
>>>the dual core intel we should see as a single core with improved hyperthreading.
>>>Perhaps even scales 70%+.
>>>
>>>However the raw speed of a single core xeon should be way faster, so total speed
>>>at the cpu should be significantly faster than dual core opteron.
>>>
>>>Vincent
>>>
>>>>Thanks,
>>>>Eugene
>>>>
>>>>>>
>>>>>>So i assume intel EM64T became slower and as a result of that it was abandoned?
>>>>>>
>>>>>>Vincent
>>>>>>
>>>>>>>I suspect there are several reasons for this:
>>>>>>>* branch predictors are good, and majority of branches can be correctly
>>>>>>>predicted
>>>>>>>* CMOV is a long instruction; a short branch is shorter, so a program with fewer CMOVs
>>>>>>>fits better into cache
>>>>>>>* there is no 8-bit form of CMOV
>>>>>>>* CMOV has no "CMOV reg, immediate" form; if you need it you first have to load
>>>>>>>the immediate into a register, thus executing more instructions and increasing
>>>>>>>register pressure -- a serious problem on x86
>>>>>>>* for invalid address "CMOV reg, memory" will give you access violation even if
>>>>>>>condition is false.
>>>>>>>
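The last point in that list is the reason a conditional load cannot simply be turned into "CMOV reg, memory". A small sketch with invented function names: in the first function the load only happens when the pointer is valid, so the branch has to stay. In the second the programmer selects the address first and then loads unconditionally from memory that is always valid, which leaves the compiler free to use a register-to-register conditional move for the select. Whether it actually does so is up to the compiler.

```c
/* Branchy form: *p is read only when p is non-null. */
static int load_or_keep(const int *p, int current)
{
    if (p)
        current = *p;
    return current;
}

/* Select the address first, then load unconditionally.
   The fallback address is the local copy of `current`, which is always valid. */
static int load_or_keep_select(const int *p, int current)
{
    const int *src = p ? p : &current;
    return *src;
}
```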
>>>>>>>Thanks,
>>>>>>>Eugene



Last modified: Thu, 15 Apr 21 08:11:13 -0700
