Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: some quotes on switch and indirect branches

Author: Vincent Diepeveen

Date: 07:48:43 11/25/05

Go up one level in this thread


On November 24, 2005 at 14:41:16, Gerd Isenberg wrote:

>On November 24, 2005 at 07:50:47, Vincent Diepeveen wrote:
>
>>On November 23, 2005 at 22:34:13, Eugene Nalimov wrote:
>>
>>>On November 23, 2005 at 21:32:18, Robert Hyatt wrote:
>>>
>>>>On November 23, 2005 at 15:51:01, Vincent Diepeveen wrote:
>>>>
>>>>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>>>>
>>>>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>>>>
>>>>>>>[...]
>>>>>>>I guess, you mean this as a substitution for
>>>>>>>  if (depth < 0)
>>>>>>>    fm = fm1;
>>>>>>>  else
>>>>>>>    fm = fm2;
>>>>>>>
>>>>>>>I am surprised, that compilers are not able to do this themselves. I
>>>>>>
>>>>>>I several times tried to modify Visual C to recognize additional cases where we
>>>>>>should emit conditional moves (last time was probably a year ago for
>>>>>>x64-targeting compiler). Every time I could demonstrate win on a small
>>>>>>artificial test case, but every large real world program either showed no gain
>>>>>>or slowed down.
>>>>>
>>>>>At *which* processor was it slower. AMD or EM64T?
>>>>>
>>>>>AMD has quite a big L1 cache and has instruction cache in L2 if i understand
>>>>>correctly. That should make larger code sizes no problem.
>>>>
>>>>Last I saw, AMD64 had a unified L2.  typical split L1.  Have not seen a split L2
>>>>machine that I am aware of (although one could exist...)
>>>
>>>New Itanium that will be released next year (should be this quarter, but
>>>slipped) has 1Mb of L2 I-cache in addition to 256Kb of L2 D-cache.
>>
>>A sunken itanium ship with a price of like 8000 dollar a chip, for a chip not
>>yet there, but only if you buy a 1000 of them, otherwise it'll be more like
>>20000 dollar a chip, is not a good compare. Despite intel giving masses of those
>>chips for free to SGI, SGI despite all that still has been removed from the New
>>York Stock Exchange. It's no longer there.
>>
>>x64 is more interesting.
>
>Agreed.
>
>
>>
>>Allows Gerd for example to do postings with assembly code. I'll have to see Gerd
>>ever write assembly for IA64. It's very complex to write assembly for IA64.
>
>
>Exactly - i only once saw some generated IA64 assembly posted here by Eugene,
>where he explaned how IA64 did some speculative branch execution - iirc.
>
>
>>
>>Eugene, you mentionned x64 and you'll never ever in your life will make a typing
>>mistake meaning IA64 when writing down x64.
>>
>>So i conclude you tested only for intel hardware the conditional move
>>instructions and not so very careful for AMD, and that conditional moves were
>>thrown out because they were slow on intel and *not* because they were slow on
>>AMD.
>>
>>On paper it's 1 cycle direct path (practical 2) at AMD, versus practical 7 for
>>intel.
>>
>>Now in his bitboard thing, Gerd perhaps doesn't have the right data on his hands
>>to execute conditional moves. In DIEP i have. And i have a zillion branches in
>>diep and code size is already huge. So removing branches has priority.
>
>
>Hehe, yes - with zillions branches you will pollute your btb and other branch
>prediction ressources.
>
>What about some cmovs enabled by pogo - if profiler claims too many
>mispredictions for jxx.
>
>If you have a lot of code in your eval and elsewhere, where you simply do some
>conditional add or sub - and the condition is rather random and therefor likely
>to produce some mispredictions with this very short bodies - you are free to use
>the boolean*int multiplication on C level. For Less/greater compares you are
>able to convert the compare expression to (arithmeticExpression < 0) and to do a
>shift arithmetic right with arithmeticExpression and (sizeof(int)*8-1),  to get
>a {-1,0}-range - a mask to apply the boolean multiplication by bitwise and.
>
>From the initial conditional statement ...
>
>if ( distance1 < distance2 )
>   eval += bonus;
>
>... via < 0 ...
>
>if ( distance1 - distance2 < 0 )
>   eval += bonus;
>
>... to the final sub(lea)-sar-and expression:
>
>eval += ( (distance1 - distance2) >> 31 ) & bonus;

Here and there i'm actually using that already. Very good
that you mention it again, Gerd.

I should use it perhaps more. Didn't systematically use it yet. Always
considered compilers to solve my problem.

Perhaps i should have less trust in Nalimov's CMOV implementation getting in the
main source code tree for visual x64 2006 :)

>
>That's really cheap, if P4 don't cares.

Forget Prescott.

>It requires only one extra register - while cmov requires two.
>If one argument is a constant, while the other is scaled by 2,4,8 it is
>still one lea instruction to produce the difference in a register.
>(Otherwise i prefere mov-add/sub because lea is two cycles on AMD64).
>
>Of course it is weird to sacrifice readability to avoid branches!

That's one of the main concerns in my code and one of the most important reason
why i have so many branches.

Like if you have code like this (randomly written down example):

 if( quickboard[sq_e3] == brook && quickboard[sq_e4] == bpawn
  && quickboard[sq_e5] == bking )
   s += 5;

You can of course consider doing this:

  int orall = ((quickboard[sq_e3]-brook)
               |(quickboard[sq_e4]-bpawn)
               |(quickboard[sq_e5]-bking)
              );

  if( !orall ) s += 5;

Generating a CMOV by that.

Avoiding a near sure misprediction somewhere.
Like branch 1 might have normally spoken a prediction ratio of 85% to get
not taken, still that means 15% gets misprediction rate which hurts *badly*.

p.s. how to apply a shifting trick to the above?

>So for better readability and maintainability i suggest some macros or inliners:

>__forceinline
>int ifAlessBthenCelse0(int a, int b, int c) { ((a-b)>>31) & c; }
>
>eval += ifAlessBthenCelse0(distance1, distance2, bonus);-)

bah, putting for once comments with your code would be more helpful :)

>Take care not accidently to violate the us. patent by sun!

:)

>The "Apparatus for directing a parallel processing computing device to form an
>absolute value of a signed value":
>
>abs(x) ::= ( x ^ (x>>31) ) - (x>>31)

#ifdef INTEGERIS32BITS
abs(x) :: = x & 0x7fffffff; // 32 bits int
#else
..

>
>The {-1,0}-mask is used here to build the one's complement if negative, while
>subtract minus one does the two's complement.
>
>http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150
>
>May be intel's weird shift implementation in P4 was the revenge ;-)

:)

>What about to force cdq for x86 in C by casting to long long or __int64 and then
>take the high dword where the sign extension occured as the mask? Does that
>violate the patent?
>
>int abs(int x) {
>  int m = (int)( (unsigned __int64)((__int64)x) >> 32 );
>  return (x ^ m) - m;
>}

honestely, 64 bits is so slow and all those quads have server2003 installed
which is 32 bits, so having a program that's 32 bits code is really advised.

then you can show up quad every tournament...

>and the generated assembly:
>
>?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
>; _x$ = ecx
>  00000	8b c1		 mov	 eax, ecx
>  00002	99		 cdq
>  00003	8b c2		 mov	 eax, edx
>  00005	33 c1		 xor	 eax, ecx
>  00007	2b c2		 sub	 eax, edx
>  00009	c3		 ret	 0
>?abs@@YIHH@Z ENDP			; abs
>
>Or what about this?
>
>int abs(int x) {return (x ^ -(x < 0)) + (x < 0);}
>
>?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
>; _x$ = ecx
>  00000	33 d2		 xor	 edx, edx
>  00002	85 c9		 test	 ecx, ecx
>  00004	0f 9c c2	 setl	 dl
>  00007	8b c2		 mov	 eax, edx
>  00009	f7 d8		 neg	 eax
>  0000b	33 c1		 xor	 eax, ecx
>  0000d	03 c2		 add	 eax, edx
>  0000f	c3		 ret	 0
>?abs@@YIHH@Z ENDP			; abs
>
>Since i don't own a compiler producing cmovs - i unfortunately only played with
>cmov in inlined inline assembly for max, min, abs (abs is rarely used in my
>program) and the drawbacks with passing arguments already in registers via stack
>- as you may imagine, it was considerable slower.
>
>As you may know - conditional write is also a nice trick to avoid branches -
>specially with amd64 write combining. Conditional index increment for a target
>array. Agner Fog's hint otoh with rep movs (memcpy?) and ecx one or zero don't
>looks so promising for amd64.
>
>Cheers,
>Gerd
>
>
>>
>>Of course mainly at AMD, as it's easier to get a quad opteron for a tournament,
>>than it is to get a quad Xeon. I fear that'll be the case in 2007 too.
>>
>>However, those opteron chips are there now and a compiler generating fast x64
>>code is not there. Simply because there is only a microsoft compiler and
>>microsofts nickname here is wintel.
>>
>>As this conditional move is fast for the Israeli processor line and Xeon group
>>will release such a pentium-m dual core xeon doing 4 instructions a cycle end of
>>2006, not to confuse with the dual core p4 xeon that releases start of 2006 on
>>paper; does this mean by 2007 'suddenly' the microsoft compiler can do
>>conditional moves at x64 by 2007 somewhere?
>>
>>Or is the same problem there at pentium-m with their medium sized L1 cache (only
>>32KB data) and probably inferior L2 cache compared to AMD.
>>
>>Of course from multiprocessing viewpoint, sharing that L2 cache is a bad thing
>>for pentium-m. dual core opteron isn't doing that of course. So scaling at
>>opteron should be much better for crafty, diep, zappa and the baron. Basically
>>the dual core intel we should see as a single core with improved hyperthreading.
>>Perhaps even scales 70%+.
>>
>>However the raw speed of a single core xeon should be way faster, so total speed
>>at the cpu should be significant faster than dual core opteron.
>>
>>Vincent
>>
>>>Thanks,
>>>Eugene
>>>
>>>>>
>>>>>So i assume intel EM64T became slower and as a result of that it was abandonned?
>>>>>
>>>>>Vincent
>>>>>
>>>>>>I suspect there are several reasons for this:
>>>>>>* branch predictors are good, and majority of branches can be correctly
>>>>>>predicted
>>>>>>* CMOV is long instruction; short branch is shorter, so program with less CMOVs
>>>>>>fits better into cache
>>>>>>* there is no 8-bit form of CMOV
>>>>>>* CMOV has no "CMOV reg, immediate" form; if you need it you first have to load
>>>>>>immediate into register, this executing more instructions and increasing
>>>>>>register pressure -- serious problem on x86
>>>>>>* for invalid address "CMOV reg, memory" will give you access violation even if
>>>>>>condition is false.
>>>>>>
>>>>>>Thanks,
>>>>>>Eugene



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.