Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: some quotes on switch and indirect branches

Author: Gerd Isenberg
Date: 11:41:16 11/24/05
On November 24, 2005 at 07:50:47, Vincent Diepeveen wrote:

>On November 23, 2005 at 22:34:13, Eugene Nalimov wrote:
>
>>On November 23, 2005 at 21:32:18, Robert Hyatt wrote:
>>
>>>On November 23, 2005 at 15:51:01, Vincent Diepeveen wrote:
>>>
>>>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>>>
>>>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>>>
>>>>>>[...]
>>>>>>I guess, you mean this as a substitution for
>>>>>>  if (depth < 0)
>>>>>>    fm = fm1;
>>>>>>  else
>>>>>>    fm = fm2;
>>>>>>
>>>>>>I am surprised, that compilers are not able to do this themselves. I
>>>>>
>>>>>I several times tried to modify Visual C to recognize additional cases where we
>>>>>should emit conditional moves (last time was probably a year ago for
>>>>>x64-targeting compiler). Every time I could demonstrate win on a small
>>>>>artificial test case, but every large real world program either showed no gain
>>>>>or slowed down.
>>>>
>>>>At *which* processor was it slower. AMD or EM64T?
>>>>
>>>>AMD has quite a big L1 cache and has instruction cache in L2 if i understand
>>>>correctly. That should make larger code sizes no problem.
>>>
>>>Last I saw, AMD64 had a unified L2.  typical split L1.  Have not seen a split L2
>>>machine that I am aware of (although one could exist...)
>>
>>New Itanium that will be released next year (should be this quarter, but
>>slipped) has 1Mb of L2 I-cache in addition to 256Kb of L2 D-cache.
>
>A sunken itanium ship with a price of like 8000 dollar a chip, for a chip not
>yet there, but only if you buy a 1000 of them, otherwise it'll be more like
>20000 dollar a chip, is not a good compare. Despite intel giving masses of those
>chips for free to SGI, SGI despite all that still has been removed from the New
>York Stock Exchange. It's no longer there.
>
>x64 is more interesting.

Agreed.


>
>Allows Gerd for example to do postings with assembly code. I'll have to see Gerd
>ever write assembly for IA64. It's very complex to write assembly for IA64.


Exactly - i only once saw some generated IA64 assembly posted here by Eugene,
where he explaned how IA64 did some speculative branch execution - iirc.


>
>Eugene, you mentionned x64 and you'll never ever in your life will make a typing
>mistake meaning IA64 when writing down x64.
>
>So i conclude you tested only for intel hardware the conditional move
>instructions and not so very careful for AMD, and that conditional moves were
>thrown out because they were slow on intel and *not* because they were slow on
>AMD.
>
>On paper it's 1 cycle direct path (practical 2) at AMD, versus practical 7 for
>intel.
>
>Now in his bitboard thing, Gerd perhaps doesn't have the right data on his hands
>to execute conditional moves. In DIEP i have. And i have a zillion branches in
>diep and code size is already huge. So removing branches has priority.


Hehe, yes - with zillions branches you will pollute your btb and other branch
prediction ressources.

What about some cmovs enabled by pogo - if profiler claims too many
mispredictions for jxx.

If you have a lot of code in your eval and elsewhere, where you simply do some
conditional add or sub - and the condition is rather random and therefor likely
to produce some mispredictions with this very short bodies - you are free to use
the boolean*int multiplication on C level. For Less/greater compares you are
able to convert the compare expression to (arithmeticExpression < 0) and to do a
shift arithmetic right with arithmeticExpression and (sizeof(int)*8-1),  to get
a {-1,0}-range - a mask to apply the boolean multiplication by bitwise and.

From the initial conditional statement ...

if ( distance1 < distance2 )
   eval += bonus;

... via < 0 ...

if ( distance1 - distance2 < 0 )
   eval += bonus;

... to the final sub(lea)-sar-and expression:

eval += ( (distance1 - distance2) >> 31 ) & bonus;

That's really cheap, if P4 don't cares.
It requires only one extra register - while cmov requires two.
If one argument is a constant, while the other is scaled by 2,4,8 it is
still one lea instruction to produce the difference in a register.
(Otherwise i prefere mov-add/sub because lea is two cycles on AMD64).

Of course it is weird to sacrifice readability to avoid branches!
So for better readability and maintainability i suggest some macros or inliners:

__forceinline
int ifAlessBthenCelse0(int a, int b, int c) { ((a-b)>>31) & c; }

eval += ifAlessBthenCelse0(distance1, distance2, bonus);-)

Take care not accidently to violate the us. patent by sun!
The "Apparatus for directing a parallel processing computing device to form an
absolute value of a signed value":

abs(x) ::= ( x ^ (x>>31) ) - (x>>31)

The {-1,0}-mask is used here to build the one's complement if negative, while
subtract minus one does the two's complement.

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150

May be intel's weird shift implementation in P4 was the revenge ;-)

What about to force cdq for x86 in C by casting to long long or __int64 and then
take the high dword where the sign extension occured as the mask? Does that
violate the patent?

int abs(int x) {
  int m = (int)( (unsigned __int64)((__int64)x) >> 32 );
  return (x ^ m) - m;
}

and the generated assembly:

?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
; _x$ = ecx
  00000	8b c1		 mov	 eax, ecx
  00002	99		 cdq
  00003	8b c2		 mov	 eax, edx
  00005	33 c1		 xor	 eax, ecx
  00007	2b c2		 sub	 eax, edx
  00009	c3		 ret	 0
?abs@@YIHH@Z ENDP			; abs

Or what about this?

int abs(int x) {return (x ^ -(x < 0)) + (x < 0);}

?abs@@YIHH@Z PROC NEAR			; abs, COMDAT
; _x$ = ecx
  00000	33 d2		 xor	 edx, edx
  00002	85 c9		 test	 ecx, ecx
  00004	0f 9c c2	 setl	 dl
  00007	8b c2		 mov	 eax, edx
  00009	f7 d8		 neg	 eax
  0000b	33 c1		 xor	 eax, ecx
  0000d	03 c2		 add	 eax, edx
  0000f	c3		 ret	 0
?abs@@YIHH@Z ENDP			; abs

Since i don't own a compiler producing cmovs - i unfortunately only played with
cmov in inlined inline assembly for max, min, abs (abs is rarely used in my
program) and the drawbacks with passing arguments already in registers via stack
- as you may imagine, it was considerable slower.

As you may know - conditional write is also a nice trick to avoid branches -
specially with amd64 write combining. Conditional index increment for a target
array. Agner Fog's hint otoh with rep movs (memcpy?) and ecx one or zero don't
looks so promising for amd64.

Cheers,
Gerd


>
>Of course mainly at AMD, as it's easier to get a quad opteron for a tournament,
>than it is to get a quad Xeon. I fear that'll be the case in 2007 too.
>
>However, those opteron chips are there now and a compiler generating fast x64
>code is not there. Simply because there is only a microsoft compiler and
>microsofts nickname here is wintel.
>
>As this conditional move is fast for the Israeli processor line and Xeon group
>will release such a pentium-m dual core xeon doing 4 instructions a cycle end of
>2006, not to confuse with the dual core p4 xeon that releases start of 2006 on
>paper; does this mean by 2007 'suddenly' the microsoft compiler can do
>conditional moves at x64 by 2007 somewhere?
>
>Or is the same problem there at pentium-m with their medium sized L1 cache (only
>32KB data) and probably inferior L2 cache compared to AMD.
>
>Of course from multiprocessing viewpoint, sharing that L2 cache is a bad thing
>for pentium-m. dual core opteron isn't doing that of course. So scaling at
>opteron should be much better for crafty, diep, zappa and the baron. Basically
>the dual core intel we should see as a single core with improved hyperthreading.
>Perhaps even scales 70%+.
>
>However the raw speed of a single core xeon should be way faster, so total speed
>at the cpu should be significant faster than dual core opteron.
>
>Vincent
>
>>Thanks,
>>Eugene
>>
>>>>
>>>>So i assume intel EM64T became slower and as a result of that it was abandonned?
>>>>
>>>>Vincent
>>>>
>>>>>I suspect there are several reasons for this:
>>>>>* branch predictors are good, and majority of branches can be correctly
>>>>>predicted
>>>>>* CMOV is long instruction; short branch is shorter, so program with less CMOVs
>>>>>fits better into cache
>>>>>* there is no 8-bit form of CMOV
>>>>>* CMOV has no "CMOV reg, immediate" form; if you need it you first have to load
>>>>>immediate into register, this executing more instructions and increasing
>>>>>register pressure -- serious problem on x86
>>>>>* for invalid address "CMOV reg, memory" will give you access violation even if
>>>>>condition is false.
>>>>>
>>>>>Thanks,
>>>>>Eugene
Re: some quotes on switch and indirect branches Vincent Diepeveen 07:48:43 11/25/05
- Re: some quotes on switch and indirect branches Gerd Isenberg 12:03:50 11/25/05
Re: some quotes on switch and indirect branches Eugene Nalimov 13:16:40 11/24/05
- Re: some quotes on switch and indirect branches Vincent Diepeveen 07:26:10 11/25/05
- Re: some quotes on switch and indirect branches Gerd Isenberg 13:35:17 11/24/05
Re: some quotes on switch and indirect branches Gerd Isenberg 12:50:58 11/24/05
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.