Author: Vincent Diepeveen
Date: 07:48:43 11/25/05
Go up one level in this thread
On November 24, 2005 at 14:41:16, Gerd Isenberg wrote:
>On November 24, 2005 at 07:50:47, Vincent Diepeveen wrote:
>
>>On November 23, 2005 at 22:34:13, Eugene Nalimov wrote:
>>
>>>On November 23, 2005 at 21:32:18, Robert Hyatt wrote:
>>>
>>>>On November 23, 2005 at 15:51:01, Vincent Diepeveen wrote:
>>>>
>>>>>On November 21, 2005 at 20:00:24, Eugene Nalimov wrote:
>>>>>
>>>>>>On November 21, 2005 at 18:10:54, Dieter Buerssner wrote:
>>>>>>
>>>>>>>[...]
>>>>>>>I guess, you mean this as a substitution for
>>>>>>> if (depth < 0)
>>>>>>> fm = fm1;
>>>>>>> else
>>>>>>> fm = fm2;
>>>>>>>
>>>>>>>I am surprised, that compilers are not able to do this themselves. I
>>>>>>
>>>>>>I several times tried to modify Visual C to recognize additional cases where we
>>>>>>should emit conditional moves (last time was probably a year ago for
>>>>>>x64-targeting compiler). Every time I could demonstrate win on a small
>>>>>>artificial test case, but every large real world program either showed no gain
>>>>>>or slowed down.
>>>>>
>>>>>At *which* processor was it slower. AMD or EM64T?
>>>>>
>>>>>AMD has quite a big L1 cache and has instruction cache in L2 if i understand
>>>>>correctly. That should make larger code sizes no problem.
>>>>
>>>>Last I saw, AMD64 had a unified L2. typical split L1. Have not seen a split L2
>>>>machine that I am aware of (although one could exist...)
>>>
>>>New Itanium that will be released next year (should be this quarter, but
>>>slipped) has 1Mb of L2 I-cache in addition to 256Kb of L2 D-cache.
>>
>>A sunken itanium ship with a price of like 8000 dollar a chip, for a chip not
>>yet there, but only if you buy a 1000 of them, otherwise it'll be more like
>>20000 dollar a chip, is not a good compare. Despite intel giving masses of those
>>chips for free to SGI, SGI despite all that still has been removed from the New
>>York Stock Exchange. It's no longer there.
>>
>>x64 is more interesting.
>
>Agreed.
>
>
>>
>>Allows Gerd for example to do postings with assembly code. I'll have to see Gerd
>>ever write assembly for IA64. It's very complex to write assembly for IA64.
>
>
>Exactly - i only once saw some generated IA64 assembly posted here by Eugene,
>where he explaned how IA64 did some speculative branch execution - iirc.
>
>
>>
>>Eugene, you mentionned x64 and you'll never ever in your life will make a typing
>>mistake meaning IA64 when writing down x64.
>>
>>So i conclude you tested only for intel hardware the conditional move
>>instructions and not so very careful for AMD, and that conditional moves were
>>thrown out because they were slow on intel and *not* because they were slow on
>>AMD.
>>
>>On paper it's 1 cycle direct path (practical 2) at AMD, versus practical 7 for
>>intel.
>>
>>Now in his bitboard thing, Gerd perhaps doesn't have the right data on his hands
>>to execute conditional moves. In DIEP i have. And i have a zillion branches in
>>diep and code size is already huge. So removing branches has priority.
>
>
>Hehe, yes - with zillions branches you will pollute your btb and other branch
>prediction ressources.
>
>What about some cmovs enabled by pogo - if profiler claims too many
>mispredictions for jxx.
>
>If you have a lot of code in your eval and elsewhere, where you simply do some
>conditional add or sub - and the condition is rather random and therefor likely
>to produce some mispredictions with this very short bodies - you are free to use
>the boolean*int multiplication on C level. For Less/greater compares you are
>able to convert the compare expression to (arithmeticExpression < 0) and to do a
>shift arithmetic right with arithmeticExpression and (sizeof(int)*8-1), to get
>a {-1,0}-range - a mask to apply the boolean multiplication by bitwise and.
>
>From the initial conditional statement ...
>
>if ( distance1 < distance2 )
> eval += bonus;
>
>... via < 0 ...
>
>if ( distance1 - distance2 < 0 )
> eval += bonus;
>
>... to the final sub(lea)-sar-and expression:
>
>eval += ( (distance1 - distance2) >> 31 ) & bonus;
Here and there i'm actually using that already. Very good
that you mention it again, Gerd.
I should use it perhaps more. Didn't systematically use it yet. Always
considered compilers to solve my problem.
Perhaps i should have less trust in Nalimov's CMOV implementation getting in the
main source code tree for visual x64 2006 :)
>
>That's really cheap, if P4 don't cares.
Forget Prescott.
>It requires only one extra register - while cmov requires two.
>If one argument is a constant, while the other is scaled by 2,4,8 it is
>still one lea instruction to produce the difference in a register.
>(Otherwise i prefere mov-add/sub because lea is two cycles on AMD64).
>
>Of course it is weird to sacrifice readability to avoid branches!
That's one of the main concerns in my code and one of the most important reason
why i have so many branches.
Like if you have code like this (randomly written down example):
if( quickboard[sq_e3] == brook && quickboard[sq_e4] == bpawn
&& quickboard[sq_e5] == bking )
s += 5;
You can of course consider doing this:
int orall = ((quickboard[sq_e3]-brook)
|(quickboard[sq_e4]-bpawn)
|(quickboard[sq_e5]-bking)
);
if( !orall ) s += 5;
Generating a CMOV by that.
Avoiding a near sure misprediction somewhere.
Like branch 1 might have normally spoken a prediction ratio of 85% to get
not taken, still that means 15% gets misprediction rate which hurts *badly*.
p.s. how to apply a shifting trick to the above?
>So for better readability and maintainability i suggest some macros or inliners:
>__forceinline
>int ifAlessBthenCelse0(int a, int b, int c) { ((a-b)>>31) & c; }
>
>eval += ifAlessBthenCelse0(distance1, distance2, bonus);-)
bah, putting for once comments with your code would be more helpful :)
>Take care not accidently to violate the us. patent by sun!
:)
>The "Apparatus for directing a parallel processing computing device to form an
>absolute value of a signed value":
>
>abs(x) ::= ( x ^ (x>>31) ) - (x>>31)
#ifdef INTEGERIS32BITS
abs(x) :: = x & 0x7fffffff; // 32 bits int
#else
..
>
>The {-1,0}-mask is used here to build the one's complement if negative, while
>subtract minus one does the two's complement.
>
>http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=ptxt&S1=6073150&OS=6073150&RS=6073150
>
>May be intel's weird shift implementation in P4 was the revenge ;-)
:)
>What about to force cdq for x86 in C by casting to long long or __int64 and then
>take the high dword where the sign extension occured as the mask? Does that
>violate the patent?
>
>int abs(int x) {
> int m = (int)( (unsigned __int64)((__int64)x) >> 32 );
> return (x ^ m) - m;
>}
honestely, 64 bits is so slow and all those quads have server2003 installed
which is 32 bits, so having a program that's 32 bits code is really advised.
then you can show up quad every tournament...
>and the generated assembly:
>
>?abs@@YIHH@Z PROC NEAR ; abs, COMDAT
>; _x$ = ecx
> 00000 8b c1 mov eax, ecx
> 00002 99 cdq
> 00003 8b c2 mov eax, edx
> 00005 33 c1 xor eax, ecx
> 00007 2b c2 sub eax, edx
> 00009 c3 ret 0
>?abs@@YIHH@Z ENDP ; abs
>
>Or what about this?
>
>int abs(int x) {return (x ^ -(x < 0)) + (x < 0);}
>
>?abs@@YIHH@Z PROC NEAR ; abs, COMDAT
>; _x$ = ecx
> 00000 33 d2 xor edx, edx
> 00002 85 c9 test ecx, ecx
> 00004 0f 9c c2 setl dl
> 00007 8b c2 mov eax, edx
> 00009 f7 d8 neg eax
> 0000b 33 c1 xor eax, ecx
> 0000d 03 c2 add eax, edx
> 0000f c3 ret 0
>?abs@@YIHH@Z ENDP ; abs
>
>Since i don't own a compiler producing cmovs - i unfortunately only played with
>cmov in inlined inline assembly for max, min, abs (abs is rarely used in my
>program) and the drawbacks with passing arguments already in registers via stack
>- as you may imagine, it was considerable slower.
>
>As you may know - conditional write is also a nice trick to avoid branches -
>specially with amd64 write combining. Conditional index increment for a target
>array. Agner Fog's hint otoh with rep movs (memcpy?) and ecx one or zero don't
>looks so promising for amd64.
>
>Cheers,
>Gerd
>
>
>>
>>Of course mainly at AMD, as it's easier to get a quad opteron for a tournament,
>>than it is to get a quad Xeon. I fear that'll be the case in 2007 too.
>>
>>However, those opteron chips are there now and a compiler generating fast x64
>>code is not there. Simply because there is only a microsoft compiler and
>>microsofts nickname here is wintel.
>>
>>As this conditional move is fast for the Israeli processor line and Xeon group
>>will release such a pentium-m dual core xeon doing 4 instructions a cycle end of
>>2006, not to confuse with the dual core p4 xeon that releases start of 2006 on
>>paper; does this mean by 2007 'suddenly' the microsoft compiler can do
>>conditional moves at x64 by 2007 somewhere?
>>
>>Or is the same problem there at pentium-m with their medium sized L1 cache (only
>>32KB data) and probably inferior L2 cache compared to AMD.
>>
>>Of course from multiprocessing viewpoint, sharing that L2 cache is a bad thing
>>for pentium-m. dual core opteron isn't doing that of course. So scaling at
>>opteron should be much better for crafty, diep, zappa and the baron. Basically
>>the dual core intel we should see as a single core with improved hyperthreading.
>>Perhaps even scales 70%+.
>>
>>However the raw speed of a single core xeon should be way faster, so total speed
>>at the cpu should be significant faster than dual core opteron.
>>
>>Vincent
>>
>>>Thanks,
>>>Eugene
>>>
>>>>>
>>>>>So i assume intel EM64T became slower and as a result of that it was abandonned?
>>>>>
>>>>>Vincent
>>>>>
>>>>>>I suspect there are several reasons for this:
>>>>>>* branch predictors are good, and majority of branches can be correctly
>>>>>>predicted
>>>>>>* CMOV is long instruction; short branch is shorter, so program with less CMOVs
>>>>>>fits better into cache
>>>>>>* there is no 8-bit form of CMOV
>>>>>>* CMOV has no "CMOV reg, immediate" form; if you need it you first have to load
>>>>>>immediate into register, this executing more instructions and increasing
>>>>>>register pressure -- serious problem on x86
>>>>>>* for invalid address "CMOV reg, memory" will give you access violation even if
>>>>>>condition is false.
>>>>>>
>>>>>>Thanks,
>>>>>>Eugene
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.