Author: Gerd Isenberg
Date: 11:12:34 07/06/03
On July 06, 2003 at 14:00:03, Omid David Tabibi wrote:

>On July 06, 2003 at 13:29:56, Gerd Isenberg wrote:
>
>>On July 06, 2003 at 12:57:59, Dieter Buerssner wrote:
>>
>>>On July 06, 2003 at 05:02:50, Gerd Isenberg wrote:
>>>
>>>
>>>>With MSVC using math.h abs is fastest. With gcc cdq inline assembly abs or omids
>>>>c-abs is much faster than the branching lib abs (maybe a macro from some header
>>>>file?).
>>>
>>>Hi Gerd, as far as I can see, abs is no macro in my gcc environment. It wouldn't
>>>be possible with Standard C methods, would it? Because you would not be allowed
>>>to evaluate the argument twice. Of course, they could use compiler specific
>>>extensions and/or inlining. I checked by precompiling the source. I think, Gcc
>>>will detect abs() just like other functions (memcpy for example) and can inline
>>>it directly. Indeed I see the "simple_abs" method branch in the assembly.
>>>
>>>The strange thing, that omid_abs was significantly faster than nothing with MSVC
>>>and rand(), do you have any idea?
>>
>>hmm, not really - may be because omids_abs is the only one which predicts the
>>conditional loop jump correctly all the times ;-)
>
>That's what I thought; but apparently 'sar' costs more than the branch I tried
>to evade.
>

Maybe the sar latency is the reason that the outcome of an out-of-order, pre-executed dec edi is predicted correctly. Doesn't VTune report branch mispredictions?

>
>>
>>What about unrolling the loop a bit, eg. repeat the body statement 2..10 times.
>>Doubling the speed of a function by adding additional abs code - not bad ;-)
>>
>>Gerd
>>
>>
>>Here the assembly of tfunc_omid_abs
>>>
>>>PUBLIC  @tfunc_omid_abs@0
>>>; COMDAT @tfunc_omid_abs@0
>>>_TEXT   SEGMENT
>>>@tfunc_omid_abs@0 PROC NEAR   ; COMDAT
>>>; Line 61
>>>        push    esi
>>>        push    edi
>>>        xor     esi, esi
>>>        mov     edi, 1000000000   ; 3b9aca00H
>>>$L877:
>>>        call    _rand
>>>        sub     eax, 16384        ; 00004000H
>>>        mov     ecx, eax
>>>        sar     ecx, 31           ; 0000001fH
>>>        mov     edx, ecx
>>>        xor     edx, eax
>>>        sub     edx, ecx
>>>        add     esi, edx
>>>        dec     edi
>>>        jne     SHORT $L877
>>>        pop     edi
>>>        mov     eax, esi
>>>        pop     esi
>>>        ret     0
>>>@tfunc_omid_abs@0 ENDP
>>>
>>>Now for tfunc_nothing
>>>
>>>; COMDAT @tfunc_nothing@0
>>>_TEXT   SEGMENT
>>>@tfunc_nothing@0 PROC NEAR    ; COMDAT
>>>; Line 228
>>>        push    esi
>>>        push    edi
>>>        xor     esi, esi
>>>        mov     edi, 1000000000   ; 3b9aca00H
>>>$L969:
>>>        call    _rand
>>>        dec     edi
>>>        lea     esi, DWORD PTR [esi+eax-16384]
>>>        jne     SHORT $L969
>>>        pop     edi
>>>        mov     eax, esi
>>>        pop     esi
>>>        ret     0
>>>@tfunc_nothing@0 ENDP
>>>
>>>Looks about as tight as possible. The a += rand()-16384 with one lea.
>>>But also shows, that with this method and clever inlining of the compiler,
>>>things are not 100% comparable.
>>>
>>>And tfunc_abs (library):
>>>
>>>PUBLIC  @tfunc_abs@0
>>>; COMDAT @tfunc_abs@0
>>>_TEXT   SEGMENT
>>>@tfunc_abs@0 PROC NEAR        ; COMDAT
>>>; Line 229
>>>        push    esi
>>>        push    edi
>>>        xor     esi, esi
>>>        mov     edi, 1000000000   ; 3b9aca00H
>>>$L978:
>>>        call    _rand
>>>        sub     eax, 16384        ; 00004000H
>>>        cdq
>>>        xor     eax, edx
>>>        sub     eax, edx
>>>        add     esi, eax
>>>        dec     edi
>>>        jne     SHORT $L978
>>>        pop     edi
>>>        mov     eax, esi
>>>        pop     esi
>>>        ret     0
>>>@tfunc_abs@0 ENDP
>>>
>>>All very similar, all should use comparable time (the time of rand()), but
>>>tfunc_omid_abs is double as fast!
>>>
>>>Does the P4 like aligned jump labels? Can they give such extreme effects? Hard
>>>to believe.
>>>
>>>BTW.
>>>When I
>>>
>>>#define RAND_VAL() ((int)n)
>>>
>>>to get rid of the rand() overhead (and of course also giving the branch using
>>>versions an advantage), I get normal results:
>>>
>>>     nothing  4051657984  0.811
>>>         abs  4051657984  1.702
>>>  simple_abs  4051657984  1.923
>>>    omid_abs  4051657984  1.702
>>>     sbb_abs  4051657984  4.156
>>>     cdq_abs  4051657984  4.457
>>>    fish_abs  4051657984  2.063
>>>     sar_abs  4051657984  3.324
>>>   cmovl_abs  4051657984  2.604
>>>   cmovs_abs  4051657984  2.644
>>>
>>>4051657984 = ((1e9 * (1e9+1))/2) % 2^^32; as expected for N_ITERATIONS=1e9.
>>>
>>>The 0.8 s for nothing is about 2 cycles, which seems reasonable for the loop
>>>
>>>$L977:
>>>        add     eax, ecx
>>>        dec     ecx
>>>        jne     SHORT $L977
>>>
>>>Regards,
>>>Dieter