Author: Robert Hyatt
Date: 22:58:41 08/27/05
On August 27, 2005 at 17:08:53, Ed Schröder wrote:

>On August 27, 2005 at 14:08:30, Gerd Isenberg wrote:
>
>>On August 27, 2005 at 08:42:09, Ed Schröder wrote:
>>
>>>On August 27, 2005 at 05:36:51, Gerd Isenberg wrote:
>>>
>>>>On August 27, 2005 at 04:34:03, Ed Schröder wrote:
>>>>
>>>>>On August 27, 2005 at 00:43:29, Tony Werten wrote:
>>>>>
>>>>>>On August 26, 2005 at 18:12:30, Ed Schröder wrote:
>>>>>>
>>>>>>>I am no longer up-to-date regarding the newest processors (such as the AMD-64)
>>>>>>>and the internal working concerning speed, hence my question:
>>>>>>>
>>>>>>>Which (similar) code is faster?
>>>>>>>
>>>>>>>  test byte ptr xxx,1     |   test byte ptr xxx,1
>>>>>>>  je   label              |   mov  AL,[ECX]
>>>>>>>  mov  AL,[ECX]           |   je   label
>>>>>>>  mov  BL,[EDX]           |   mov  BL,[EDX]
>>>>>>>  ... ........            |   ... ........
>>>>>>>  ... ........            |   ... ........
>>>>>>>label:                    | label:
>>>>>>>
>>>>>>>Thanks in advance,
>>>>>>
>>>>>>Hi Ed,
>>>>>
>>>>>Hey Tony,
>>>>>
>>>>>
>>>>>>probably not what you wanted to know, but the code is quite different from each
>>>>>>other.
>>>>>>
>>>>>>If the jump condition is met 50% of the time, then the left code will execute
>>>>>>the 2 moves 50% of the time for an average of 1 move per loop and the right side
>>>>>>100%+50% is 1.5 moves per loop on average.
>>>>>>
>>>>>>Did you mean something else ?
>>>>>
>>>>>Yep :)
>>>>>
>>>>>The background of my question is the processor's capability to do 2 instructions
>>>>>at the same time. Following this logic the code on the right (in principle) is
>>>>>supposed to be faster.
>>>>>
>>>>>
>>>>>
>>>>>>2 BTW's:
>>>>>>
>>>>>>1 Depending on what you do with AL and BL, you might want to use the full
>>>>>>registers by doing movzx eax,[ecx] and movzx ebx,[edx] (No penalty on new
>>>>>>processors)
>>>>>
>>>>>That's good to know, thank you.
>>>>
>>>>
>>>>Yes, reading bytes to partial registers is expensive anyway, 4 cycles latency.
>>>>
>>>>MOV reg8, mem8              8Ah  mm-xxx-xxx  DirectPath  4
>>>>MOV AL, mem8                A0h              DirectPath  4
>>>
>>>Are you saying Gerd that:
>>>
>>>  mov EAX, mem32 is faster than mov AL,mem8 ?
>>
>>Yes, slightly - according to the optimization manual three (not 1!) cycles
>>instead of four (both in 32-bit as well as in 64-bit mode):
>
>Ok.
>
>
>>MOV reg8, mem8              8Ah  mm-xxx-xxx  DirectPath  4
>>MOV reg16, mem16            8Bh  mm-xxx-xxx  DirectPath  4
>>MOV reg32/64, mem32/64      8Bh  mm-xxx-xxx  DirectPath  3
>
>So why are chess engines still using 8-bit boards and tables?
>
>He he he....

Clock cycles are not the only component of performance. :) Cache footprint is
another issue. If you reference 8-bit values, on the Opteron one 64-byte cache
line holds 64 of them. If you reference 32-bit values, you only get 16 per
line, and to access all (or even scattered) values of a 64-entry table you
incur up to four cache-line fills. So you can't just look at CPI for a given
instruction and make a rational decision on what to do...

>
>
>>I confused it with the one cycle one because of mregxx,
>>
>>MOV reg16/32/64, mreg16/32/64  8Bh  11-xxx-xxx  DirectPath  1
>>
>>whose latency reflects mov reg32, reg32.
>>
>>>
>>>
>>>
>>>>Tony is right - zero extending to ax,eax,rax is also 4 cycles.
>>>>
>>>>MOVZX reg16/32/64, mem8     0Fh B6h  mm-xxx-xxx  DirectPath  4
>>>
>>>This is clear, not much has changed.
>>>
>>>
>>>
>>>>If you have some "global", very often used array[eg. 64], it might be worth
>>>>wasting some memory (eg. four cachelines instead of one) and switching to native
>>>>32-bit int size:
>>>>
>>>>MOV reg16/32/64, mreg16/32/64  8Bh  11-xxx-xxx  DirectPath  1
>>>>
>>>>Also, avoid the shorter but redundant EAX-Move encoding:
>>>>
>>>>MOV AX/EAX/RAX, mem16/32/64  A1h  DirectPath  4/3/3
>>>
>>>Right, never use it.
>>
>>Nope, A1h mem16/32/64 move has the same latency (4/3/3) as the one byte longer
>>8Bh opcode for all gp-registers. Sorry for the confusion. Anyway it is usually the
>>choice of the assembler or compiler, unless you code directly in machine
>>language ;-)
>
>So it has been fixed after all, not that I see much practical use.
>
>Thanks Gerd.
>
>Ed
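
To put numbers on the cache-footprint point: a rough sketch of the arithmetic, assuming a 64-byte cache line (as stated for the Opteron) and a 64-entry table that starts on a line boundary. The function names here are made up for illustration and are not from any engine.

```python
def entries_per_line(elem_bytes, line_bytes=64):
    """How many table entries fit in one cache line."""
    return line_bytes // elem_bytes


def cache_lines_touched(n_entries, elem_bytes, line_bytes=64):
    """Cache lines covered by a contiguous table that starts on a line boundary."""
    total_bytes = n_entries * elem_bytes
    # Ceiling division: a partially used line still costs a full line fill.
    return (total_bytes + line_bytes - 1) // line_bytes


if __name__ == "__main__":
    # Compare an 8-bit table with a 32-bit table of the same 64 entries.
    for elem_bytes in (1, 4):
        print(
            f"{elem_bytes * 8:2d}-bit entries: "
            f"{entries_per_line(elem_bytes)} per line, "
            f"{cache_lines_touched(64, elem_bytes)} line(s) for 64 entries"
        )
```

So the one cycle saved per load with 32-bit entries has to be weighed against up to three extra line fills for the same table, and how that trade comes out depends on how hot the table is and what else is competing for the cache.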
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.