Computer Chess Club Archives


Subject: Re: Expert Assembler Question

Author: Robert Hyatt

Date: 22:58:41 08/27/05


On August 27, 2005 at 17:08:53, Ed Schröder wrote:

>On August 27, 2005 at 14:08:30, Gerd Isenberg wrote:
>
>>On August 27, 2005 at 08:42:09, Ed Schröder wrote:
>>
>>>On August 27, 2005 at 05:36:51, Gerd Isenberg wrote:
>>>
>>>>On August 27, 2005 at 04:34:03, Ed Schröder wrote:
>>>>
>>>>>On August 27, 2005 at 00:43:29, Tony Werten wrote:
>>>>>
>>>>>>On August 26, 2005 at 18:12:30, Ed Schröder wrote:
>>>>>>
>>>>>>>I am no longer up-to-date regarding the newest processors (such as the AMD-64)
>>>>>>>and the internal working concerning speed, hence my question:
>>>>>>>
>>>>>>>Which (similar) code is faster?
>>>>>>>
>>>>>>>       test    byte ptr xxx,1    |        test    byte ptr xxx,1
>>>>>>>       je      label             |        mov     AL,[ECX]
>>>>>>>       mov     AL,[ECX]          |        je      label
>>>>>>>       mov     BL,[EDX]          |        mov     BL,[EDX]
>>>>>>>       ...     ........          |        ...     ........
>>>>>>>       ...     ........          |        ...     ........
>>>>>>>label:                           | label:
>>>>>>>
>>>>>>>Thanks in advance,
>>>>>>
>>>>>>Hi Ed,
>>>>>
>>>>>Hey Tony,
>>>>>
>>>>>
>>>>>>probably not what you wanted to know, but the two code sequences are quite
>>>>>>different from each other.
>>>>>>
>>>>>>If the jump condition is met 50% of the time, then the left code will execute
>>>>>>the 2 moves 50% of the time for an average of 1 move per loop and the right side
>>>>>>100%+50% is 1.5 moves per loop on average.
>>>>>>
>>>>>>Did you mean something else ?
>>>>>
>>>>>Yep :)
>>>>>
>>>>>The background of my question is the processor's capability to do 2 instructions
>>>>>at the same time. Following this logic the code on the right (in principle) is
>>>>>supposed to be faster.
>>>>>
>>>>>
>>>>>
>>>>>>2 BTW's:
>>>>>>
>>>>>>1 Depending on what you do with AL and BL, you might want to use the full
>>>>>>registers by doing movzx eax,[ecx] and movzx ebx,[edx] (No penalty on new
>>>>>>processors)
>>>>>
>>>>>That's good to know, thank you.
>>>>
>>>>
>>>>Yes, reading bytes to partial registers is expensive anyway, 4 cycles latency.
>>>>
>>>>MOV reg8, mem8 8Ah mm-xxx-xxx DirectPath 4
>>>>MOV AL, mem8   A0h            DirectPath 4
>>>
>>>Are you saying Gerd that:
>>>
>>> mov EAX, mem32 is faster than mov AL,mem8 ?
>>
>>Yes, slightly - according to the optimization manual three (not 1!) cycles
>>instead of four (in 32-bit as well as in 64-bit mode):
>
>Ok.
>
>
>>MOV reg8, mem8 8Ah mm-xxx-xxx DirectPath 4
>>MOV reg16, mem16 8Bh mm-xxx-xxx DirectPath 4
>>MOV reg32/64, mem32/64 8Bh mm-xxx-xxx DirectPath 3
>
>So why are chess engines still using 8-bit boards and tables?
>
>He he he....

Clock cycles are not the only component of performance.  :)

Cache footprint is another issue.  If you reference an 8-bit value, on the
Opteron you pull 64 of them into one 64-byte cache line.  If you reference a
32-bit value, you only get 16 of them per line, so to access all 64 entries of
a table (or even scattered ones), you incur up to four cache-line fills.

So you can't just look at CPI for a given instruction and make a rational
decision on what to do...


>
>
>>I confused it with the one cycle one because of mregxx,
>>
>>MOV reg16/32/64, mreg16/32/64 8Bh 11-xxx-xxx DirectPath 1
>>
>>whose latency reflects mov reg32, reg32.
>>
>>>
>>>
>>>
>>>>Tony is right - zero extending to ax,eax,rax is also 4 cycles.
>>>>
>>>>MOVZX reg16/32/64, mem8 0Fh B6h mm-xxx-xxx DirectPath 4
>>>
>>>This is clear, not much has changed.
>>>
>>>
>>>
>>>>If you have some "global", very often used array[e.g. 64], it might be worth
>>>>wasting some memory (e.g. four cache lines instead of one) and switching to
>>>>native 32-bit int size:
>>>>
>>>>MOV reg16/32/64, mreg16/32/64 8Bh 11-xxx-xxx DirectPath 1
>>>>
>>>>Also, avoid the shorter but redundant EAX-Move encoding:
>>>>
>>>>MOV AX/EAX/RAX, mem16/32/64 A1h DirectPath 4/3/3
>>>
>>>Right, never use it.
>>
>>Nope, the A1h mem16/32/64 move has the same latency (4/3/3) as the one byte
>>longer 8Bh opcode for all gp-registers. Sorry for the confusion. Anyway it is
>>usually the choice of the assembler or compiler, unless you code directly in
>>machine language ;-)
>
>So it has been fixed after all, not that I see much practical use.
>
>Thanks Gerd.
>
>Ed



Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.