Computer Chess Club Archives


Subject: Why Shift after CZX Re: CZX IA-64 Instruction

Author: Brian Richardson

Date: 16:29:42 08/21/00


On August 21, 2000 at 01:04:08, Eugene Nalimov wrote:

>On August 20, 2000 at 15:02:30, Brian Richardson wrote:
>
>>On August 20, 2000 at 14:58:59, Brian Richardson wrote:
>>
>>>On August 17, 2000 at 23:54:33, Eugene Nalimov wrote:
>>>
>>>>(1) I already posted that a shift by a variable amount has a huge latency on
>>>>the Itanium, so there would be no win -- at least no win in terms of clock
>>>>cycles (yes, the function would probably be smaller, but not faster). And you'll
>>>>need assembly for that -- the code I posted is 100% C.
>>>>(2) I wrote the "original" x86 code (x86.s), after that somebody converted it
>>>>from Linux-on-x86-asm to everybody-else-on-x86-asm and moved it into vcinline.h.
>>>>Also, FirstOne()/LastOne() were rewritten, as P6/PII/PIII have fast BSR/BSF
>>>>instructions, so it's beneficial to use them (on the original Pentium they were
>>>>terribly slow).
>>>>
>>>>Eugene
>>>>
>>>>On August 17, 2000 at 17:23:45, Brian Richardson wrote:
>>>>
>>>>>Eugene:  First thank you for your work on EGTBs.  Second, thanks for looking
>>>>>into the IA-64 coding issues.  Had you considered the IA-64 instruction that
>>>>>finds the first non-zero byte in operands of various sizes, and then using that
>>>>>to index an 8-bit array?  Perhaps it is too slow vs your method, which may exploit
>>>>>more parallelism.
>>>>>
>>>>>Brian
>>>>>
>>>>>PS  Did you write vcinline.h for Crafty?
>>>
>>>Thank you for your reply.
>>>I was just wondering if you had considered the Compute Zero Index (CZX)
>>>instruction to find the first non-zero byte and then using that byte to index
>>>an array preset with first/last bits...
>>>
>>>Brian
>>
>>PS  I do not know if the CZX instruction incurs the same shift latency you were
>>referring to above, or perhaps a subsequent byte load using the result of the
>>CZX index does...
>
>No. Latency of CZX is 2 clock cycles (from memory, so maybe I am wrong). But
>after that you need shift by the result of CZX * 8 to get that non-zero byte
>into low 8 bits -- and latency of that shift is 4.
>
>So the resulting code would be something like this:
>    CZX    r1=rArg;;
>    SHLADD r2=r1, 3, r0;;
>    SHL    r3=rArg, r2;;
>    ZXT    r3=r3;;
>    ADD    r3=address of the first_one_8bits;;
>    LD1    r8=[r3];;
>    ADD    r8=r2;;
>
>Total time is 12 clock cycles -- exactly as the plain C version.
>
>Eugene

I am not very familiar with the various forms of the LOAD instruction, but
couldn't the CZX result simply be used as the byte offset for a one-byte load
(which I assume would put the byte in the low-order position and zero-extend
it), as an alternative to shifting (or perhaps to the EXTRACT instruction)?
The loaded byte would go to another register that is then used to index the
8bits array.  Of course, this might not be faster.
Brian
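
For readers following the thread, here is a minimal portable C sketch of the byte-index-plus-table idea discussed above. It is only an illustration under stated assumptions: the names lowest_bit_in_byte, init_lowest_bit_table, first_nonzero_byte and lowest_set_bit are hypothetical and are not taken from Crafty or from Eugene's posted code, a plain loop stands in for the byte-index step that the thread proposes doing with CZX, and bits are numbered from the least significant end (0..63), which may differ from the FirstOne()/LastOne() convention.

```c
#include <stdint.h>

/* Hypothetical lookup table: lowest_bit_in_byte[b] is the index (0..7)
 * of the lowest set bit in the byte value b. Entry 0 is never used. */
static unsigned char lowest_bit_in_byte[256];

/* Fill the table once before calling lowest_set_bit(). */
static void init_lowest_bit_table(void)
{
    for (int b = 1; b < 256; b++) {
        int bit = 0;
        while (((b >> bit) & 1) == 0)
            bit++;
        lowest_bit_in_byte[b] = (unsigned char)bit;
    }
}

/* Portable stand-in for the byte-index step: returns the index (0..7)
 * of the lowest non-zero byte of x, or 8 if x is zero. */
static int first_nonzero_byte(uint64_t x)
{
    for (int i = 0; i < 8; i++)
        if ((x >> (8 * i)) & 0xFF)
            return i;
    return 8;
}

/* Index of the lowest set bit of x (0..63), or 64 if x is zero. */
static int lowest_set_bit(uint64_t x)
{
    int byte_index = first_nonzero_byte(x);
    if (byte_index == 8)
        return 64;
    int shift = byte_index * 8;                    /* byte index * 8 */
    unsigned byte = (unsigned)((x >> shift) & 0xFF); /* variable shift */
    return lowest_bit_in_byte[byte] + shift;       /* table lookup + offset */
}
```

The variable shift in lowest_set_bit() corresponds to the step Eugene identifies as costly on Itanium; the sketch says nothing about cycle counts and only shows the data flow.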



Last modified: Thu, 15 Apr 21 08:11:13 -0700
