Computer Chess Club Archives



Subject: Re: CZX IA-64 Instruction Re: Will the Itanium have a BSF or BSR instruc

Author: Eugene Nalimov

Date: 22:04:08 08/20/00



On August 20, 2000 at 15:02:30, Brian Richardson wrote:

>On August 20, 2000 at 14:58:59, Brian Richardson wrote:
>
>>On August 17, 2000 at 23:54:33, Eugene Nalimov wrote:
>>
>>>(1) I already posted that a shift by a variable amount has a huge latency on
>>>the Itanium, so that there would be no win -- at least no win in terms of clock
>>>cycles (yes, the function would probably be smaller, but not faster). And you'll
>>>need assembly for that -- the code I posted is 100% C.
>>>(2) I wrote the "original" x86 code (x86.s); after that somebody converted it
>>>from Linux-on-x86-asm to everybody-else-on-x86-asm and moved it into vcinline.h.
>>>Also, FirstOne()/LastOne() were rewritten, as P6/PII/PIII have fast BSR/BSF
>>>instructions, so it's beneficial to use them (on the original Pentium they were
>>>terribly slow).
>>>
>>>Eugene
>>>
>>>On August 17, 2000 at 17:23:45, Brian Richardson wrote:
>>>
>>>>Eugene:  First thank you for your work on EGTBs.  Second, thanks for looking
>>>>into the IA-64 coding issues.  Had you considered the IA-64 instruction that
>>>>finds the first non-zero byte in operands of various sizes, and then using that
>>>>to index an 8-bit array?  Perhaps it is too slow vs your method, which may exploit
>>>>more parallelism.
>>>>
>>>>Brian
>>>>
>>>>PS  Did you write vcinline.h for Crafty?
>>
>>Thank you for your reply.
>>I was just wondering if you had considered the Compute Zero Index (CZX)
>>instruction to find the first non-zero byte and then index lookup that with an
>>array preset with first/last bits...
>>
>>Brian
>
>PS  I do not know if the CZX instruction incurs the same shift latency you were
>referring to above, or perhaps a subsequent byte load using the result of the
>CZX index does...

No. The latency of CZX is 2 clock cycles (from memory, so maybe I am wrong). But
after that you need a shift by the result of CZX * 8 to get that non-zero byte
into the low 8 bits -- and the latency of that shift is 4.

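So the resulting code would be something like this:
    CZX    r1=rArg;;                  // byte index found by CZX
    SHLADD r2=r1, 3, r0;;             // r2 = r1 * 8 (bit offset of that byte)
    SHR    r3=rArg, r2;;              // variable shift right, latency 4
    ZXT    r3=r3;;                    // keep only the low 8 bits
    ADD    r3=r3, address of first_one_8bits;;
    LD1    r8=[r3];;                  // table lookup
    ADD    r8=r8, r2;;                // add the bit offset back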

Total time is 12 clock cycles -- exactly the same as the plain C version.
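
For readers who do not read IA-64 assembly, the same sequence can be sketched in portable C. This is only an illustration under stated assumptions: the helper lowest_nonzero_byte() is a hypothetical stand-in for the byte-index instruction, the table first_one_8bits is filled here with "index of the lowest set bit", and bit 0 is taken to be the least significant bit, which may not match the numbering FirstOne() uses in Crafty.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 256-entry table: index of the lowest set bit in a byte.
   Entry 0 is never read because the argument is required to be non-zero. */
static unsigned char first_one_8bits[256];

static void init_first_one_8bits(void)
{
    for (int b = 1; b < 256; b++) {
        int bit = 0;
        while (((b >> bit) & 1) == 0)
            bit++;
        first_one_8bits[b] = (unsigned char)bit;
    }
}

/* Stand-in (assumption) for the byte-index instruction:
   index of the lowest non-zero byte of x, where x != 0. */
static unsigned lowest_nonzero_byte(uint64_t x)
{
    unsigned i = 0;
    while (((x >> (8 * i)) & 0xFF) == 0)
        i++;
    return i;
}

/* Bit index (0 = least significant) of the lowest set bit of x. */
static unsigned first_one_bytewise(uint64_t x)
{
    assert(x != 0);
    unsigned shift = lowest_nonzero_byte(x) * 8;     /* byte index * 8      */
    unsigned byte  = (unsigned)(x >> shift) & 0xFF;  /* variable shift, ZXT */
    return first_one_8bits[byte] + shift;            /* load, add offset    */
}
```

Call init_first_one_8bits() once before the first lookup. The variable shift in first_one_bytewise() is the step that costs 4 cycles in the listing above, which is why the byte-index approach does not come out ahead.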

Eugene




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.