Author: Eugene Nalimov
Date: 22:04:08 08/20/00
On August 20, 2000 at 15:02:30, Brian Richardson wrote:

>On August 20, 2000 at 14:58:59, Brian Richardson wrote:
>
>>On August 17, 2000 at 23:54:33, Eugene Nalimov wrote:
>>
>>>(1) I already posted that shift by the variable amount have a huge latency on
>>>the Itanium, so that there would be no win -- at least no win in terms of clock
>>>cycles (yes, function probably would be smaller, but not faster). And you'll
>>>need assembly for that -- code I posted is 100% C.
>>>(2) I wrote the "original" x86 code (x86.s), after that somebody converted it
>>>from Linux-on-x86-asm to everybody-else-on-x86-asm and moved it into vcinline.h.
>>>Also, FirstOne()/LastOne() were rewritten, as P6/PII/PIII has fast BSR/BSF
>>>instructions, so it's beneficial to use them (on original Pentium they were
>>>terrible slow).
>>>
>>>Eugene
>>>
>>>On August 17, 2000 at 17:23:45, Brian Richardson wrote:
>>>
>>>>Eugene: First thank you for your work on EGTBs. Second, thanks for looking
>>>>into the IA-64 coding issues. Had you considered the IA-64 instruction that
>>>>finds the first non-zero byte in operands of various sizes, and then using that
>>>>to index an 8-bit array? Perhaps it is too slow vs your method with may exploit
>>>>more parallelism.
>>>>
>>>>Brian
>>>>
>>>>PS Did you write vcinline.h for Crafty?
>>
>>Thank you for your reply.
>>I was just wondering if you had considered the Compute Zero Index (CZX)
>>instruction to find the first non-zero byte and then index lookup that with an
>>array preset with first/last bits...
>>
>>Brian
>
>PS I do not know if the CZX instruction incurs the same shift latency you were
>referring to above, or perhaps a subsequent byte load using the result of the
>CZX index does...

No. The latency of CZX is 2 clock cycles (from memory, so maybe I am wrong). But
after that you need to shift by the result of CZX * 8 to get that non-zero byte
into the low 8 bits -- and the latency of that shift is 4.
So the resulting code would be something like this:

    CZX    r1=rArg;;
    SHLADD r2=r1, 3, r0;;
    SHL    r3=rArg, r2;;
    ZXT    r3=r3;;
    ADD    r3=address of the first_one_8bits;;
    LD1    r8=[r3];;
    ADD    r8=r2;;

Total time is 12 clock cycles -- exactly the same as the plain C version.

Eugene
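For readers who want the data flow of the sequence above in portable form, here is a minimal C sketch of the same byte-index-plus-table idea. It is an illustration only, not the code from Crafty or the C version mentioned in the post. Assumptions: bit 0 is the least significant bit, the function returns the index of the lowest set bit, and 64 is returned for a zero argument. The table name first_one_8bits comes from the post; init_first_one_8bits(), first_nonzero_byte() and first_one() are hypothetical names, and the byte search is an ordinary loop standing in for the CZX step.

```c
#include <stdint.h>

/* Table named as in the post: first_one_8bits[b] is the index (0..7) of the
   lowest set bit of byte b. Entry 0 is never consulted. */
static unsigned char first_one_8bits[256];

/* Hypothetical helper: fill the table once before first use. */
static void init_first_one_8bits(void)
{
    for (int b = 1; b < 256; b++) {
        int i = 0;
        while (((b >> i) & 1) == 0)
            i++;
        first_one_8bits[b] = (unsigned char)i;
    }
}

/* Hypothetical stand-in for the byte-index step: index (0..7) of the lowest
   non-zero byte, or 8 if x is zero. This is a plain loop, not the CZX
   instruction. */
static int first_nonzero_byte(uint64_t x)
{
    int i = 0;
    while (i < 8 && ((x >> (8 * i)) & 0xFF) == 0)
        i++;
    return i;
}

/* Index of the lowest set bit of x (bit 0 = least significant), or 64 if
   x is zero. Assumed numbering; Crafty's FirstOne() may number bits
   differently. */
static int first_one(uint64_t x)
{
    if (x == 0)
        return 64;
    int shift = first_nonzero_byte(x) * 8;          /* byte index * 8          */
    unsigned byte = (unsigned)(x >> shift) & 0xFFu; /* variable shift + zero-extend */
    return first_one_8bits[byte] + shift;           /* table load + final add  */
}
```

Every step depends on the result of the previous one, which is the point made above: the latencies add up along the chain, so a fast byte-index instruction alone does not shorten the total.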
Last modified: Thu, 15 Apr 21 08:11:13 -0700