Author: Eugene Nalimov
Date: 11:16:04 08/22/00
So you meant something like

    czx     r1=rArg
    st8     [sp], rArg;;
    add     r2=sp, r1;;
    ld1     r3=[r2];;
    shladd  r1=r1, 3, r0
    add     r3=register with address of first_one_8bits;;
    ld1     r8=[r3];;
    add     r8=r1;;

Total 9 clock cycles.

Unfortunately, section 4.1 "L1 Data Cache" of "Itanium Processor Microarchitecture Reference" says "Any load from an address to which a store was made within the last 3 cycles (inside any part of an aligned 64-bit region) will cause the load to bypass L1 and read from L2". To prevent that you'll have to insert an extra stop bit -- i.e. the code will look like

    czx     r1=rArg
    st8     [sp], rArg;;
    add     r2=sp, r1;;
    shladd  r1=r1, 3, r0;;
    ld1     r3=[r2];;
    add     r3=register with address of first_one_8bits;;
    ld1     r8=[r3];;
    add     r8=r1;;

And now we have 10 clock cycles -- yes, that is slightly better than the original C version (12 clock cycles).

Eugene

On August 22, 2000 at 09:50:28, Brian Richardson wrote:

>I was thinking that a load instruction could load just the non-zero byte (base
>address of the original 8 byte argument plus a register offset found by the CZX
>instruction), and that the load would put that byte in the low order byte of the
>register and zero fill the high order bits. That would seem to combine several
>steps (I may be thinking of other hardware architectures). Then, just use that
>as the offset into the pre-computed 8bits array. There was a load single byte
>form, but I'm not sure about the register offset and zero filling parts.
>
>Alternatively, since IA-64 supports IA-32 instructions, what would the speed of
>an "emulated" BSR/BSF look like on IA-64? Of course, one would still have to
>handle two 32bit portions as is done now, I think.
>
>Thanks again for your patience and replies.
>
>Regards,
>Brian Richardson
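For readers who want to follow the data flow of the sequence discussed above without reading IA-64 assembly, here is a rough C sketch of the same idea: find the index of the interesting byte, fetch that byte, look it up in a 256-entry table, and add 8 times the byte index. It is only an illustration under several assumptions. The table contents, the bit numbering (least significant bit = 0), and the names init_first_one_8bits and first_one are hypothetical and not taken from the posts. The loop merely stands in for the byte index that the posts obtain from czx, and the byte is extracted with a shift rather than with the st8/ld1 pair, so the sketch says nothing about cycle counts or the L1 store/load penalty.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the first_one_8bits table named in the post:
   entry b holds the index (0..7, least significant bit = 0) of the lowest
   set bit of the byte value b. Entry 0 is never used by first_one(). */
static uint8_t first_one_8bits[256];

/* Fill the table once before calling first_one(). */
void init_first_one_8bits(void)
{
    first_one_8bits[0] = 0;
    for (int b = 1; b < 256; b++) {
        int bit = 0;
        while (((b >> bit) & 1) == 0)
            bit++;
        first_one_8bits[b] = (uint8_t)bit;
    }
}

/* Index of the lowest set bit of a non-zero 64-bit value. */
int first_one(uint64_t x)
{
    assert(x != 0);

    /* Stand-in for the byte index the posts take from czx:
       locate the lowest non-zero byte. */
    int byte_index = 0;
    while (((x >> (8 * byte_index)) & 0xFF) == 0)
        byte_index++;

    /* The assembly stores the argument and reloads one byte (st8 + ld1);
       a shift and mask yields the same byte value here. */
    unsigned byte = (unsigned)((x >> (8 * byte_index)) & 0xFF);

    /* Table lookup plus 8 * byte_index (the shladd/add steps). */
    return first_one_8bits[byte] + 8 * byte_index;
}
```

Call init_first_one_8bits() once at startup; after that first_one(x) may be called for any non-zero x. A different bit-numbering convention would only change how the table is filled and how the byte index is combined with the table entry.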
Last modified: Thu, 15 Apr 21 08:11:13 -0700