Author: Matt Taylor
Date: 13:19:21 01/05/03
There is truly nothing that could waste more time than trying to write -the- most optimal routine possible, but this has been much fun for me. I am using a tool now to read exact cycle counts for the AMD Athlon (K7) processor. I have some insight now into the bsf instruction and its implementation on different processors. The P5 (Pentium) and K6 processors both use microcode loops, I believe. Not completely sure, but the timings would imply it as they depend on the value of the operand. The P6 (PPro/Pentium 2/Pentium 3) implements bsf/bsr with a priority encoder, which is why they are fast. The K7 (Athlon) uses a binary search algorithm in microcode. I would presume the P7 is similar. Someone had posed the question, "How will Hammer's bsf instruction fare?" The answer is that it will be at most 1 cycle slower than the K7's bsf assuming it is implemented the same way. I am finishing up optimizing some of the proposed routines for Athlon. The bsf routine takes something like 13-21 cycles depending on which word the bit lies in. My favorite routine (binary search) optimized currently runs 15 cycles. I think I can get yet another cycle or two back from it. Chances are that it will run about as fast as the bsf instruction on Hammer. The difference is that it costs almost 128 bytes and 4 registers instead of 3 bytes and 1 register. Anyone interested in the final results can e-mail me for the optimized routines. The link I posted a while ago is no longer valid as I will be switching ISPs early next week. -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.