Author: Matt Taylor
Date: 23:01:26 01/23/03
It was a nice idea. It works well in theory, but in practice it falls short of other methods for two reasons: indexing and memory latency.

On most machines it is best to index memory with word-sized values. To build the pointer to load from, the machine has to mask (movzx in the case of x86 -- to preserve the original value), add, and then load. But a table indexed by a full machine word is far too large to fit in memory, so byte-sized indices, and the masking they require, are unavoidable.

The other problem is memory latency. Even assuming nothing but L1 cache hits, a load costs 2-3 cycles of latency on the top-of-the-line chips. The loads are independent and can be overlapped -- but only as long as there is no dependency between them -- so each byte flip is still going to eat at least 3 cycles. 3x4 (assuming perfect overlap between the two halves of the 64-bit value) is 12 cycles plus setup latency.

The logarithmic method I implemented and posted in assembly runs in 20 clocks, and I'm quite sure it can be brought down to 16 clocks or so on the Athlon. Practically, that is about the same as your table-index routine, but it uses no extra memory. Using 2 bswaps to reverse the bytes, you only need a register swap plus 3 iterations of the logarithmic reverse (to reverse the bits within each byte). The register swap disappears on x86-64, and more parallelism might be extracted to make it even faster. (Trickery -should- allow the clever x86 assembly programmer to get the swap for free!)

If anyone is -really- interested, I'll optimize the bit reverse function further. For the Athlon, of course.

-Matt