Author: Dieter Buerssner
Date: 12:28:33 06/01/04
On May 31, 2004 at 10:07:42, Gerd Isenberg wrote:

>I tried a MMX-version based on the "dead slow" popcount of amd's optimization
>manual, with the eight add,major pairs. Even that takes about 42ns -> 2.1
>ns/32-bit. To get an idea what is possible with AMD64 gp registers!

Hi Gerd,

you won :-) I get 2.4 ns on my P4 2.53 GHz with your code. I cannot beat this with my means, that is, without assembly. Even with my assembly knowledge, which predates the MMX instructions, this would probably be impossible to beat. I actually think the compilers already produce pretty good assembly from my C code in this case.

I guess coding my routine with MMX instructions would have a chance. One has to "double" the masks, use 64-bit registers, and add one stage (which should not cost much). The first stages would then be done 3 times on 3 64-bit words instead of 6 times, plus one "odd" 64-bit word. Perhaps I am going to try it, following your code.

Also, on real 64-bit environments, I think my idea should yield almost double the speed compared to the 32-bit algorithm (not fully doubled, because of the one added round in the algorithm). Your original code with maj/odd should at least double in speed, however.

BTW, trying your code was not without pitfalls. This was the first time I tried MMX inline. In my timing program, the outputs were wrong at first. This was because I used floating point for the times, and that did not mix with the MMX instructions. So I had to find out that _mm_empty() must be added at the right place. I used the free VC command line tools for these tests (but the times for the other functions discussed did not really change compared to VC6).

Cheers,
Dieter
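To illustrate what "doubling" the masks and adding one stage means, here is a minimal sketch in portable C. It is not the routine discussed in this thread (which works on several words at once), only the textbook parallel bit count for a single word, shown in a 32-bit and a 64-bit form. The function names popcount32 and popcount64 are made up for this example.

```c
#include <stdint.h>

/* Textbook parallel bit count on one 32-bit word: five fold stages.
   Each stage adds neighbouring fields of 1, 2, 4, 8 and 16 bits. */
static unsigned popcount32(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1)  & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2)  & 0x33333333u);
    x = (x & 0x0f0f0f0fu) + ((x >> 4)  & 0x0f0f0f0fu);
    x = (x & 0x00ff00ffu) + ((x >> 8)  & 0x00ff00ffu);
    x = (x & 0x0000ffffu) + ((x >> 16) & 0x0000ffffu);
    return (unsigned)x;
}

/* Same idea on a 64-bit word: every mask is the 32-bit mask repeated
   twice ("doubled"), and one extra stage folds the two 32-bit halves. */
static unsigned popcount64(uint64_t x)
{
    x = (x & 0x5555555555555555ull) + ((x >> 1)  & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2)  & 0x3333333333333333ull);
    x = (x & 0x0f0f0f0f0f0f0f0full) + ((x >> 4)  & 0x0f0f0f0f0f0f0f0full);
    x = (x & 0x00ff00ff00ff00ffull) + ((x >> 8)  & 0x00ff00ff00ff00ffull);
    x = (x & 0x0000ffff0000ffffull) + ((x >> 16) & 0x0000ffff0000ffffull);
    x = (x & 0x00000000ffffffffull) + ((x >> 32) & 0x00000000ffffffffull); /* added stage */
    return (unsigned)x;
}
```

The 64-bit version does six stages for 64 bits where the 32-bit version does five stages for 32 bits, which is why one would expect close to, but not quite, double the throughput per bit.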
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.