Author: Gerd Isenberg
Date: 14:45:50 06/01/04
Go up one level in this thread
On June 01, 2004 at 15:28:33, Dieter Buerssner wrote: >On May 31, 2004 at 10:07:42, Gerd Isenberg wrote: > >>I tried a MMX-version based on the "dead slow" popcount of amd's optimization >>manual, with the eight add,major pairs. Even that takes about 42ns -> 2.1 >>ns/32-bit. To get an idea what is possible with AMD64 gp registers! > >Hi Gerd, >you won :-) I get 2.4 ns on my P4 2.53 GHz with your code. I cannot beat this >with my means: without assembly - even with my assembly knowledge, that predates >MMX instructions, this would probably be impossible to beat. I actually think, >the compilers produce pretty good assembly from my C-code (in this case) >already. I guess, coding my routine with MMX instructions would have a chance. >One has to "double" the masks, and use 64-bit registers, and add one stage >(which should not cost much). The first stages would be done 3 times on 3 64-bit >words, instead of 6 times, then, and one "odd" 64-bit word. Perhaps I am going >to try it, following your code. Also, on real 64-bit environments, I think my >idea should almost yield in double speed (but not totally doubled, because of >the one added round in the algorithm) compared to the 32-bit algorithm. Your >original code with maj/odd should at least double speed, however. > >BTW. It was not without pitfalls, to try your code. This was the first time, I >tried MMX inline. In my timing prog, the outputs first were wrong. This was, >because I used floating point for times, and this did not mix with the mmx >instructions. So, I had to find out, to add _mm_empty() at the right place. I >used the free VC command line tools, for the tests, now (but times for other >functions discussed, did not really change to VC6). > >Cheers, >Dieter Hi Dieter, Your "horizontal" optimization with 32-bit popcounts is rather great too, so i would consider us both as winners ;-) Sorry that i did not mention the emms/femms instruction to save/restore shared x87 registers and the use of the 3DNow psadbw instruction to build the final byte sum. I am a bit surprised about the mmx performance of your P4. I had worse experience with an older P4 2GHz with fill routines with similar independent instruction chains, iirc it was about two times slower than my former 1.4GHz athlon. Btw. one needs only four odd/maj pairs to pack seven words into three binary digit words: one1,two1 := oddMaj(x1,x2,x3) one2,two2 := oddMaj(x4,x5,x6) ones,two3 := oddMaj(x7,one1,one2) twos,four := oddMaj(two1,two2,two3) But already 11 pairs to pack 15 to 4. one1,two1 := oddMaj(x1,x2,x3) one2,two2 := oddMaj(x4,x5,x6) one3,two3 := oddMaj(x7,x8,x9) one4,two4 := oddMaj(x10,x11,x12) one5,two5 := oddMaj(x13,x14,x15) one6,two6 := oddMaj(one1,one2,one3) ones,two7 := oddMaj(one4,one5,one6) two8,four1 := oddMaj(two1,two2,two3) two9,four2 := oddMaj(two4,two5,two6) twos,four3 := oddMaj(two7,two8,two9) four,eight := oddMaj(four1,four2,four3) Hmm, 1,3,11 pairs for 2**2-1, 2**3-1 and 2**4-1 words... Too lazy to look for a general formula nPairs(N) now ;-( Anyway thanks for your interest and the instructive thread. Cheers, Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.