Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Population of disjoint Attacksets

Author: Gerd Isenberg

Date: 14:45:50 06/01/04

Go up one level in this thread


On June 01, 2004 at 15:28:33, Dieter Buerssner wrote:

>On May 31, 2004 at 10:07:42, Gerd Isenberg wrote:
>
>>I tried a MMX-version based on the "dead slow" popcount of amd's optimization
>>manual, with the eight add,major pairs. Even that takes about 42ns -> 2.1
>>ns/32-bit. To get an idea what is possible with AMD64 gp registers!
>
>Hi Gerd,
>you won :-) I get 2.4 ns on my P4 2.53 GHz with your code. I cannot beat this
>with my means: without assembly - even with my assembly knowledge, that predates
>MMX instructions, this would probably be impossible to beat. I actually think,
>the compilers produce pretty good assembly from my C-code (in this case)
>already. I guess, coding my routine with MMX instructions would have a chance.
>One has to "double" the masks, and use 64-bit registers, and add one stage
>(which should not cost much). The first stages would be done 3 times on 3 64-bit
>words, instead of 6 times, then, and one "odd" 64-bit word. Perhaps I am going
>to try it, following your code. Also, on real 64-bit environments, I think my
>idea should almost yield in double speed (but not totally doubled, because of
>the one added round in the algorithm) compared to the 32-bit algorithm. Your
>original code with maj/odd should at least double speed, however.
>
>BTW. It was not without pitfalls, to try your code. This was the first time, I
>tried MMX inline. In my timing prog, the outputs first were wrong. This was,
>because I used floating point for times, and this did not mix with the mmx
>instructions. So, I had to find out, to add _mm_empty() at the right place. I
>used the free VC command line tools, for the tests, now (but times for other
>functions discussed, did not really change to VC6).
>
>Cheers,
>Dieter

Hi Dieter,

Your "horizontal" optimization with 32-bit popcounts is rather great too, so i
would consider us both as winners ;-)

Sorry that i did not mention the emms/femms instruction to save/restore shared
x87 registers and the use of the 3DNow psadbw instruction to build the final
byte sum.

I am a bit surprised about the mmx performance of your P4. I had worse
experience with an older P4 2GHz with fill routines with similar independent
instruction chains, iirc it was about two times slower than my former 1.4GHz
athlon.

Btw. one needs only four odd/maj pairs to pack seven words into three binary
digit words:

one1,two1 := oddMaj(x1,x2,x3)
one2,two2 := oddMaj(x4,x5,x6)
ones,two3 := oddMaj(x7,one1,one2)
twos,four := oddMaj(two1,two2,two3)

But already 11 pairs to pack 15 to 4.

one1,two1  := oddMaj(x1,x2,x3)
one2,two2  := oddMaj(x4,x5,x6)
one3,two3  := oddMaj(x7,x8,x9)
one4,two4  := oddMaj(x10,x11,x12)
one5,two5  := oddMaj(x13,x14,x15)

one6,two6  := oddMaj(one1,one2,one3)
ones,two7  := oddMaj(one4,one5,one6)

two8,four1 := oddMaj(two1,two2,two3)
two9,four2 := oddMaj(two4,two5,two6)
twos,four3 := oddMaj(two7,two8,two9)
four,eight := oddMaj(four1,four2,four3)

Hmm, 1,3,11 pairs for 2**2-1, 2**3-1 and 2**4-1 words...
Too lazy to look for a general formula nPairs(N) now ;-(
Anyway thanks for your interest and the instructive thread.

Cheers,
Gerd



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.