Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: C and C++ --- NPS

Author: Matt Taylor

Date: 19:48:35 12/26/02

Go up one level in this thread


>Hi, Matt.
>
>You've said this last thing before but I still don't understand.  Could you show
>an example of this in pseudo-asm, without using C conditionals? ("cond ? true :
>false" I understand to be simply disguised branches.)  Maybe you could base it
>on this code fragment:
>
>    int val, result;
>    ...
>    switch(val) {
>       case 0:    result = a + b + c;  break;
>       case 1:    result = a + b - c;  break;
>       case 999:  result = a - b + c;  break;
>       default:   result = a - b - c;  break;
>       }
>    foo(result);
>
>Even assuming that you can compute all the results simultaneously at a fixed
>cost, how do you "conditionally select the result" without branching?  The only
>thing I can see to do is to stuff the results in an array of size at least 1000
>and use the variable val as an index, which wouldn't seem too hot to me.
>
>I'm sure I'm missing something neat.
>
>-- Walter
>
>(I'll say more on our topic of bit flogging in a few days.)

Actually the ?: operator does not always resolve to a branch. In fact, in the
bitscan stuff, VC 7 generated awful output for the newest algorithm you had
posted when extended to 64-bits. It decided to branch; my code for the same
function didn't branch once.

In the example you have above, C code representing the assembly generated could
look like this:

int temp0, temp1, case0, case1, case999, case_default, result, default_mask;

temp0 = b - c;
temp1 = b + c;
case0 = a + temp0;
case1 = a + temp1;
case999 = a - temp0;
case_default = a - temp1;
result = 0;
default_mask = -1;

// Converge on result
result |= (val == 0 ? case0 : 0);
default_mask &= ~(val == 0);

result |= (val == 1 ? case1 : 0);
default_mask &= ~(val == 0);

result |= (val == 999 ? case999 : 0);
default_mask &= ~(val == 0);

result |= (case_default & default_mask);


You have to play some games inside the CPU, but the actual IA-32 asm would look
a bit like this:

mov    result, case_default

cmp    val, 0
cmove  result, case0

cmp    val, 1
cmove  result, case1

cmp    val, 999
cmove  result, case999

The "cmove" instruction is "conditional move if equal/zero." If the previous
comparison (val == some case) is true, it will replace the contents of result
with the second operand.

This strategy also works particularly well on IA-64 where the results of a
comparison go into any one of 64 special 1-bit registers. In the example you
gave, the particular symmetry would allow the processor to do the correct
computation quickly without branching at all.

In other cases where one is not so fortunate to have nice facilities, you can
play games with masks. I won't get into all the many tricks & techniques for
doing this because it would make this post even longer than it already is. In
the case of MMX/SSE vector extensions to IA-32, comparisons will give you a
resulting mask of -1 or 0 that you can manipulate.

The following is the same code above implemented in SSE (vector of 4x4-bytes).
This is not necessarily the best choice to implement this particular code, but
it demonstrates how to use the masks to do what you want. Also, keep in mind
that, though there are a lot of instructions here, many can execute side-by-side
when scheduled properly. I left them unscheduled to make the code readable.

uint32 cases[] = {-1, 999, 1, 0}; // First case is default case kinda.

                                    // MSB -> LSB
                                    //        <MS3, MS2, LS1, LS0>
pshufd      xmm0, val, 0            // xmm0 = <val, val, val, val>
pshufd      xmm1, a, 0              // xmm1 = <  a,   a,   a,   a>
pshufd      xmm2, b, 0              // xmm2 = <  b,   b,   b,   b>
pshufd      xmm3, c, 0              // xmm3 = <  c,   c,   c,   c>

// Build mask for results
pcmpeqd     xmm0, cases             // xmm0 = conditional mask

// Build mask for arithmetic -- can be pregenerated for speed
pxor        xmm4, xmm4              // xmm4 = <  0,   0,   0,   0>
pxor        xmm5, xmm5              // xmm5 = <  0,   0,   0,   0>
pcmpeqd     xmm6, xmm6              // xmm6 = < -1,  -1,  -1,  -1>
punpckldq   xmm4, xmm6              // xmm4 = <  0,  -1,   0,  -1>
punpcklqdq  xmm5, xmm6              // xmm5 = <  0,   0,  -1,  -1>

// Conditional negation
pxor        xmm2, xmm4              // xmm2 = <  b,  !b,   b,  !b>
paddd       xmm2, xmm4              // xmm2 = <  b,  -b,   b,  -b>

// Same thing as above but for variable c
pxor        xmm3, xmm5              // xmm2 = <  c,   c,  !c,  !c>
paddd       xmm3, xmm5              // xmm2 = <  c,   c,  -c,  -c>

// Compute all 4 results at same time
paddd       xmm1, xmm2              // xmm1 = <a+b,   a-b,   a+b,   a-b>
paddd       xmm1, xmm3              // xmm1 = <a+b+c, a-b+c, a+b-c, a-b-c>

// Side-track for a moment to generate another mask
psrldw      xmm4, 64                // xmm4 = <  0,   0,   0,  -1>

// Save default case
movd        xmm5, xmm1              // xmm5 = a-b-c

// Deselect default case
pandn       xmm4, xmm1              // xmm4 = <a+b+c, a-b+c, a+b-c, 0>

// Select result
// if val == 0,   xmm1 = <a+b+c, 0, 0, 0>
// if val == 1,   xmm1 = <0, a-b+c, 0, 0>
// if val == 999, xmm1 = <0, 0, a+b-c, 0>
pand        xmm4, xmm0

// Fold (and find) result
movdqa      xmm0, xmm4              // xmm0 = <a+b+c, a-b+c, a+b-c, 0>
punpckhqdq  xmm4, xmm4              // xmm4 = <a+b+c, a-b+c, a+b+c, a-b+c>
por         xmm0, xmm4
psrldq      xmm0, 32
por         xmm0, xmm4              // xmm0 = result except in default case

// One last thing: merge default case!
pxor        xmm1, xmm1              // xmm1 = 0
pcmpeqd     xmm1, xmm0              // xmm1 = (default ? -1 : 0)
pand        xmm1, xmm5
por         xmm1, xmm0              // xmm1 = result
movd        result, xmm1

-Matt



This page took 0.35 seconds to execute

Last modified: Thu, 07 Jul 11 08:48:38 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.