Author: Matt Taylor
Date: 19:48:35 12/26/02
Go up one level in this thread
>Hi, Matt.
>
>You've said this last thing before but I still don't understand. Could you show
>an example of this in pseudo-asm, without using C conditionals? ("cond ? true :
>false" I understand to be simply disguised branches.) Maybe you could base it
>on this code fragment:
>
> int val, result;
> ...
> switch(val) {
> case 0: result = a + b + c; break;
> case 1: result = a + b - c; break;
> case 999: result = a - b + c; break;
> default: result = a - b - c; break;
> }
> foo(result);
>
>Even assuming that you can compute all the results simultaneously at a fixed
>cost, how do you "conditionally select the result" without branching? The only
>thing I can see to do is to stuff the results in an array of size at least 1000
>and use the variable val as an index, which wouldn't seem too hot to me.
>
>I'm sure I'm missing something neat.
>
>-- Walter
>
>(I'll say more on our topic of bit flogging in a few days.)
Actually the ?: operator does not always resolve to a branch. In fact, in the
bitscan stuff, VC 7 generated awful output for the newest algorithm you had
posted when extended to 64-bits. It decided to branch; my code for the same
function didn't branch once.
In the example you have above, C code representing the assembly generated could
look like this:
int temp0, temp1, case0, case1, case999, case_default, result, default_mask;
temp0 = b - c;
temp1 = b + c;
case0 = a + temp0;
case1 = a + temp1;
case999 = a - temp0;
case_default = a - temp1;
result = 0;
default_mask = -1;
// Converge on result
result |= (val == 0 ? case0 : 0);
default_mask &= ~(val == 0);
result |= (val == 1 ? case1 : 0);
default_mask &= ~(val == 0);
result |= (val == 999 ? case999 : 0);
default_mask &= ~(val == 0);
result |= (case_default & default_mask);
You have to play some games inside the CPU, but the actual IA-32 asm would look
a bit like this:
mov result, case_default
cmp val, 0
cmove result, case0
cmp val, 1
cmove result, case1
cmp val, 999
cmove result, case999
The "cmove" instruction is "conditional move if equal/zero." If the previous
comparison (val == some case) is true, it will replace the contents of result
with the second operand.
This strategy also works particularly well on IA-64 where the results of a
comparison go into any one of 64 special 1-bit registers. In the example you
gave, the particular symmetry would allow the processor to do the correct
computation quickly without branching at all.
In other cases where one is not so fortunate to have nice facilities, you can
play games with masks. I won't get into all the many tricks & techniques for
doing this because it would make this post even longer than it already is. In
the case of MMX/SSE vector extensions to IA-32, comparisons will give you a
resulting mask of -1 or 0 that you can manipulate.
The following is the same code above implemented in SSE (vector of 4x4-bytes).
This is not necessarily the best choice to implement this particular code, but
it demonstrates how to use the masks to do what you want. Also, keep in mind
that, though there are a lot of instructions here, many can execute side-by-side
when scheduled properly. I left them unscheduled to make the code readable.
uint32 cases[] = {-1, 999, 1, 0}; // First case is default case kinda.
// MSB -> LSB
// <MS3, MS2, LS1, LS0>
pshufd xmm0, val, 0 // xmm0 = <val, val, val, val>
pshufd xmm1, a, 0 // xmm1 = < a, a, a, a>
pshufd xmm2, b, 0 // xmm2 = < b, b, b, b>
pshufd xmm3, c, 0 // xmm3 = < c, c, c, c>
// Build mask for results
pcmpeqd xmm0, cases // xmm0 = conditional mask
// Build mask for arithmetic -- can be pregenerated for speed
pxor xmm4, xmm4 // xmm4 = < 0, 0, 0, 0>
pxor xmm5, xmm5 // xmm5 = < 0, 0, 0, 0>
pcmpeqd xmm6, xmm6 // xmm6 = < -1, -1, -1, -1>
punpckldq xmm4, xmm6 // xmm4 = < 0, -1, 0, -1>
punpcklqdq xmm5, xmm6 // xmm5 = < 0, 0, -1, -1>
// Conditional negation
pxor xmm2, xmm4 // xmm2 = < b, !b, b, !b>
paddd xmm2, xmm4 // xmm2 = < b, -b, b, -b>
// Same thing as above but for variable c
pxor xmm3, xmm5 // xmm2 = < c, c, !c, !c>
paddd xmm3, xmm5 // xmm2 = < c, c, -c, -c>
// Compute all 4 results at same time
paddd xmm1, xmm2 // xmm1 = <a+b, a-b, a+b, a-b>
paddd xmm1, xmm3 // xmm1 = <a+b+c, a-b+c, a+b-c, a-b-c>
// Side-track for a moment to generate another mask
psrldw xmm4, 64 // xmm4 = < 0, 0, 0, -1>
// Save default case
movd xmm5, xmm1 // xmm5 = a-b-c
// Deselect default case
pandn xmm4, xmm1 // xmm4 = <a+b+c, a-b+c, a+b-c, 0>
// Select result
// if val == 0, xmm1 = <a+b+c, 0, 0, 0>
// if val == 1, xmm1 = <0, a-b+c, 0, 0>
// if val == 999, xmm1 = <0, 0, a+b-c, 0>
pand xmm4, xmm0
// Fold (and find) result
movdqa xmm0, xmm4 // xmm0 = <a+b+c, a-b+c, a+b-c, 0>
punpckhqdq xmm4, xmm4 // xmm4 = <a+b+c, a-b+c, a+b+c, a-b+c>
por xmm0, xmm4
psrldq xmm0, 32
por xmm0, xmm4 // xmm0 = result except in default case
// One last thing: merge default case!
pxor xmm1, xmm1 // xmm1 = 0
pcmpeqd xmm1, xmm0 // xmm1 = (default ? -1 : 0)
pand xmm1, xmm5
por xmm1, xmm0 // xmm1 = result
movd result, xmm1
-Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.