Author: Matt Taylor
Date: 19:48:35 12/26/02
Go up one level in this thread
>Hi, Matt. > >You've said this last thing before but I still don't understand. Could you show >an example of this in pseudo-asm, without using C conditionals? ("cond ? true : >false" I understand to be simply disguised branches.) Maybe you could base it >on this code fragment: > > int val, result; > ... > switch(val) { > case 0: result = a + b + c; break; > case 1: result = a + b - c; break; > case 999: result = a - b + c; break; > default: result = a - b - c; break; > } > foo(result); > >Even assuming that you can compute all the results simultaneously at a fixed >cost, how do you "conditionally select the result" without branching? The only >thing I can see to do is to stuff the results in an array of size at least 1000 >and use the variable val as an index, which wouldn't seem too hot to me. > >I'm sure I'm missing something neat. > >-- Walter > >(I'll say more on our topic of bit flogging in a few days.) Actually the ?: operator does not always resolve to a branch. In fact, in the bitscan stuff, VC 7 generated awful output for the newest algorithm you had posted when extended to 64-bits. It decided to branch; my code for the same function didn't branch once. In the example you have above, C code representing the assembly generated could look like this: int temp0, temp1, case0, case1, case999, case_default, result, default_mask; temp0 = b - c; temp1 = b + c; case0 = a + temp0; case1 = a + temp1; case999 = a - temp0; case_default = a - temp1; result = 0; default_mask = -1; // Converge on result result |= (val == 0 ? case0 : 0); default_mask &= ~(val == 0); result |= (val == 1 ? case1 : 0); default_mask &= ~(val == 0); result |= (val == 999 ? case999 : 0); default_mask &= ~(val == 0); result |= (case_default & default_mask); You have to play some games inside the CPU, but the actual IA-32 asm would look a bit like this: mov result, case_default cmp val, 0 cmove result, case0 cmp val, 1 cmove result, case1 cmp val, 999 cmove result, case999 The "cmove" instruction is "conditional move if equal/zero." If the previous comparison (val == some case) is true, it will replace the contents of result with the second operand. This strategy also works particularly well on IA-64 where the results of a comparison go into any one of 64 special 1-bit registers. In the example you gave, the particular symmetry would allow the processor to do the correct computation quickly without branching at all. In other cases where one is not so fortunate to have nice facilities, you can play games with masks. I won't get into all the many tricks & techniques for doing this because it would make this post even longer than it already is. In the case of MMX/SSE vector extensions to IA-32, comparisons will give you a resulting mask of -1 or 0 that you can manipulate. The following is the same code above implemented in SSE (vector of 4x4-bytes). This is not necessarily the best choice to implement this particular code, but it demonstrates how to use the masks to do what you want. Also, keep in mind that, though there are a lot of instructions here, many can execute side-by-side when scheduled properly. I left them unscheduled to make the code readable. uint32 cases[] = {-1, 999, 1, 0}; // First case is default case kinda. // MSB -> LSB // <MS3, MS2, LS1, LS0> pshufd xmm0, val, 0 // xmm0 = <val, val, val, val> pshufd xmm1, a, 0 // xmm1 = < a, a, a, a> pshufd xmm2, b, 0 // xmm2 = < b, b, b, b> pshufd xmm3, c, 0 // xmm3 = < c, c, c, c> // Build mask for results pcmpeqd xmm0, cases // xmm0 = conditional mask // Build mask for arithmetic -- can be pregenerated for speed pxor xmm4, xmm4 // xmm4 = < 0, 0, 0, 0> pxor xmm5, xmm5 // xmm5 = < 0, 0, 0, 0> pcmpeqd xmm6, xmm6 // xmm6 = < -1, -1, -1, -1> punpckldq xmm4, xmm6 // xmm4 = < 0, -1, 0, -1> punpcklqdq xmm5, xmm6 // xmm5 = < 0, 0, -1, -1> // Conditional negation pxor xmm2, xmm4 // xmm2 = < b, !b, b, !b> paddd xmm2, xmm4 // xmm2 = < b, -b, b, -b> // Same thing as above but for variable c pxor xmm3, xmm5 // xmm2 = < c, c, !c, !c> paddd xmm3, xmm5 // xmm2 = < c, c, -c, -c> // Compute all 4 results at same time paddd xmm1, xmm2 // xmm1 = <a+b, a-b, a+b, a-b> paddd xmm1, xmm3 // xmm1 = <a+b+c, a-b+c, a+b-c, a-b-c> // Side-track for a moment to generate another mask psrldw xmm4, 64 // xmm4 = < 0, 0, 0, -1> // Save default case movd xmm5, xmm1 // xmm5 = a-b-c // Deselect default case pandn xmm4, xmm1 // xmm4 = <a+b+c, a-b+c, a+b-c, 0> // Select result // if val == 0, xmm1 = <a+b+c, 0, 0, 0> // if val == 1, xmm1 = <0, a-b+c, 0, 0> // if val == 999, xmm1 = <0, 0, a+b-c, 0> pand xmm4, xmm0 // Fold (and find) result movdqa xmm0, xmm4 // xmm0 = <a+b+c, a-b+c, a+b-c, 0> punpckhqdq xmm4, xmm4 // xmm4 = <a+b+c, a-b+c, a+b+c, a-b+c> por xmm0, xmm4 psrldq xmm0, 32 por xmm0, xmm4 // xmm0 = result except in default case // One last thing: merge default case! pxor xmm1, xmm1 // xmm1 = 0 pcmpeqd xmm1, xmm0 // xmm1 = (default ? -1 : 0) pand xmm1, xmm5 por xmm1, xmm0 // xmm1 = result movd result, xmm1 -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.