Author: Matt Taylor
Date: 17:32:48 02/16/03
Go up one level in this thread
On February 16, 2003 at 12:18:11, Robert Hyatt wrote:
>On February 16, 2003 at 03:03:03, Matt Taylor wrote:
>
>>On February 15, 2003 at 21:28:39, Tom Kerrigan wrote:
>>
>>>On February 13, 2003 at 19:40:45, Matt Taylor wrote:
>>>
>>>>You're not getting it. Logic on the processor for static branch prediction is
>>>>80% accurate because auxillary information available to the compiler is thrown
>>>>out. Consider the following loop:
>>>>for(i = 0; i < 1000; i++)
>>>> do_something();
>>>
>>>You're the one who's not getting it if you think processors have logic for
>>>static branch prediction (hint: processors do dynamic prediction) or if you
>>>think these are the kinds of branches that matter for execution or compilation.
>>>(Any branch prediction scheme would predict your branch with 99.9% accuracy.)
>>
>>Doesn't matter what you call it. AMD seems to think my Athlon has static branch
>>prediction. I'm not sure why you disagree.
>>
>>No, this is not the typical scenario. I didn't feel it was necessary to list all
>>sorts of loops and show how the compiler can predict them for the sake of
>>argument. Intel C already does a lot of this, and they've only scratched the
>>surface.
>>
>>>>>Sure, you can avoid having an actual branch instruction. I'm asking you to think
>>>>>deeper. How does that make the processor go any faster?
>>>>
>>>>No branch mispredict = no penalty. Not always possible, but it works well for
>>>>short functions such as abs, min, and max. If it were not so, cmov would be a
>>>>near useless instruction.
>>>
>>>That's true, and I forgot about that reason, I guess because branches are only
>>>mispredicted 5% of the time. The reason why predication would be used more
>>>aggressively on an in-order chip (i.e., why it's a big deal on IA-64) is because
>>>it allows post-condition instructions to be issued without dependancies.
>>
>>The cmov instruction still says leagues. Dynamic prediction works very well on
>>predictable branches. Not all branches are predictable.
>>
>>>>>No, more like 12 results and in only one case does the Itanium 2 outperform the
>>>>>P4. And I think I've done a very good job explaining why Crafty runs faster on
>>>>>the I2 than the P4.
>>>>The speed of gcc and perl are rather irrelevant to Chess, aren't they?
>>>
>>>They are if they better represent computer chess than Crafty does. I'd bet most
>>>chess programs out there don't use bitboards (i.e., 64 bit operations) or use
>>>bitboards less than Crafty. Bitboards are almost certainly the reason why Crafty
>>>performs well on I2 vs. the P4.
>>>
>>>-Tom
>>
>>Perhaps it is, perhaps it isn't. Athlon is much more efficient with 64-bit
>>operations than Pentium 4 is, and the Athlon isn't pulling ahead by huge strides
>>(in Crafty).
>>
>>-Matt
>
>
>I can only speak for myself, but I don't have any 64 bit results for the new
>athlons, non-disclosure or not, so I have no idea how my stuff will work there.
64-bit ALU ops on K8 are 1-cycle (3 ops/cycle). I believe the bsf instruction is
1 cycle slower (9 instead of 8 for reg-reg). I haven't read that anywhere; I'm
inferring that based on the K7's algorithm for bsf and the fact that the K8 is
pretty much the same design.
For K7, all 32-bit ALU ops are 1-cycle + possible memory stalls. This means
theoretical throughput of 1.5 ops/cycle in 64-bit code. Logical instructions are
ok on the Pentium 4, but shifts and arithmetic are terrible. 64-bit arithmetic
is going to cost 3-5 cycles, and the result won't be available for 10 cycles.
Shifts are equally bad. For Athlon, 64-bit arithmetic costs 2 cycles. A 64-bit
shift is up to 6 cycles. Special forms (like shift by 1) cost as little as 2
cycles.
In hand-written MMX (compiler does not generate MMX instructions!), Athlon can
also do more concurrent MMX operations than Pentium 4, and they are usually
faster as well. Using the MMX unit to do logical stuff means a theoretical rate
of 3-4 ops/cycle.
Any way you cut it, the Pentium 4 is much slower than Athlon doing non-logical
64-bit ops. (Note that compare -could- be efficient, but compiler output usually
isn't.)
By the way, I attempted measuring the bsf instruction on a Pentium 4. The timing
code which has been consistent and reliable for me on the Pentium 2, K6-2 400,
and my Athlon reports all sorts of timings. As best I can determine, it has a
latency of 5.5-8 cycles and a throughput rate of 1 every 1-3.5 cycles. It is
probably faster (if only by a higher clockrate)...
If you want to try it, here's how I do it in a nutshell:
(The \n\t stuff is omitted for clarity.)
asm("cpuid"
"cpuid"
"cpuid"
"cpuid"
"rdtsc"
"movl %%eax, %0"
// your code here
"cpuid"
"rdtsc"
"subl %%eax, %0"
"negl %0"
: "=r" (delta_time)
:
: "%eax", "%ebx", "%ecx", "%edx");
Since this includes the cost of a cpuid and mov, I also have a variable called
base. It is computed as the above with no code in the middle. I subtract that
out from any results I get.
-Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.