Author: Robert Hyatt
Date: 17:10:19 02/17/03
Go up one level in this thread
On February 17, 2003 at 03:33:14, Matt Taylor wrote:
>On February 16, 2003 at 21:54:00, Robert Hyatt wrote:
>
>>On February 16, 2003 at 20:32:48, Matt Taylor wrote:
>>
>>>On February 16, 2003 at 12:18:11, Robert Hyatt wrote:
>>>
>>>>On February 16, 2003 at 03:03:03, Matt Taylor wrote:
>>>>
>>>>>On February 15, 2003 at 21:28:39, Tom Kerrigan wrote:
>>>>>
>>>>>>On February 13, 2003 at 19:40:45, Matt Taylor wrote:
>>>>>>
>>>>>>>You're not getting it. Logic on the processor for static branch prediction is
>>>>>>>80% accurate because auxillary information available to the compiler is thrown
>>>>>>>out. Consider the following loop:
>>>>>>>for(i = 0; i < 1000; i++)
>>>>>>> do_something();
>>>>>>
>>>>>>You're the one who's not getting it if you think processors have logic for
>>>>>>static branch prediction (hint: processors do dynamic prediction) or if you
>>>>>>think these are the kinds of branches that matter for execution or compilation.
>>>>>>(Any branch prediction scheme would predict your branch with 99.9% accuracy.)
>>>>>
>>>>>Doesn't matter what you call it. AMD seems to think my Athlon has static branch
>>>>>prediction. I'm not sure why you disagree.
>>>>>
>>>>>No, this is not the typical scenario. I didn't feel it was necessary to list all
>>>>>sorts of loops and show how the compiler can predict them for the sake of
>>>>>argument. Intel C already does a lot of this, and they've only scratched the
>>>>>surface.
>>>>>
>>>>>>>>Sure, you can avoid having an actual branch instruction. I'm asking you to think
>>>>>>>>deeper. How does that make the processor go any faster?
>>>>>>>
>>>>>>>No branch mispredict = no penalty. Not always possible, but it works well for
>>>>>>>short functions such as abs, min, and max. If it were not so, cmov would be a
>>>>>>>near useless instruction.
>>>>>>
>>>>>>That's true, and I forgot about that reason, I guess because branches are only
>>>>>>mispredicted 5% of the time. The reason why predication would be used more
>>>>>>aggressively on an in-order chip (i.e., why it's a big deal on IA-64) is because
>>>>>>it allows post-condition instructions to be issued without dependancies.
>>>>>
>>>>>The cmov instruction still says leagues. Dynamic prediction works very well on
>>>>>predictable branches. Not all branches are predictable.
>>>>>
>>>>>>>>No, more like 12 results and in only one case does the Itanium 2 outperform the
>>>>>>>>P4. And I think I've done a very good job explaining why Crafty runs faster on
>>>>>>>>the I2 than the P4.
>>>>>>>The speed of gcc and perl are rather irrelevant to Chess, aren't they?
>>>>>>
>>>>>>They are if they better represent computer chess than Crafty does. I'd bet most
>>>>>>chess programs out there don't use bitboards (i.e., 64 bit operations) or use
>>>>>>bitboards less than Crafty. Bitboards are almost certainly the reason why Crafty
>>>>>>performs well on I2 vs. the P4.
>>>>>>
>>>>>>-Tom
>>>>>
>>>>>Perhaps it is, perhaps it isn't. Athlon is much more efficient with 64-bit
>>>>>operations than Pentium 4 is, and the Athlon isn't pulling ahead by huge strides
>>>>>(in Crafty).
>>>>>
>>>>>-Matt
>>>>
>>>>
>>>>I can only speak for myself, but I don't have any 64 bit results for the new
>>>>athlons, non-disclosure or not, so I have no idea how my stuff will work there.
>>>
>>>64-bit ALU ops on K8 are 1-cycle (3 ops/cycle). I believe the bsf instruction is
>>>1 cycle slower (9 instead of 8 for reg-reg). I haven't read that anywhere; I'm
>>>inferring that based on the K7's algorithm for bsf and the fact that the K8 is
>>>pretty much the same design.
>>>
>>>For K7, all 32-bit ALU ops are 1-cycle + possible memory stalls. This means
>>>theoretical throughput of 1.5 ops/cycle in 64-bit code. Logical instructions are
>>>ok on the Pentium 4, but shifts and arithmetic are terrible. 64-bit arithmetic
>>>is going to cost 3-5 cycles, and the result won't be available for 10 cycles.
>>>Shifts are equally bad. For Athlon, 64-bit arithmetic costs 2 cycles. A 64-bit
>>>shift is up to 6 cycles. Special forms (like shift by 1) cost as little as 2
>>>cycles.
>>>
>>>In hand-written MMX (compiler does not generate MMX instructions!), Athlon can
>>>also do more concurrent MMX operations than Pentium 4, and they are usually
>>>faster as well. Using the MMX unit to do logical stuff means a theoretical rate
>>>of 3-4 ops/cycle.
>>>
>>>Any way you cut it, the Pentium 4 is much slower than Athlon doing non-logical
>>>64-bit ops. (Note that compare -could- be efficient, but compiler output usually
>>>isn't.)
>>>
>>>By the way, I attempted measuring the bsf instruction on a Pentium 4. The timing
>>>code which has been consistent and reliable for me on the Pentium 2, K6-2 400,
>>>and my Athlon reports all sorts of timings. As best I can determine, it has a
>>>latency of 5.5-8 cycles and a throughput rate of 1 every 1-3.5 cycles. It is
>>>probably faster (if only by a higher clockrate)...
>>>
>>>If you want to try it, here's how I do it in a nutshell:
>>>(The \n\t stuff is omitted for clarity.)
>>>
>>>asm("cpuid"
>>> "cpuid"
>>> "cpuid"
>>> "cpuid"
>>> "rdtsc"
>>> "movl %%eax, %0"
>>>// your code here
>>> "cpuid"
>>> "rdtsc"
>>> "subl %%eax, %0"
>>> "negl %0"
>>> : "=r" (delta_time)
>>> :
>>> : "%eax", "%ebx", "%ecx", "%edx");
>>>
>>>Since this includes the cost of a cpuid and mov, I also have a variable called
>>>base. It is computed as the above with no code in the middle. I subtract that
>>>out from any results I get.
>>>
>>>-Matt
>>
>>
>>what are all the cpuid's for???
>
>One is required for synchronization of the instruction stream. Three more are
>recommended by the Pentium 2 optimization manual for the purposes of timing.
>Apparently the cpuid instruction executes in variable time on the P6 core even
>under the same conditions. Intel recommended executing it 3 times before timing
>to get good results. I follow their recommendation even when not timing
>specifically on a P6-core processor. It never hurts to be paranoid...
>
>-Matt
OK. Thanks. I thought there might be something about the CPUID that I didn't
know about. IE doing it twice does something different than if you do it once.
Glad at least that that is "sane". :)
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.