Computer Chess Club Archives


Subject: Re: Optimizing C code for speed

Author: Matt Taylor

Date: 11:01:00 01/05/03


On January 04, 2003 at 20:27:45, Bo Persson wrote:

>On January 04, 2003 at 12:59:12, Matt Taylor wrote:
>
>>On January 04, 2003 at 09:26:47, Bo Persson wrote:
>>
>>>On January 03, 2003 at 19:14:35, Matt Taylor wrote:
>>>
>>>>
>>>>That is definitely the best way to learn. Pick up an Intel manual (if you are
>>>>interested in learning Intel assembly), write a little C code, and study the
>>>>compiler output. Once you are familiar with the machine and instruction set, you
>>>>can learn all the assembly optimization tricks out of AMD/Intel optimization
>>>>manuals.
>>>>
>>>>I have a copy of the Intel 386 manual converted to HTML. The architecture hasn't
>>>>changed significantly since the 386.
>>>
>>>No, because that defined the x86 architecture.  :-)
>>>
>>>Unfortunately the instruction timings have changed. Several times. In different
>>>directions.
>>>
>>>Sigh!
>>>
>>>
>>>Bo Persson
>>>bop2@telia.com
>>
>>The timings are always changing, but there is little point anymore in paying
>>attention to specific timings. Many other optimizations (branch elimination,
>>unrolling, vectorization, etc.) pay off big, and if you write code that executes
>>well on P5/P6 core (Pentium, PPro, Pentium 2, etc.), it usually runs well on
>>modern K7/P7 as well.
>
>Sometimes, sometimes not.
>
>My complaints are about inconsistency. Intel first introduced BSR/BSF to get some
>(reasonably) fast bit instructions. Then they made them soooo slow on the P5,
>that they were actually slower than a table lookup. How do you do that?!
>
>Then on the P6 they were extremely fast, to be slow again on the Pentium 4...
>
>On the other hand, MOVZX and MOVSX, which we have been taught *not* to use,
>ever, are now suddenly in the core set for the Pentium 4. Among the rare 13
>instructions that can actually execute 2 per clock per ALU. **)

Well, out-of-order superscalar execution has changed a few things. Past the
Pentium, chips have trouble with partial registers because they rename registers
and run deeper pipelines to reach higher frequencies. If you write ax, ah, or al
on a Pentium, the processor can merge it back into eax in the same cycle. Not so
for P6, Athlon, or Pentium 4. This is where movzx/movsx are vital: they write the
full 32-bit register, so you avoid that overhead.
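
As a rough illustration in C (a sketch, not code from this thread): widening each
byte into a full 32-bit variable before using it is the source-level form of this
advice. x86 compilers commonly turn these loads into movzx or movsx, though that
depends on the compiler and its flags. The function names are made up for the
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum unsigned bytes. Each byte is widened to 32 bits before it is used,
   so the arithmetic never touches a partial register in the source.
   A compiler will typically (not necessarily) load it with movzx. */
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t b = p[i];      /* zero-extend 8 -> 32 bits */
        total += b;
    }
    return total;
}

/* Same idea for signed bytes; the widening load is typically movsx. */
int32_t sum_signed_bytes(const int8_t *p, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t b = p[i];       /* sign-extend 8 -> 32 bits */
        total += b;
    }
    return total;
}
```

Looking at the compiler's assembly output for these two functions is a quick way
to see which widening instruction a given compiler actually picks.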

The bsf/bsr instructions didn't get slower for the Pentium; everything else got
a lot faster. Believe it or not, I've got a routine that can bsf faster than my
Athlon can in microcode, and the routine doesn't use any sort of table. The only
processor in recent times that has had a fast bsf/bsr was the P6 core family
(Pentium Pro/Pentium 2/Pentium 3/Celeron). It is not surprising, either, since
most people have no use for bsf/bsr.
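
The routine mentioned above is not shown in this post. For readers who want to
see what a table-free bit scan can look like, here is one well-known approach,
offered only as a sketch and not as that routine: isolate the lowest set bit,
turn it into a mask of the bits below it, and count those bits with plain ALU
ops. The names bsf32 and popcount32 are chosen for the example, and no claim is
made about how it compares in speed to the hardware instruction on any chip.

```c
#include <stdint.h>

/* Count set bits with shifts, masks, adds and one multiply (no table). */
static uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit sums  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums  */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (v * 0x01010101u) >> 24;                   /* add 4 bytes */
}

/* Index of the lowest set bit of x, like bsf.
   For x == 0 this version returns 32 (bsf itself leaves the result
   undefined in that case). */
uint32_t bsf32(uint32_t x)
{
    uint32_t lowest = x & (0u - x);   /* keep only the lowest set bit */
    uint32_t below  = lowest - 1u;    /* ones in every position below it */
    return popcount32(below);
}
```

For example, bsf32(12) isolates bit 2 (value 4), builds the mask 3, and counts
two set bits, giving index 2.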

The one thing that has always been true is that ALU ops are -fast-. They are the
core instruction set: add, or, adc, sbb, sub, and, xor, cmp, test, jcc, jmp (E9
- direct), mov, shifts/rotates, & lea. Other ALU ops that are usually pretty
fast include xchg, not, neg, setcc, cmovcc, movsx/movzx, push, pop, & mul.

Ever since the 486, ALU ops have been 1 cycle latency when operating on
registers. Going back to the 8086, they were still the fastest instructions the
processor could execute.
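
Because these simple ALU ops are cheap, branch elimination is usually written in
terms of them. A minimal sketch in C, with function names invented for the
example; whether a given compiler emits branch-free code for it (or already does
so for the plain if/else version) is something to check in its output:

```c
#include <stdint.h>

/* Absolute value without a branch, using only shift, xor and sub.
   Works on the unsigned bit pattern, so INT32_MIN yields 2147483648
   instead of overflowing. */
uint32_t abs_branchless(int32_t x)
{
    uint32_t ux   = (uint32_t)x;
    uint32_t mask = 0u - (ux >> 31);  /* 0 if x >= 0, all ones if x < 0 */
    return (ux ^ mask) - mask;        /* negate only when mask is all ones */
}

/* Minimum of two unsigned values without a branch.
   The comparison result (0 or 1) is stretched into a full-width mask. */
uint32_t min_branchless(uint32_t a, uint32_t b)
{
    uint32_t mask = 0u - (uint32_t)(a < b);  /* all ones if a < b */
    return (a & mask) | (b & ~mask);
}
```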

You are right if you come at the problem from the point of view of a legacy x86
programmer. The days of sly assembly tricks on x86 are gone...

>>Also, timings on superscalar processors are less meaningful because of
>>out-of-order execution engines.
>>
>>I'm optimizing some routines (~20-30 instructions) now that execute around 16
>>cycles, and I have only squeezed out 1 cycle over the course of 5 hours. It's
>>just not worth it unless you have a lot of time or no better way to achieve
>>performance.
>>
>>-Matt
>
>**) Limited to a total of 3 by the trace cache...
>
>
>Bo Persson
>bop2@telia.com

Yes, on Pentium 4, but I don't follow how that is relevant...?

-Matt




Last modified: Thu, 15 Apr 21 08:11:13 -0700
