Computer Chess Club Archives


Subject: Re: 64-bit machines

Author: Matt Taylor

Date: 16:36:48 02/09/03


On February 09, 2003 at 03:21:45, Tom Kerrigan wrote:

>On February 09, 2003 at 00:14:46, Matt Taylor wrote:
>
>>On February 07, 2003 at 08:09:23, Tom Kerrigan wrote:
>>
>>>On February 07, 2003 at 03:10:46, Matt Taylor wrote:
>>>
>>>>There is another subtle difference, too; IA-64 is heavily optimized in software
>>>>whereas IA-32 is heavily optimized in hardware. In IA-64 it is possible to
>>>>achieve rates closer to the theoretical 6 instructions per clock than it is on
>>>>IA-32.
>>>
>>>Possibly only because it runs at a much lower clock speed.
>>
>>Um, possibly because that is the philosophy in VLIW chip design...
>>
>>I stick a bunch of execution units (carefully picked, of course) in my CPU, just
>>as I would if I were building the next Pentium. The difference is that I don't
>>waste a lot of transistors on reordering and such to get more parallelism; I
>>just let the compiler optimize for my specific mix.
>>
>>IA-64 comes much closer to theoretical speed because of things like predication
>>and its loop counter. (Plus it uses a register stack like Sparc.)
>
>You're assuming that software scheduling does a better job than hardware
>scheduling but you have no data to back up that assumption. Prefetching and
>predication are very poor substitutes for out-of-order execution. They make
>writing software (or at least compilers) more difficult and they often waste
>valuable memory bandwidth and execution units.

The Pentium 4 makes a pretty good counter-argument here. The compiler has a
temporal advantage: unlike a hardware scheduler, it does not have to produce a
result in real time. That alone should make the conclusion obvious. More time
lets the compiler examine a much broader range of optimizations.

I'm not sure why you look down on predication; is branch misprediction a better
solution? -Many- branches are very short and involve similar computation on
both paths. I did not even mention prefetching because it has been present in
IA-32 since the K6-2 and Pentium 3, and yes, it is difficult for a compiler to
take advantage of it.
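
(To make the prefetching point concrete, here is a sketch using the SSE
prefetch intrinsic. The 64-byte line size and the lookahead distance are
assumptions for illustration; tuning them per machine is exactly the part
compilers struggle with.)

#include <xmmintrin.h>  /* _mm_prefetch, exposed by SSE-era compilers */

/* Sum an array while hinting the cache to pull lines in ahead of use.
 * Assumes 64-byte cache lines (16 ints); both the line size and the
 * 64-element lookahead are tuning guesses, not universal constants. */
long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        /* prefetch each upcoming line once, four lines ahead */
        if ((i & 15) == 0 && i + 64 < n)
            _mm_prefetch((const char *)&a[i + 64], _MM_HINT_NTA);
        sum += a[i];
    }
    return sum;
}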

I'm not sure how you conclude that predication is difficult to implement,
either; I can see easy ways to do it. I'm also not sure how you conclude that
predication wastes memory bandwidth -- a short branch issues the same
instructions plus the branch itself, and both paths usually get loaded as part
of the same cache line anyway. The predicated version uses fewer instructions
and actually reduces the bandwidth requirement.
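
(Here is the kind of short branch I mean, in C. The second version is what
predication -- or a cmov on IA-32 -- amounts to: both arms execute, a
predicate selects the result, and there is no branch left to mispredict. The
mask trick assumes arithmetic right shift on signed ints, which holds on these
targets.)

/* Branchy form: a misprediction here costs a pipeline flush. */
int abs_branchy(int x)
{
    if (x < 0)
        return -x;
    return x;
}

/* Predicated form: both computations happen unconditionally and the
 * predicate (mask) selects the outcome, branch-free. */
int abs_predicated(int x)
{
    int mask = x >> 31;        /* all ones iff x < 0 */
    return (x ^ mask) - mask;  /* conditionally negate */
}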

>As for the SPARC register stack, it's widely accepted that it doesn't
>significantly improve performance and it makes the register file big enough to
>hurt clock speed (which is one of the main reasons why IA-64 chips are clocked
>so slow). It all but prevents register file duplication or caching, like in
>Alphas...

As I understand it, IA-64 uses a register stack like Sparc's register windows,
but unlike Sparc it has hardware (the Register Stack Engine) that spills and
fills the stack automatically in the background. That's a whole lot more
efficient.

>>No, actually. I have never used a McKinley; I've only seen it on paper. Still,
>>the P4 3.06 GHz has 512K of L2 cache, and the McKinley has 3 or 6 MB. Now I
>>can't remember whether 6 MB is Itanium-III or McKinley.
>
>Doesn't matter for computer chess. Every program I know about (with the
>exception of HIARCS) has a working set of < 256k.

Code and data all fit in 256 KB? Impressive. I rarely see that even in programs
an order of magnitude less complex.

No hash tables?
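
(For scale: the entry layout below is hypothetical, but the arithmetic holds
for any engine with a transposition table. Even a modest table dwarfs a 256 KB
working set.)

#include <stdio.h>
#include <stdint.h>

/* Hypothetical transposition-table entry; exact fields vary by engine,
 * but 16 bytes is a typical size. */
struct tt_entry {
    uint64_t key;    /* Zobrist hash of the position      */
    uint32_t move;   /* best move found                   */
    int16_t  score;  /* evaluation                        */
    uint8_t  depth;  /* search depth of this entry        */
    uint8_t  flags;  /* bound type: exact / lower / upper */
};

int main(void)
{
    size_t entries = 1u << 20;  /* a modest 1M-entry table */
    printf("%zu bytes/entry -> %zu MB table\n",
           sizeof(struct tt_entry),
           entries * sizeof(struct tt_entry) >> 20);
    return 0;
}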

>>>>significant portions of the CPU core are dedicated to MMX/SSE and no compiler
>>>>can generate MMX/SSE code, but an astute assembly programmer can write code
>>>
>>>The Intel compiler can generate SSE2 (instead of x87) for floating point
>>>calculations. I believe gcc has library functions that make use of MMX.
>>
>>This is not the same as saying "the compiler can vectorize code." I can
>
>Right. You said generate MMX/SSE code, not vectorize code.

They are generally the same thing. With the exception of the SSE scalar ops,
the MMX/SSE instruction sets are all vector operations, so generating that
code means vectorizing.
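
(To make the distinction concrete: emitting scalar SSE2 in place of x87 is
easy; what a compiler could not do automatically was turn the first loop below
into the second. The alignment and size assumptions are noted in the
comments.)

#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar: one float per iteration; any compiler can generate this. */
void add_scalar(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Vectorized: four floats per addps. Producing this automatically is
 * what "the compiler can vectorize code" actually means.
 * Assumes 16-byte-aligned arrays and n divisible by 4. */
void add_vector(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&dst[i], _mm_add_ps(va, vb));
    }
}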

>>MMX alone eats more than 10% of an older Athlon die -- about 4M transistors on a
>>42M transistor chip. 10% is pretty significant.
>
>Where did you get that number?

I was wrong about the 42M transistor count; Palomino is 37.5M. I think the 42M
was for the Willamette (or Northwood?) P4.

I read it a while back. The article may have been discussing die size increases;
I really don't remember. 4M transistors is on the same order as the 3M figure
you posted, so it's reasonable. A much larger part of the die is the cache, but
in terms of execution units themselves, MMX/SSE are extremely significant.

In the case of Athlon, SSE and 3DNow! are really the same thing. The
transition from Thunderbird to Palomino added prefetching logic and SSE, yet
the transistor count only increased by 0.5M. I don't have any data to support
it, but I suspect that SSE ops get processed as two internal 3DNow! ops; an
SSE instruction has twice the latency of its equivalent 3DNow! instruction.
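
(Purely a conceptual model of that suspicion, not real hardware code: if the
FP datapath is 64 bits wide, a 128-bit SSE op decomposes naturally into two
3DNow!-sized halves, which would explain the doubled latency.)

/* Model: a 3DNow! register holds two floats; an SSE register is two of those. */
typedef struct { float f[2]; } vec64;     /* one 3DNow!-width op  */
typedef struct { vec64 lo, hi; } vec128;  /* one SSE xmm register */

/* Models a single 3DNow! pfadd micro-op. */
static vec64 pfadd(vec64 a, vec64 b)
{
    vec64 r = { { a.f[0] + b.f[0], a.f[1] + b.f[1] } };
    return r;
}

/* Models an SSE addps cracked into two internal 3DNow!-width micro-ops. */
static vec128 addps(vec128 a, vec128 b)
{
    vec128 r = { pfadd(a.lo, b.lo), pfadd(a.hi, b.hi) };
    return r;
}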

-Matt


