Author: Matt Taylor
Date: 19:19:12 02/09/03
On February 09, 2003 at 21:01:26, Tom Kerrigan wrote:

>On February 09, 2003 at 19:36:48, Matt Taylor wrote:
>
>>>You're assuming that software scheduling does a better job than hardware
>>>scheduling but you have no data to back up that assumption. Prefetching and
>>>predication are very poor substitutes for out-of-order execution. They make
>>>writing software (or at least compilers) more difficult and they often waste
>>>valuable memory bandwidth and execution units.
>>Pentium 4 makes a pretty good counter-argument here. The compiler has temporal
>>advantages. The compiler does not have to produce a result in real-time. That
>>alone should make the conclusion obvious. More time enables the compiler to
>>examine a broader range of optimizations.
>
>More time for the compiler to try to simulate out of order execution, you mean.

The compiler has time to evaluate many different orderings. Furthermore, the compiler has more flexibility in reordering; the processor can only reorder certain things -- it can't reorder memory accesses, and it can't see that moving this particular instruction ahead one cycle will prevent a cumulative data-dependency stall.

>If static scheduling is better than dynamic, why does McKinley deliver fewer
>SPECint/GHz than the similarly clocked 21364, SPARC64, and PA-RISC 8700 chips?
>
>Also, IA-64's huge register set and strict pairing rules all but rule out SMT,
>which is an incredibly valuable source of ILP.

Actually, last I checked Sparc-64 scored a tad lower than IA-32 at the same clock speed. I was looking just yesterday and noted that the Sparc scores were rather low.

You are right that HT is pointless on VLIW chips. How is this a weakness? It means the chip is already efficient enough that HT would not help it. That is the point of VLIW computing! You don't need things like HT because your machine is -already- efficient. Conversely, IA-32 is weak in the area of efficiency. IA-64 can do up to 6 instructions per cycle; the best IA-32 offers is 3. Again, IA-64 also gets much closer to its maximum potential than IA-32 does by virtue of being statically scheduled.

>>I'm not sure why you look down on predication;
>...
>>I did not even mention prefetching because it is already present
>
>I think both are great and would make great additions to ISAs that don't already
>have them. But being _forced_ to use prefetching and predication to order
>post-branch instructions ahead of branches because your architecture doesn't
>support out of order execution is lame.

IA-64 does not force prefetching any more than IA-32 does. That's a silly notion. Plus, your evaluation of "lame" is rather subjective, don't you think? I could call IA-32 "lame" for being such a convoluted architecture with a handful of registers. IA-32 works, though.

Predication is a nice solution to the problem of branch prediction. So is the IA-64 loop counter.

>>either. I can see easy ways to do predication. I'm also not sure how you
>>conclude that predication wastes memory bandwidth -- small branches will issue
>>the same instructions + branches, and both branches usually get loaded as part
>>of the same cache line. The predicated version uses fewer instructions and
>>actually reduces the bandwidth requirements.
>
>For code, sure. What about predicated loads?

Intel claims that if the predicate is false, the instruction becomes a no-op. It is possible that they implement it by actually doing the computation and then discarding the result.
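To make the predication point concrete, here is a rough C-level sketch (a toy example of my own, not IA-64 code, and the function names are made up). Both candidate values get computed and the unwanted one is simply dropped -- the same compute-and-discard behavior I just described -- and there is no branch left for the predictor to miss:

/* Toy illustration only: a C-level analogy for predication, not IA-64 assembly. */

/* Branchy version -- the hardware has to predict the comparison. */
int clamp_branchy(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

/* "Predicated" version -- both candidates exist and the unwanted one is discarded.
   A compiler can lower the ?: to a conditional move/select, so there is no branch
   to mispredict and no second instruction stream to fetch. */
int clamp_predicated(int x, int limit)
{
    int over = (x > limit);      /* the predicate */
    return over ? limit : x;     /* select; the losing value is just dropped */
}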
Even if they do implement it that way, it makes little difference; the machine can use predication to compute the address and then issue one load instruction. IIRC, an IA-64 bundle only allows one load anyway. It would be more efficient to predicate the computation of the address and then issue a single load.

>>As I understand it, IA-64 uses a stack like Sparc, but unlike Sparc it has
>>hardware that unwinds the stack automatically. That's a whole lot more
>>efficient.
>
>That has nothing to do with the high latency of the register file caused by
>having so damn many registers.

Is this an assumption, or do you have proof? It would be an awkward machine indeed if general register accesses weren't 1-cycle.

I do think 128 general registers is a bit much. 64 would have been nice. That's 14 registers for parameter passing + 32 for locals/outputs + 16 globals.

>>>Doesn't matter for computer chess. Every program I know about (with the
>>>exception of HIARCS) has a working set of < 256k.
>>Code and data all fit in 256 KB? Impressive. I rarely see that even in programs
>>an order of magnitude less complex.
>>No hash tables?
>
>No, don't be stupid. A program's "working set" is the code/data that it accesses
>the vast majority of the time. Of course the program accesses code/data outside
>of its working set, but infrequently enough that it doesn't impact performance.
>If you run a chess program on a 1.9GHz Pentium 4 and it runs 26% faster than it
>does on a 1.5GHz P4, which is the case for most chess programs, you know that
>the program's working set is less than 256k, because the CPU core and its 256k
>L2 cache are the only things that scale linearly with the CPU's clock speed. And
>if you're already getting a linear speedup, adding 6MB of L3 cache won't improve
>on that.

I am aware of what a working set is. Like Russell, I too am curious how lookup tables that are accessed at every node are "infrequently" accessed. I did not mean to imply that the entire application fit in 256 KB. However, lookup tables are generally fairly big, since the tradeoff between performance and memory usually gets resolved in favor of speed at the cost of memory.

>>I read it a while back. The article may have been discussing die size increases;
>>I really don't remember. 4M transistors is on the same order as the 3M figure
>>you posted, so it's reasonable. A much larger part of the die is the cache, but
>
>I didn't post 3M. I said 1M for MMX, and that's including however many
>transistors were necessary to double the Pentium's L1 caches.
>
>-Tom

You said 1M for MMX and 2M for SSE. 1 + 2 = 3. I lump media instructions together because both MMX and SSE are equally useless to compilers. (Of course, I'm ignoring the Intel C exception of using scalar SSE -- not useful to chess programs, and not much of a justification for SSE either when they could have introduced new flat-register FP instructions.)

-Matt
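P.S. To spell out the arithmetic behind the working-set argument quoted above: 1.9 GHz / 1.5 GHz = 1.27, so a program that runs about 26% faster is getting essentially the full clock ratio, which can only happen if it almost never waits on anything slower than the core and its 256k L2.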