Author: Robert Hyatt
Date: 19:47:01 02/09/03
On February 09, 2003 at 03:21:45, Tom Kerrigan wrote:

>On February 09, 2003 at 00:14:46, Matt Taylor wrote:
>
>>On February 07, 2003 at 08:09:23, Tom Kerrigan wrote:
>>
>>>On February 07, 2003 at 03:10:46, Matt Taylor wrote:
>>>
>>>>There is another subtle difference, too; IA-64 is heavily optimized in
>>>>software whereas IA-32 is heavily optimized in hardware. On IA-64 it is
>>>>possible to get closer to the theoretical 6 instructions per clock than it
>>>>is on IA-32.
>>>
>>>Possibly only because it runs at a much lower clock speed.
>>
>>Um, possibly because that is the philosophy in VLIW chip design...
>>
>>I stick a bunch of execution units (carefully picked, of course) in my CPU,
>>just as I would if I were building the next Pentium. The difference is that I
>>don't waste a lot of transistors on reordering and such to get more
>>parallelism; I just let the compiler optimize for my specific mix.
>>
>>IA-64 comes much closer to its theoretical speed because of things like
>>predication and its loop counter. (Plus it uses a register stack like SPARC.)
>
>You're assuming that software scheduling does a better job than hardware
>scheduling, but you have no data to back up that assumption. Prefetching and
>predication are very poor substitutes for out-of-order execution. They make
>writing software (or at least compilers) more difficult, and they often waste
>valuable memory bandwidth and execution units.
>
>As for the SPARC register stack, it's widely accepted that it doesn't
>significantly improve performance, and it makes the register file big enough
>to hurt clock speed (which is one of the main reasons why IA-64 chips are
>clocked so slow). It all but prevents register file duplication or caching,
>like in Alphas...

The claim to fame for the SPARC approach is simply "fast procedure calls": no
register saving or restoring. It was a necessary trade-off, since the first
SPARCs didn't have hardware integer multiply/divide; every multiply or divide
was a library routine, which made procedure calls very frequent.

>>No, actually. I have never used a McKinley; I've only seen it on paper.
>>Still, the P4 3.06 GHz has 512K of L2 cache, and the McKinley has 3 or 6 MB.
>>Now I can't remember whether 6 MB is Itanium-III or McKinley.
>
>Doesn't matter for computer chess. Every program I know about (with the
>exception of HIARCS) has a working set of < 256K.

I have one that doesn't fit your working-set limit. My attack lookup tables
are [64][256] arrays of 8-byte entries, which works out to 128K bytes each if
my math is right (a sketch of the arithmetic follows at the end of this post).
I can point out four of those that are used _everywhere_, and that is only a
start. I'd suspect my "working set" comes closer to 1-2MB for the engine
alone...

>>>>significant portions of the CPU core are dedicated to MMX/SSE and no
>>>>compiler can generate MMX/SSE code, but an astute assembly programmer can
>>>>write code
>>>
>>>The Intel compiler can generate SSE2 (instead of x87) for floating-point
>>>calculations. I believe gcc has library functions that make use of MMX.
>>
>>This is not the same as saying "the compiler can vectorize code." I can
>
>Right. You said generate MMX/SSE code, not vectorize code.
>
>>MMX alone eats more than 10% of an older Athlon die -- about 4M transistors
>>on a 42M transistor chip. 10% is pretty significant.
>
>Where did you get that number?
>
>-Tom
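
A quick check of the table-size arithmetic above, as a minimal C sketch. The
[64][256] shape and 8-byte entries come from the post itself; the typedef and
the four table names are hypothetical placeholders, not the engine's actual
identifiers:

    /* Sketch of the working-set arithmetic from the post above. The
       [64][256] shape and 8-byte entries are from the post; the typedef
       and table names are made up for illustration. */
    #include <stdio.h>

    typedef unsigned long long bitboard;  /* 8 bytes: one bit per square */

    /* One attack table per board rotation: 64 squares x 256 occupancy
       states, 8 bytes per entry. */
    static bitboard attacks_r0[64][256];
    static bitboard attacks_rl90[64][256];
    static bitboard attacks_rl45[64][256];
    static bitboard attacks_rr45[64][256];

    int main(void)
    {
        /* 64 * 256 * 8 = 131072 bytes = 128K per table, as claimed. */
        printf("one table:   %zu bytes\n", sizeof(attacks_r0));
        /* Four such tables are 512K -- already past a 256K working set. */
        printf("four tables: %zu bytes\n", 4 * sizeof(attacks_r0));
        return 0;
    }

Running this prints 131072 and 524288 bytes, so four such tables alone already
exceed the 256K working-set figure quoted above.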
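On the MMX/SSE exchange above: the distinction being argued is between a
compiler emitting scalar SSE2 floating-point code and actually vectorizing a
loop. A minimal C sketch of the kind of loop where that difference shows up
(the function name and signature are illustrative only):

    /* A loop a 2003-era compiler could compile to scalar SSE2 (mulsd/addsd,
       one double per iteration) without vectorizing it. A vectorizing
       compiler -- or an astute assembly programmer -- could instead use the
       packed forms (mulpd/addpd), processing two doubles per instruction. */
    void axpy(double *y, const double *x, double a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
    }

Both outputs are "SSE2 code" in the sense Tom uses, but only the packed form
is vectorized in the sense Matt means.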