Author: Robert Hyatt
Date: 09:04:38 02/13/03
On February 13, 2003 at 00:26:15, Vincent Diepeveen wrote:

>On February 12, 2003 at 00:37:13, Robert Hyatt wrote:
>
>>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>>
>>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>>
>>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>>wrong. Visit Cray Research, search for their CFT compiler (or their C
>>>>compiler) and see if you can find some papers on their optimizing.
>>>>They _do_ exactly what you describe. They "lift" (or "hoist") instructions
>>>>way back up in the instruction stream so that values are available when
>>>>needed, which is _exactly_ what your OOO approach is doing in the hardware.
>>>
>>>They must be doing this according to static branch prediction, which is maybe
>>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's
>>>just a stream of instructions.
>>
>>No No No. They do much of this with 100% accuracy. Because they make sure
>>that the critical instructions are executed in _every_ path that reaches a
>>critical point in the data-flow analysis of the program (the dependency graph
>>for gcc users)...
>>
>>BTW OOOE has a huge limit. Something like 40-60 (I don't have my P6/etc
>>manuals here at home) micro-ops in the reorder buffer. No way to do any
>>OOOE beyond that very narrow peephole, while the compiler can see _much_
>>more, as much as it wants (and has the compile time) to look at...
>>
>>Someone posted an example of such code (I think it was Matt) showing
>>Vincent how to dump branches. That is the idea here. The advantage of
>>OOO execution is still around, but it is not as significant.
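[Editor's note: the "dump branches" transformation being discussed can be sketched in a few lines of C. This is an illustration, not Matt's actual example (which is not reproduced in this thread), and the function names are hypothetical. Both functions compute the same score, but the second executes the add on every path and selects with a mask, so there is no branch for the predictor, or the compiler's static scheduler, to get wrong:]

```c
/* Branchy version: the bonus is added on only one path, so the
 * compiler cannot schedule the add until the branch resolves. */
int score_branchy(int passed, int base, int bonus) {
    if (passed)
        return base + bonus;
    return base;
}

/* Branch-free version: do the work unconditionally and select the
 * result. This is the effect of a compiler hoisting the instruction
 * into every path (or of a hardware cmov). */
int score_branchless(int passed, int base, int bonus) {
    int mask = -(passed != 0);   /* all ones if passed, else zero */
    return base + (bonus & mask);
}
```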
>>This being the fact that the _real_ code doesn't get bigger, while when
>>the compiler is busy doing these same kinds of optimizations, it is
>>replicating various instructions to be sure they are completed by the time
>>the DG says the result is going to be needed. So there is a bit of memory
>>savings when the processor does the OOO stuff, and there is the advantage
>>of exposing more registers when the real instructions get turned into
>>micro-ops... but at least the latter is more a result of a horrible
>>architecture (8 registers) as opposed to the fact the OOO execution is a
>>huge boon for other architectures that are not so register-challenged...
>>
>>>>I would not say that either is particularly "better". They are "different"
>>>>with different approaches to the same problem. The advantage of a ia64-type
>>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>>gets harder and harder to do it in an OOO architecture. You end up with more
>>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>>that do the real computation.
>>>
>>>Is that causing a problem other than offending some people's sensibilities?
>>>The EV8 was going to have 8 int ALUs and it would have been perfectly viable
>>>with today's processes.
>>
>>Sure. But given the choice of OOOE with 8 int alus, or no OOOE with 16
>>int alus and an instruction package large enough to feed them all, I would
>>consider the latter seriously...
>>
>>>>Perhaps. However the non-OOO Cray has always been right at the top of the
>>>>overall performance heap, so that approach can fly as well and it has
>>>>certainly
>>>
>>>I don't know much about Crays but a friend of mine told me that he ran some
>>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have
>>>any Crays been built in the last several years using actual Cray processors?
>>
>>Your friend was nuts. The first Cray-1 at 12.5ns clock would blow off any
>>486 ever made.
>>That machine could do 80 mips, which would not be bad for a 486. But it
>>could do 6-8 _operations_ per clock cycle, and those are 64 bit floating
>>point operations. The 486 has no chance.
>
>16 processor 100Mhz Cray with cray blitz ==> 500k nps
>(i remember you posting here it could do 29 integer instructions a clock.
>now you post 6-8 operations a clock. still good compared to the 1 or 2
>the 486 can do a second. but your cray blitz didn't use them at all).

First, you have a _real_ problem paying attention. The DTS article was
written using a Cray C90, with 16 processors running at 4.1ns per clock, or
about 250MHz. No idea where your "100MHz x 16" comes from, but it is wrong.
The machine I mentioned as the fastest Cray I ever personally ran on was a
32-processor T90, running at 2ns per clock, or 500MHz.

I have no idea where you got "29 integer instructions per clock," as that is
not a number I have _ever_ used. And I believe that most understand vector
machines and vector chaining, and realize that in a single clock a single
CPU can do at least four simultaneous operations; since the T90 does an
operation on a pair of values rather than on just one, that doubles to at
least 8 operations per clock cycle, per CPU. Which has _nothing_ to do with
instructions per second or anything else, since this is a _vector_
architecture.

>16 processor 486 100Mhz == 1.6ghz
>
>1.6Ghz K7 crafty ==> 1.2 MLN nodes a second.
>
>So let's compare the totals again:
>
>1.6Ghz of Cray processing power with very fast RAM and doing 6-8 operations
>a clock according to your latest quote ==> 500k nps

Using the C90 numbers, that is correct. Using the T90 numbers, that scales
to about 7M nodes per second, as you well know, because I played a match
against a T90 using my quad 700, and the quad 700 got smashed.
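[Editor's note: for readers unfamiliar with vector chaining, the canonical loop that chains well can be sketched in ordinary C. This is an illustration, not Cray code; a vectorizing compiler such as Cray's CFT would turn the loop body into vector multiply and vector add instructions, with the multiply unit's result stream feeding the add unit directly:]

```c
/* axpy: y[i] = a*x[i] + y[i]. Two floating point operations per
 * element; on a vector machine the multiply pipe's output "chains"
 * into the add pipe, so once the pipelines fill, both functional
 * units retire a result every clock. */
void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```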
>1.6Ghz K7 doing at most 3 instructions a clock and like 300 cycles latency
>to get a 64 bytes cache line ==> 1.2MLN a second
>
>For chessprograms which have a lot of branches and those are *unavoidable*,
>Latest Cray processor released is clocked at 1Ghz, so a 1 Ghz McKinley beats
>that for chess programs hands down.

That simply exposes your ignorance of what vector machines are all about,
and I don't believe I can correct that ignorance with a short post here, so
I won't try. Find any good architecture book and read it. Then you will see
why you don't make comparisons like the one above and have people laughing
at the comments.

>Even old 18.xx crafties get like 1.5MLN nodes a second or so if i remember
>well what you posted here.

On what machine? 18.xx got 1M on my quad 700, up to a max of about 1.5M.

>You just gotta know how to use the cray right Bob?

That's the point. It's all about vector operations.

>Best regards,
>Vincent

>>The Cray T932 was the last 64 bit machine they built that I used. And it
>>can produce a FLOP count that no PC on the planet can come within a factor
>>of 10 of, and that is being very generous. 2ns clock, 32 cpus, each cpu can
>>read four words and write two words to memory per clock cycle, and with
>>vector chaining, it can do at _least_ eight floating point operations per
>>cycle per CPU.
>>
>>>>>As for VC not generating cmovs, Matt, using cmovs on the P6 is slower
>>>>>than not using them. That's why they're not generated.
>>>>
>>>>Under what circumstance? It is possible to have totally unpredictable
>>>
>>>Under the circumstance of running on a P6, like I said. The P6 has no real
>>>support for cmovs; it microcodes them down to several uops and the whole
>>>process takes longer than if you just did a branch.
>>
>>I did a branchless FirstOne() in asm a few weeks back here, just to test.
>>It used a cmov, and it wasn't slower than the one with a branch. If the
>>branch is accurately predictable, they should probably break even.
>>If the branch is not very predictable, then the cmov should be faster... I
>>don't have the details handy, but I can't imagine it turning into "several"
>>uops. Two or three, perhaps, assuming we are not talking about a memory
>>reference thrown in, which would add one.
>>
>>>>And I believe the VC _will_ produce CMOV instructions, but you have to
>>>
>>>Eugene can speak to this better than I can, but I don't think VC ever
>>>produces cmovs regardless of how you configure it.
>>>
>>>-Tom
>>
>>I'm not sure why, if that is true. The PIV has an even longer pipeline
>>with a larger branch misprediction penalty...
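[Editor's note: Hyatt's branchless FirstOne() asm is not reproduced in the thread, but the idea can be sketched in portable C. The version below is a standard branchless binary search for the highest set bit of a 64-bit bitboard; the name `first_one` and the MSB-index convention are assumptions, not Hyatt's actual code. Each comparison produces 0 or 1 without a conditional branch, which compilers typically lower to setcc/cmov-style sequences:]

```c
/* Branchless highest-set-bit scan for a 64-bit bitboard.
 * Each step asks "is the value wider than the remaining half?";
 * the comparison yields 0 or 1 with no branch, and that bit steers
 * the shift amount. Returns 0..63; caller must ensure bb != 0. */
int first_one(unsigned long long bb) {
    int r = 0, s;
    s = (bb > 0xFFFFFFFFULL) << 5; bb >>= s; r |= s;
    s = (bb > 0xFFFFULL)     << 4; bb >>= s; r |= s;
    s = (bb > 0xFFULL)       << 3; bb >>= s; r |= s;
    s = (bb > 0xFULL)        << 2; bb >>= s; r |= s;
    s = (bb > 0x3ULL)        << 1; bb >>= s; r |= s;
    return r | (int)(bb >> 1);
}
```

Whether this beats a branchy scan depends, exactly as the thread says, on how predictable the branch would have been.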
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.