Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: But that PC with crafty would beat Cray Blitz though

Author: Vincent Diepeveen

Date: 21:56:49 02/12/03

Go up one level in this thread


On February 13, 2003 at 00:50:18, Matt Taylor wrote:

you will not soon understand chess programs Matt. In diep there is no
unnecessary branches if i can avoid them without making code too difficult.

i do not know about how you code in C of course.

let me just ask this: what is faster:
  evaluating log n of the code.

Or evaluating a small part of the code branchless?

==> log n

if( general condition ) {
  .. hundreds of eval patterns

}
else {
 .. hundreds of eval patterns
}

==> unavoidable branch and jump tables won't be faster either

>On February 13, 2003 at 00:26:15, Vincent Diepeveen wrote:
>
>>On February 12, 2003 at 00:37:13, Robert Hyatt wrote:
>>
>>>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>>>
>>>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>>>
>>>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>>>wrong.  Visit Cray Research, search for their CFT compiler (or their C
>>>>>compiler) and see if you can find some papers on their optimizing.
>>>>>They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
>>>>>way back up in the instruction stream so that values are available when needed,
>>>>>which is _exactly_ what your OOO approach is doing in the hardware.
>>>>
>>>>They must be doing this according to static branch prediction, which is maybe
>>>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just
>>>>a stream of instructions.
>>>>
>>>No No No.  They do much of this with 100% accuracy.  Because they make sure
>>>that the critical instructions are executed in _every_ path that reaches a
>>>critical point in the data-flow analysis of the program (the dependency graph
>>>for gcc users)...
>>>
>>>BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
>>>manuals here at home) micro-ops in the reorder buffer.  No way to do any
>>>OOOE beyond that very narrow peephole, while the compiler can see _much_
>>>more, as much as it wants (and has the compile time) to look at...
>>>
>>>Someone posted an example of such code (I think it was Matt) showing
>>>Vincent how to dump branches.  That is the idea here.  The advantage of
>>>OOO execution is still around, but it is not as significant.  This being
>>>the fact that the _real_ code doesn't get bigger, while when the compiler
>>>is busy doing these same kinds of optimizations, it is replicating various
>>>instructions to be sure they are completed by the time the DG says the
>>>result is going to be needed.  So there is a bit of memory savings when the
>>>processor does the OOO stuff, and there is the advantage of exposing more
>>>registers when the real instructions get turned into micro-ops...  but at
>>>least the latter is more a result of a horrible architecture (8 registers)
>>>as opposed to the fact the OOO execution is a huge boon for other architectures
>>>that are not so register-challenged...
>>>
>>>
>>>
>>>>>I would not say that either is particularly "better".  They are "different"
>>>>>with different approaches to the same problem.  The advantage of a ia64-type
>>>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>>>gets harder and harder to do it in an OOO architecture.  You end up with more
>>>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>>>that do the real computation.
>>>>
>>>>Is that causing a problem other than offending some people's sensibilities? The
>>>>EV8 was going to have 8 int ALUs and it would have been perfectly viable with
>>>>today's processes.
>>>
>>>Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
>>>int alus and an instruction package large enough to feed them all, I would
>>>consider the latter seriously...
>>>
>>>
>>>
>>>>
>>>>>Perhaps.  However the non-OOO Cray has always been right at the top of the
>>>>>overall performance heap, so that approach can fly as well and it has certainly
>>>>
>>>>I don't know much about Crays but a friend of mine told me that he ran some
>>>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any
>>>>Crays been built in the last several years using actual Cray processors?
>>>
>>>Your friend was nuts.  The first Cray-1 at 12.5ns clock would blow off any
>>>486 ever made.  That machine could do 80 mips, which would not be bad for
>>>a 486.  But it could do 6-8 _operations_ per clock cycle, and those are
>>>64 bit floating point operations.  The 486 has no chance.
>>
>>16 processor 100Mhz Cray with cray blitz ==> 500k nps
>>(i remember you posting here it could do 29 integer instructions a clock.
>>now you post 6-8 operations a clock. still good compared to the 1 or 2
>>the 486 can do a second. but your cray blitz didn't use them at all).
>>
>>16 processor 486 100Mhz == 1.6ghz
>>
>>1.6Ghz K7 crafty ==> 1.2 MLN nodes a second.
>
>Bad comparison.
>
>>So let's compare the totals again:
>>
>>1.6Ghz of Cray processing power with very fast RAM and doing 6-8 operations a
>>clock according to your latest quote ==> 500k nps
>>
>>1.6Ghz K7 doing at most 3 instructions a clock and like 300 cycles latency to
>>get a 64 bytes cache line ==> 1.2MLN a second
>>
>>For chessprograms which have a lot of branches and those are *unavoidable*,
>>Latest Cray processor released is clocked at 1Ghz, so a 1 Ghz McKinley beats
>>that for chess programs hands down.
>
>No, not unavoidable. You have to think outside the box to figure out how to
>resolve it without branching. Some people claim the box is all that exists; I
>have been outside and I can say that most branches (particularly your eval ones)
>are easy to resolve without branching at all.
>
>Again, basic strategy is as follows:
>
>convert:
>
>if (cond)
>    val += computation;
>
>to:
>
>val += (cond ? computation : 0);
>
>which reduces to:
>
>val += -cond & computation;
>
>Correct me if I'm wrong, but I don't see any branch there. The condition can be
>computed (0 or 1) without branching, and the computation can be computed without
>branching. The resulting code is functionally equivalent to the original
>branching version. It is useful for shorter branches usually. You can compute
>the cost of either case based on probabilities and weights and use it to pick
>the faster version. I hope it would be an insult to any reader's intelligence to
>walk through the probability stuff.
>
>>Even old 18.xx crafties get like 1.5MLN nodes a second or so if i remember well
>>what you posted here.
><snip>
>
>Using more than 1 CPU. In Crafty, my dual AthlonMP 2000 (2x1.67 GHz) was
>approximately as fast as a single 1 GHz McKinley. I'd wager that the AthlonXP
>3000 (Barton, 2.13 GHz) and AthlonXP 2800 (Tbred-B, 2.25 GHz) can't tie that
>score. I plan on replacing an old Thunderbird 1.2 GHz with an AthlonXP 2500
>within a month. We can extrapolate the score.
>
>-Matt



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.