Computer Chess Club Archives



Subject: Re: But that PC with crafty would beat Cray Blitz though

Author: Matt Taylor

Date: 22:47:40 02/12/03



On February 13, 2003 at 00:56:49, Vincent Diepeveen wrote:

>On February 13, 2003 at 00:50:18, Matt Taylor wrote:
>
>You will not soon understand chess programs, Matt. In Diep there are no
>unnecessary branches if I can avoid them without making the code too difficult.

Sigh...

>I do not know how you code in C, of course.
>
>Let me just ask this: what is faster:
>  evaluating log n of the code,
>
>or evaluating a small part of the code branchless?
>
>==> log n
>
>if( general condition ) {
>  .. hundreds of eval patterns
>
>}
>else {
> .. hundreds of eval patterns
>}
>
>==> unavoidable branch, and jump tables won't be faster either

Actually, there will be a lot of repetition between those two branches, and it is
probably possible to condense the code significantly using branchless techniques.
No guarantees that it will be faster.

Again, my disclaimer is that this is not always applicable (fast, small,
efficient), but my point is that branchless code is -possible-.
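
For illustration, here is a minimal, self-contained sketch of what I mean. The
terms and weights are invented for the example (they are not from Diep, Crafty,
or any real engine); the point is only that a split like endgame vs. middlegame
can be expressed with a mask instead of a branch:

#include <stdio.h>

/* Hypothetical terms, invented for this example.  Instead of
 *
 *     if (is_endgame)  score += king_activity;
 *     else             score += king_shelter;
 *
 * build an all-ones/all-zeros mask from the 0/1 condition and select
 * between the two terms without a jump. */
static int king_term(int is_endgame, int king_activity, int king_shelter)
{
    int mask = -(is_endgame != 0);   /* 1 -> all ones, 0 -> all zeros */
    return (king_activity & mask) | (king_shelter & ~mask);
}

int main(void)
{
    printf("%d\n", king_term(1, 35, 12));   /* endgame term:    prints 35 */
    printf("%d\n", king_term(0, 35, 12));   /* middlegame term: prints 12 */
    return 0;
}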

-Matt

>>On February 13, 2003 at 00:26:15, Vincent Diepeveen wrote:
>>
>>>On February 12, 2003 at 00:37:13, Robert Hyatt wrote:
>>>
>>>>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>>>>
>>>>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>>>>
>>>>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>>>>wrong.  Visit Cray Research, search for their CFT compiler (or their C
>>>>>>compiler) and see if you can find some papers on their optimizing.
>>>>>>They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
>>>>>>way back up in the instruction stream so that values are available when needed,
>>>>>>which is _exactly_ what your OOO approach is doing in the hardware.
>>>>>
>>>>>They must be doing this according to static branch prediction, which is maybe
>>>>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>>>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just
>>>>>a stream of instructions.
>>>>>
>>>>No No No.  They do much of this with 100% accuracy.  Because they make sure
>>>>that the critical instructions are executed in _every_ path that reaches a
>>>>critical point in the data-flow analysis of the program (the dependency graph
>>>>for gcc users)...
>>>>
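
For what it's worth, the kind of transformation being described might look
roughly like the sketch below (function and variables invented for the example):
the multiply is scheduled ahead of the branch so it executes on every path and
its result is ready by the time the dependency graph says it is consumed.

#include <stdio.h>

/* Invented example.  "before" issues the multiply only behind the branch;
 * "after" shows it hoisted so it is issued on every path and its latency can
 * overlap earlier, independent work.  Both return the same results. */
static int before(int cond, int x, int y, int a, int b)
{
    int r;
    if (cond)
        r = x * y + a;
    else
        r = b;
    return r;
}

static int after(int cond, int x, int y, int a, int b)
{
    int t = x * y;     /* hoisted above the branch */
    int r;
    if (cond)
        r = t + a;
    else
        r = b;
    return r;
}

int main(void)
{
    printf("%d %d\n", before(1, 3, 4, 5, 9), after(1, 3, 4, 5, 9));   /* 17 17 */
    printf("%d %d\n", before(0, 3, 4, 5, 9), after(0, 3, 4, 5, 9));   /* 9 9 */
    return 0;
}
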
>>>>BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
>>>>manuals here at home) micro-ops in the reorder buffer.  No way to do any
>>>>OOOE beyond that very narrow peephole, while the compiler can see _much_
>>>>more, as much as it wants (and has the compile time) to look at...
>>>>
>>>>Someone posted an example of such code (I think it was Matt) showing
>>>>Vincent how to dump branches.  That is the idea here.  The advantage of
>>>>OOO execution is still around, but it is not as significant.  The advantage
>>>>is that the _real_ code doesn't get bigger, whereas when the compiler
>>>>is busy doing these same kinds of optimizations, it is replicating various
>>>>instructions to be sure they are completed by the time the DG says the
>>>>result is going to be needed.  So there is a bit of memory savings when the
>>>>processor does the OOO stuff, and there is the advantage of exposing more
>>>>registers when the real instructions get turned into micro-ops...  but at
>>>>least the latter is more a result of a horrible architecture (8 registers)
>>>>as opposed to the fact the OOO execution is a huge boon for other architectures
>>>>that are not so register-challenged...
>>>>
>>>>
>>>>
>>>>>>I would not say that either is particularly "better".  They are "different"
>>>>>>with different approaches to the same problem.  The advantage of a ia64-type
>>>>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>>>>gets harder and harder to do it in an OOO architecture.  You end up with more
>>>>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>>>>that do the real computation.
>>>>>
>>>>>Is that causing a problem other than offending some people's sensibilities? The
>>>>>EV8 was going to have 8 int ALUs and it would have been perfectly viable with
>>>>>today's processes.
>>>>
>>>>Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
>>>>int alus and an instruction package large enough to feed them all, I would
>>>>consider the latter seriously...
>>>>
>>>>
>>>>
>>>>>
>>>>>>Perhaps.  However the non-OOO Cray has always been right at the top of the
>>>>>>overall performance heap, so that approach can fly as well and it has certainly
>>>>>
>>>>>I don't know much about Crays but a friend of mine told me that he ran some
>>>>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any
>>>>>Crays been built in the last several years using actual Cray processors?
>>>>
>>>>Your friend was nuts.  The first Cray-1 at 12.5ns clock would blow off any
>>>>486 ever made.  That machine could do 80 mips, which would not be bad for
>>>>a 486.  But it could do 6-8 _operations_ per clock cycle, and those are
>>>>64-bit floating point operations.  The 486 has no chance.
>>>
>>>16-processor 100 MHz Cray with Cray Blitz ==> 500k nps
>>>(I remember you posting here that it could do 29 integer instructions a clock.
>>>Now you post 6-8 operations a clock. Still good compared to the 1 or 2
>>>the 486 can do a clock. But your Cray Blitz didn't use them at all.)
>>>
>>>16 486 processors at 100 MHz == 1.6 GHz
>>>
>>>1.6 GHz K7 Crafty ==> 1.2 MLN nodes a second.
>>
>>Bad comparison.
>>
>>>So let's compare the totals again:
>>>
>>>1.6 GHz of Cray processing power with very fast RAM, doing 6-8 operations a
>>>clock according to your latest quote ==> 500k nps
>>>
>>>1.6 GHz K7 doing at most 3 instructions a clock, with something like 300 cycles of
>>>latency to get a 64-byte cache line ==> 1.2 MLN a second
>>>
>>>Chess programs have a lot of branches, and those are *unavoidable*. The latest
>>>Cray processor released is clocked at 1 GHz, so a 1 GHz McKinley beats it for
>>>chess programs hands down.
>>
>>No, not unavoidable. You have to think outside the box to figure out how to
>>resolve it without branching. Some people claim the box is all that exists; I
>>have been outside and I can say that most branches (particularly your eval ones)
>>are easy to resolve without branching at all.
>>
>>Again, basic strategy is as follows:
>>
>>convert:
>>
>>if (cond)
>>    val += computation;
>>
>>to:
>>
>>val += (cond ? computation : 0);
>>
>>which reduces to:
>>
>>val += -cond & computation;
>>
>>Correct me if I'm wrong, but I don't see any branch there. The condition can be
>>computed (0 or 1) without branching, and the computation can be computed without
>>branching. The resulting code is functionally equivalent to the original
>>branching version. It is usually useful for shorter branches. You can compute
>>the cost of either case based on probabilities and weights and use it to pick
>>the faster version. I hope it would be an insult to any reader's intelligence to
>>walk through the probability stuff.
>>
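
To make that last step concrete, here is a small self-contained example; the
"passed pawn" condition and the bonus of 20 are invented for illustration, not
taken from any engine:

#include <stdio.h>

/* Branching and branchless versions of the same hypothetical rule:
 * "if the pawn is passed, add a bonus of 20". */
static int bonus_branching(int is_passed)
{
    int val = 0;
    if (is_passed)
        val += 20;
    return val;
}

static int bonus_branchless(int is_passed)
{
    int cond = (is_passed != 0);   /* 0 or 1, computed without a jump */
    return -cond & 20;             /* -1 & 20 == 20,  0 & 20 == 0 */
}

int main(void)
{
    printf("%d %d\n", bonus_branching(1), bonus_branchless(1));   /* 20 20 */
    printf("%d %d\n", bonus_branching(0), bonus_branchless(0));   /* 0 0 */
    return 0;
}

Whether the branchless form wins depends on how predictable the branch is: the
branchless version always pays for the computation, while the branching version
pays the full misprediction penalty only some fraction of the time.
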
>>>Even old 18.xx Crafties get like 1.5 MLN nodes a second or so, if I remember well
>>>what you posted here.
>><snip>
>>
>>Using more than 1 CPU. In Crafty, my dual AthlonMP 2000 (2x1.67 GHz) was
>>approximately as fast as a single 1 GHz McKinley. I'd wager that the AthlonXP
>>3000 (Barton, 2.13 GHz) and AthlonXP 2800 (Tbred-B, 2.25 GHz) can't tie that
>>score. I plan on replacing an old Thunderbird 1.2 GHz with an AthlonXP 2500
>>within a month. We can extrapolate the score.
>>
>>-Matt


