Computer Chess Club Archives

Search

Terms
Messages

Subject: Re: But that PC with crafty would beat Cray Blitz though

Author: Vincent Diepeveen
Date: 23:01:36 02/12/03
On February 13, 2003 at 01:47:40, Matt Taylor wrote:

branchless code is not possible for gameplaying programs unless you want to
write the shortest program in the world: printf("i resign\n");

Not to mention computerchess there it is trivially not possible. Period.

>On February 13, 2003 at 00:56:49, Vincent Diepeveen wrote:
>
>>On February 13, 2003 at 00:50:18, Matt Taylor wrote:
>>
>>you will not soon understand chess programs Matt. In diep there is no
>>unnecessary branches if i can avoid them without making code too difficult.
>
>Sigh...
>
>>i do not know about how you code in C of course.
>>
>>let me just ask this: what is faster:
>>  evaluating log n of the code.
>>
>>Or evaluating a small part of the code branchless?
>>
>>==> log n
>>
>>if( general condition ) {
>>  .. hundreds of eval patterns
>>
>>}
>>else {
>> .. hundreds of eval patterns
>>}
>>
>>==> unavoidable branch and jump tables won't be faster either
>
>Actually there will be a lot of repetition, and it is probably possible to
>condense the code significantly using branchless techniques. No guarantees that
>it will be faster.
>
>Again, my disclaimer is that this is not always applicable (fast, small,
>efficient), but my point is that branchless code is -possible-.
>
>-Matt
>
>>>On February 13, 2003 at 00:26:15, Vincent Diepeveen wrote:
>>>
>>>>On February 12, 2003 at 00:37:13, Robert Hyatt wrote:
>>>>
>>>>>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote:
>>>>>
>>>>>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote:
>>>>>>
>>>>>>>Your explanation was not bad, but your "no compiler can do this" is dead
>>>>>>>wrong.  Visit Cray Research, search for their CFT compiler (or their C
>>>>>>>compiler) and see if you can find some papers on their optimizing.
>>>>>>>They _do_ exactly what you describe.  They "lift" (or "hoist") instructions
>>>>>>>way back up in the instruction stream so that values are available when needed,
>>>>>>>which is _exactly_ what your OOO approach is doing in the hardware.
>>>>>>
>>>>>>They must be doing this according to static branch prediction, which is maybe
>>>>>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort
>>>>>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just
>>>>>>a stream of instructions.
>>>>>>
>>>>>No No No.  They do much of this with 100% accuracy.  Because they make sure
>>>>>that the critical instructions are executed in _every_ path that reaches a
>>>>>critical point in the data-flow analysis of the program (the dependency graph
>>>>>for gcc users)...
>>>>>
>>>>>BTW OOOE has a huge limit.  Something like 40-60 (I don't have my P6/etc
>>>>>manuals here at home) micro-ops in the reorder buffer.  No way to do any
>>>>>OOOE beyond that very narrow peephole, while the compiler can see _much_
>>>>>more, as much as it wants (and has the compile time) to look at...
>>>>>
>>>>>Someone posted an example of such code (I think it was Matt) showing
>>>>>Vincent how to dump branches.  That is the idea here.  The advantage of
>>>>>OOO execution is still around, but it is not as significant.  This being
>>>>>the fact that the _real_ code doesn't get bigger, while when the compiler
>>>>>is busy doing these same kinds of optimizations, it is replicating various
>>>>>instructions to be sure they are completed by the time the DG says the
>>>>>result is going to be needed.  So there is a bit of memory savings when the
>>>>>processor does the OOO stuff, and there is the advantage of exposing more
>>>>>registers when the real instructions get turned into micro-ops...  but at
>>>>>least the latter is more a result of a horrible architecture (8 registers)
>>>>>as opposed to the fact the OOO execution is a huge boon for other architectures
>>>>>that are not so register-challenged...
>>>>>
>>>>>
>>>>>
>>>>>>>I would not say that either is particularly "better".  They are "different"
>>>>>>>with different approaches to the same problem.  The advantage of a ia64-type
>>>>>>>approach is that you can stretch the VLIW approach quite a ways, while it
>>>>>>>gets harder and harder to do it in an OOO architecture.  You end up with more
>>>>>>>hardware in the reorder buffer logic than you have in the actual pipelines
>>>>>>>that do the real computation.
>>>>>>
>>>>>>Is that causing a problem other than offending some people's sensibilities? The
>>>>>>EV8 was going to have 8 int ALUs and it would have been perfectly viable with
>>>>>>today's processes.
>>>>>
>>>>>Sure.  But given the choice of OOOE with 8 int alus, or no OOOE with 16
>>>>>int alus and an instruction package large enough to feed them all, I would
>>>>>consider the latter seriously...
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>Perhaps.  However the non-OOO Cray has always been right at the top of the
>>>>>>>overall performance heap, so that approach can fly as well and it has certainly
>>>>>>
>>>>>>I don't know much about Crays but a friend of mine told me that he ran some
>>>>>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any
>>>>>>Crays been built in the last several years using actual Cray processors?
>>>>>
>>>>>Your friend was nuts.  The first Cray-1 at 12.5ns clock would blow off any
>>>>>486 ever made.  That machine could do 80 mips, which would not be bad for
>>>>>a 486.  But it could do 6-8 _operations_ per clock cycle, and those are
>>>>>64 bit floating point operations.  The 486 has no chance.
>>>>
>>>>16 processor 100Mhz Cray with cray blitz ==> 500k nps
>>>>(i remember you posting here it could do 29 integer instructions a clock.
>>>>now you post 6-8 operations a clock. still good compared to the 1 or 2
>>>>the 486 can do a second. but your cray blitz didn't use them at all).
>>>>
>>>>16 processor 486 100Mhz == 1.6ghz
>>>>
>>>>1.6Ghz K7 crafty ==> 1.2 MLN nodes a second.
>>>
>>>Bad comparison.
>>>
>>>>So let's compare the totals again:
>>>>
>>>>1.6Ghz of Cray processing power with very fast RAM and doing 6-8 operations a
>>>>clock according to your latest quote ==> 500k nps
>>>>
>>>>1.6Ghz K7 doing at most 3 instructions a clock and like 300 cycles latency to
>>>>get a 64 bytes cache line ==> 1.2MLN a second
>>>>
>>>>For chessprograms which have a lot of branches and those are *unavoidable*,
>>>>Latest Cray processor released is clocked at 1Ghz, so a 1 Ghz McKinley beats
>>>>that for chess programs hands down.
>>>
>>>No, not unavoidable. You have to think outside the box to figure out how to
>>>resolve it without branching. Some people claim the box is all that exists; I
>>>have been outside and I can say that most branches (particularly your eval ones)
>>>are easy to resolve without branching at all.
>>>
>>>Again, basic strategy is as follows:
>>>
>>>convert:
>>>
>>>if (cond)
>>>    val += computation;
>>>
>>>to:
>>>
>>>val += (cond ? computation : 0);
>>>
>>>which reduces to:
>>>
>>>val += -cond & computation;
>>>
>>>Correct me if I'm wrong, but I don't see any branch there. The condition can be
>>>computed (0 or 1) without branching, and the computation can be computed without
>>>branching. The resulting code is functionally equivalent to the original
>>>branching version. It is useful for shorter branches usually. You can compute
>>>the cost of either case based on probabilities and weights and use it to pick
>>>the faster version. I hope it would be an insult to any reader's intelligence to
>>>walk through the probability stuff.
>>>
>>>>Even old 18.xx crafties get like 1.5MLN nodes a second or so if i remember well
>>>>what you posted here.
>>><snip>
>>>
>>>Using more than 1 CPU. In Crafty, my dual AthlonMP 2000 (2x1.67 GHz) was
>>>approximately as fast as a single 1 GHz McKinley. I'd wager that the AthlonXP
>>>3000 (Barton, 2.13 GHz) and AthlonXP 2800 (Tbred-B, 2.25 GHz) can't tie that
>>>score. I plan on replacing an old Thunderbird 1.2 GHz with an AthlonXP 2500
>>>within a month. We can extrapolate the score.
>>>
>>>-Matt
Re: But that PC with crafty would beat Cray Blitz though Matt Taylor 16:09:42 02/13/03
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.