Author: Vincent Diepeveen
Date: 23:01:36 02/12/03
Go up one level in this thread
On February 13, 2003 at 01:47:40, Matt Taylor wrote: branchless code is not possible for gameplaying programs unless you want to write the shortest program in the world: printf("i resign\n"); Not to mention computerchess there it is trivially not possible. Period. >On February 13, 2003 at 00:56:49, Vincent Diepeveen wrote: > >>On February 13, 2003 at 00:50:18, Matt Taylor wrote: >> >>you will not soon understand chess programs Matt. In diep there is no >>unnecessary branches if i can avoid them without making code too difficult. > >Sigh... > >>i do not know about how you code in C of course. >> >>let me just ask this: what is faster: >> evaluating log n of the code. >> >>Or evaluating a small part of the code branchless? >> >>==> log n >> >>if( general condition ) { >> .. hundreds of eval patterns >> >>} >>else { >> .. hundreds of eval patterns >>} >> >>==> unavoidable branch and jump tables won't be faster either > >Actually there will be a lot of repetition, and it is probably possible to >condense the code significantly using branchless techniques. No guarantees that >it will be faster. > >Again, my disclaimer is that this is not always applicable (fast, small, >efficient), but my point is that branchless code is -possible-. > >-Matt > >>>On February 13, 2003 at 00:26:15, Vincent Diepeveen wrote: >>> >>>>On February 12, 2003 at 00:37:13, Robert Hyatt wrote: >>>> >>>>>On February 11, 2003 at 23:24:43, Tom Kerrigan wrote: >>>>> >>>>>>On February 11, 2003 at 22:39:48, Robert Hyatt wrote: >>>>>> >>>>>>>Your explanation was not bad, but your "no compiler can do this" is dead >>>>>>>wrong. Visit Cray Research, search for their CFT compiler (or their C >>>>>>>compiler) and see if you can find some papers on their optimizing. >>>>>>>They _do_ exactly what you describe. They "lift" (or "hoist") instructions >>>>>>>way back up in the instruction stream so that values are available when needed, >>>>>>>which is _exactly_ what your OOO approach is doing in the hardware. >>>>>> >>>>>>They must be doing this according to static branch prediction, which is maybe >>>>>>80% accurate, not > 90%, and all compilers have scope boundaries for this sort >>>>>>of stuff, i.e., at loops or functions. OOOE has no such restrictions. It's just >>>>>>a stream of instructions. >>>>>> >>>>>No No No. They do much of this with 100% accuracy. Because they make sure >>>>>that the critical instructions are executed in _every_ path that reaches a >>>>>critical point in the data-flow analysis of the program (the dependency graph >>>>>for gcc users)... >>>>> >>>>>BTW OOOE has a huge limit. Something like 40-60 (I don't have my P6/etc >>>>>manuals here at home) micro-ops in the reorder buffer. No way to do any >>>>>OOOE beyond that very narrow peephole, while the compiler can see _much_ >>>>>more, as much as it wants (and has the compile time) to look at... >>>>> >>>>>Someone posted an example of such code (I think it was Matt) showing >>>>>Vincent how to dump branches. That is the idea here. The advantage of >>>>>OOO execution is still around, but it is not as significant. This being >>>>>the fact that the _real_ code doesn't get bigger, while when the compiler >>>>>is busy doing these same kinds of optimizations, it is replicating various >>>>>instructions to be sure they are completed by the time the DG says the >>>>>result is going to be needed. So there is a bit of memory savings when the >>>>>processor does the OOO stuff, and there is the advantage of exposing more >>>>>registers when the real instructions get turned into micro-ops... but at >>>>>least the latter is more a result of a horrible architecture (8 registers) >>>>>as opposed to the fact the OOO execution is a huge boon for other architectures >>>>>that are not so register-challenged... >>>>> >>>>> >>>>> >>>>>>>I would not say that either is particularly "better". They are "different" >>>>>>>with different approaches to the same problem. The advantage of a ia64-type >>>>>>>approach is that you can stretch the VLIW approach quite a ways, while it >>>>>>>gets harder and harder to do it in an OOO architecture. You end up with more >>>>>>>hardware in the reorder buffer logic than you have in the actual pipelines >>>>>>>that do the real computation. >>>>>> >>>>>>Is that causing a problem other than offending some people's sensibilities? The >>>>>>EV8 was going to have 8 int ALUs and it would have been perfectly viable with >>>>>>today's processes. >>>>> >>>>>Sure. But given the choice of OOOE with 8 int alus, or no OOOE with 16 >>>>>int alus and an instruction package large enough to feed them all, I would >>>>>consider the latter seriously... >>>>> >>>>> >>>>> >>>>>> >>>>>>>Perhaps. However the non-OOO Cray has always been right at the top of the >>>>>>>overall performance heap, so that approach can fly as well and it has certainly >>>>>> >>>>>>I don't know much about Crays but a friend of mine told me that he ran some >>>>>>uniprocessor tests on a Cray and it was roughly as fast as a fast 486. Have any >>>>>>Crays been built in the last several years using actual Cray processors? >>>>> >>>>>Your friend was nuts. The first Cray-1 at 12.5ns clock would blow off any >>>>>486 ever made. That machine could do 80 mips, which would not be bad for >>>>>a 486. But it could do 6-8 _operations_ per clock cycle, and those are >>>>>64 bit floating point operations. The 486 has no chance. >>>> >>>>16 processor 100Mhz Cray with cray blitz ==> 500k nps >>>>(i remember you posting here it could do 29 integer instructions a clock. >>>>now you post 6-8 operations a clock. still good compared to the 1 or 2 >>>>the 486 can do a second. but your cray blitz didn't use them at all). >>>> >>>>16 processor 486 100Mhz == 1.6ghz >>>> >>>>1.6Ghz K7 crafty ==> 1.2 MLN nodes a second. >>> >>>Bad comparison. >>> >>>>So let's compare the totals again: >>>> >>>>1.6Ghz of Cray processing power with very fast RAM and doing 6-8 operations a >>>>clock according to your latest quote ==> 500k nps >>>> >>>>1.6Ghz K7 doing at most 3 instructions a clock and like 300 cycles latency to >>>>get a 64 bytes cache line ==> 1.2MLN a second >>>> >>>>For chessprograms which have a lot of branches and those are *unavoidable*, >>>>Latest Cray processor released is clocked at 1Ghz, so a 1 Ghz McKinley beats >>>>that for chess programs hands down. >>> >>>No, not unavoidable. You have to think outside the box to figure out how to >>>resolve it without branching. Some people claim the box is all that exists; I >>>have been outside and I can say that most branches (particularly your eval ones) >>>are easy to resolve without branching at all. >>> >>>Again, basic strategy is as follows: >>> >>>convert: >>> >>>if (cond) >>> val += computation; >>> >>>to: >>> >>>val += (cond ? computation : 0); >>> >>>which reduces to: >>> >>>val += -cond & computation; >>> >>>Correct me if I'm wrong, but I don't see any branch there. The condition can be >>>computed (0 or 1) without branching, and the computation can be computed without >>>branching. The resulting code is functionally equivalent to the original >>>branching version. It is useful for shorter branches usually. You can compute >>>the cost of either case based on probabilities and weights and use it to pick >>>the faster version. I hope it would be an insult to any reader's intelligence to >>>walk through the probability stuff. >>> >>>>Even old 18.xx crafties get like 1.5MLN nodes a second or so if i remember well >>>>what you posted here. >>><snip> >>> >>>Using more than 1 CPU. In Crafty, my dual AthlonMP 2000 (2x1.67 GHz) was >>>approximately as fast as a single 1 GHz McKinley. I'd wager that the AthlonXP >>>3000 (Barton, 2.13 GHz) and AthlonXP 2800 (Tbred-B, 2.25 GHz) can't tie that >>>score. I plan on replacing an old Thunderbird 1.2 GHz with an AthlonXP 2500 >>>within a month. We can extrapolate the score. >>> >>>-Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.