Author: Matt Taylor
Date: 22:13:31 02/19/03
On February 19, 2003 at 18:20:16, Tom Kerrigan wrote:

>On February 19, 2003 at 02:53:13, Matt Taylor wrote:
>
>>On February 18, 2003 at 13:33:12, Tom Kerrigan wrote:
>>
>>>On February 16, 2003 at 03:03:03, Matt Taylor wrote:
>>>
>>>>On February 15, 2003 at 21:28:39, Tom Kerrigan wrote:
>>>>
>>>>>They are if they better represent computer chess than Crafty does. I'd bet most
>>>>>chess programs out there don't use bitboards (i.e., 64 bit operations) or use
>>>>>bitboards less than Crafty. Bitboards are almost certainly the reason why Crafty
>>>>>performs well on I2 vs. the P4.
>>>>
>>>>Perhaps it is, perhaps it isn't. Athlon is much more efficient with 64-bit
>>>>operations than Pentium 4 is, and the Athlon isn't pulling ahead by huge strides
>>>>(in Crafty).
>>>
>>>How do you figure the Athlon is more efficient? And what do you mean by
>>>operations? ANDing, ORing, etc.? How about loading, shifting, BSF, popcount,
>>>etc.?
>>
>>How about faster in MMX, faster in shifting, and faster in arithmetic? An MMX
>>implementation will be slower on the Pentium 4. Code written in C to do the
>>equivalent will involve much shifting and arithmetic which will also be slower
>>on the Pentium 4. As far as I can tell, the bsf instruction on P4 is not really
>>any faster than Athlon's, but it is hard to say.
>>
>>Logical ops have equal cycle counts on both processors making the P4's higher
>>clockrate advantageous; however, logical ops are hardly the only bitboard
>>manipulations.
>
>So the Athlon is much faster, except that it's the same for logical ops and bsf.
>And we apparently don't know if Crafty uses MMX, where the Athlon would have an
>advantage (what are the latencies of 64-bit MMX shifts on each chip?), or if it
>breaks the bitboards into 32 bit numbers and does "normal" operations on them,
>in which case I don't see why the Athlon would be any faster than the P4.
>(I'd be surprised if anything gets shifted by very much in Crafty so the P4's lack of
>a wide barrel shifter is probably not a handicap.)
>
>-Tom

MMX is irrelevant to Crafty, since Crafty does not use MMX, but I know Gerd uses MMX/3DNow in IsiChess. Potentially MMX could be used to do 64-bit manipulations on x86 chips. The only reason I mention MMX at all is to show that both compiler-optimized code and hand-optimized code will run better on Athlon.

Latency for ALU & shift MMX ops on Athlon is 2 cycles. The Pentium 4 also has a latency of 2 clocks for ALU & shift ops. The more complex ones (psadbw, pmovmskb, maskmovq, etc.) are nasty on both chips and run up to 10 clocks.

Athlon has a throughput of up to 4 MMX ops/cycle and no restrictions (other than DirectPath) that I know of. Pentium 4 can issue 1 MMX op per cycle to one of 4 MMX units that get shared with SSE/FP. While Athlon can't obtain any single result faster, it can do a lot more computation per cycle.

In 32-bit code, the Pentium 4 is still penalized. Algorithms that select half of the 64-bit quantity to operate on will be slow, as the P4 has a slow setcc (5 lat/1.5 thpt). Intel does not list timing for cmov, but I would presume it is just as bad. On Athlon, the general rule is that any simple integer operation (ALU, shift, conditionals, etc.) is 1 cycle. The cmov and setcc instructions are no exception.

Algorithms that do full 64-bit computation (shifts and arithmetic instructions in particular) will be penalized on the P4 because its shifts are slow and its full adder is also slow. The add and sub instructions are 0.5 cycles lat/thpt, but the adc and sbb instructions are 6-8 lat and 2-3 thpt. In each range, the smaller value is for the reg-imm form and the larger value is for the reg-reg form.

If you look at compiler-generated 64-bit code, it is not very fast. VC uses the algorithms listed in the K7 optimization manual, and for some things it does a particularly bad job of emulating 64-bit computation on x86.
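As a rough illustration of why adc and sbb matter here, the sketch below writes 64-bit add and subtract in C using only 32-bit halves, which is roughly what a 32-bit x86 compiler has to do. The u64_pair type and all function names are made up for this example, and the instruction names in the comments only indicate which x86 instruction each step loosely corresponds to; actual compiler output may differ.

```c
#include <stdint.h>

/* Hypothetical type: a 64-bit value kept as two 32-bit halves,
   much like the edx:eax register pair in 32-bit x86 code. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Split a native 64-bit value into halves (helper for testing). */
u64_pair split64(uint64_t v)
{
    u64_pair p;
    p.lo = (uint32_t)v;
    p.hi = (uint32_t)(v >> 32);
    return p;
}

/* Join two halves back into a native 64-bit value. */
uint64_t join64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit add built from 32-bit pieces. */
u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    uint32_t carry;
    r.lo = a.lo + b.lo;            /* add: low halves, wraps mod 2^32 */
    carry = (r.lo < a.lo);         /* carry out of the low add */
    r.hi = a.hi + b.hi + carry;    /* adc: high halves plus carry */
    return r;
}

/* 64-bit subtract built from 32-bit pieces. */
u64_pair sub64(u64_pair a, u64_pair b)
{
    u64_pair r;
    uint32_t borrow = (a.lo < b.lo);  /* borrow out of the low subtract */
    r.lo = a.lo - b.lo;               /* sub: low halves */
    r.hi = a.hi - b.hi - borrow;      /* sbb: high halves minus borrow */
    return r;
}
```

Every 64-bit add or subtract costs one plain add/sub plus one adc/sbb, so the slow instruction sits on the dependency chain of every such operation.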
Add and subtract are relatively straightforward, but the adc and sbb instructions they need are very slow on the Pentium 4. The shld and shrd instructions are 6 clocks on Athlon, and I have no idea how fast they are on Pentium 4. Intel does not list timings for them. I would guess that they are at least 6 cycles, probably much worse due to a crippled shifter.

I examined GCC 3.2 code to compare with VC's. GCC did not resort to a function call to shift, but it was still not pretty. GCC generates this for shift left by 1:

    ; low = eax
    ; high = edx
    shld edx, eax, 1
    add eax, eax

Shift left by constant (< 32):

    shld edx, eax, imm
    shl eax, imm

Shift left by constant (>= 32):

    xor edx, edx
    shl eax, imm - 32
    ; Now eax and edx are swapped: eax holds the high half, edx the (zero) low half

Code optimized for Athlon or Pentium 4 is still just as inefficient.

-Matt
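For readers who prefer C to assembly, the three GCC shift sequences above can be mirrored as a sketch over two 32-bit halves. The u64_pair type and the shl64 name are made up for this example, and the comments only indicate which instruction each line loosely corresponds to. The n == 0 case is handled separately because shifting a 32-bit value by 32 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: lo plays the role of eax, hi the role of edx. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Shift a 64-bit value, held as two 32-bit halves, left by n bits. */
u64_pair shl64(u64_pair v, unsigned n)
{
    u64_pair r;
    assert(n < 64);                /* larger counts are not handled here */
    if (n == 0) {
        r = v;                     /* nothing to do */
    } else if (n < 32) {
        /* shld edx, eax, n: high half takes the top n bits of the low half */
        r.hi = (v.hi << n) | (v.lo >> (32 - n));
        /* shl eax, n (or add eax, eax when n == 1) */
        r.lo = v.lo << n;
    } else {
        /* shl eax, n - 32: the shifted low half becomes the new high half */
        r.hi = v.lo << (n - 32);
        /* xor edx, edx: the new low half is zero */
        r.lo = 0;
    }
    return r;
}
```

The n < 32 branch is the one that leans on shld, so that is where a slow double-precision shift hurts.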
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.