Author: Alessandro Damiani
Date: 09:21:56 06/10/98
Go up one level in this thread
On June 10, 1998 at 11:34:37, Vincent Diepeveen wrote: > >On June 08, 1998 at 22:50:16, Robert Hyatt wrote: > >[about generating speed] > >I had something left to post. After we talked at ICC, i got some >questions >so i'll put the comparision here too. > >At my Pro200 crafty is indeed faster, but that has a different reason: >i'm running a crafty version which is compiled with microsoft visual >c++, >just like my program is, and at your pro you compile under linux, so >you may be glad that there is a compiler for that platform called GCC. > >So i'll use that msvc++ version to compare with. >I took 2 positions. > >First i took openingsposition, as we start the game always over there, >then i secondly also took BK22 as you liked to see. > >Same test also done at a PII-266 EDORAM by Bradley Woodward, >to see how fast a PII is relatively (intel says PII processor is in fact >a >pentium pro processor so we should see the same speed differences) >to the programs. > >Openingsposition Pentium Pro EDO 60ns: > Crafty : 2,103,049 > Diep : 3,081,664 46.5% faster > >BK22 Pentium Pro EDO 60ns: > Crafty : 2,941,176 > Diep : 3,699,552 25.8% faster > >Openingsposition PII-266 EDO > Crafty : 2,557,544 > Diep : 3,906,250 52.7% faster > >BK22 PII-266 EDO > Crafty : 3,470,031 > Diep : 4,874,446 40.5% faster > >Note that i store a move with 4 ints, where crafty stores it in 1 int. >So i'm doing 4 times more and i do that about 50% faster. > >266/200 * 3669552 = 4880504 which equals 4874446 almost. >266/200 * 3081664 = 4098613 which is almost 200K slower. > >So we can say that diep gets faster when Mhz gets faster, where >crafty has big problems to become faster when Mhz gets faster. > >How comes? > >Lonnie tested the K6-266 SDRAM >risc machine to get 3.34M generations a second >at openings position, and at 300Mhz (100Mhz bus) it gets 3.75M >generations >a second. > >Still not getting near PII speed, but considering its price a very good >processor. > >>one thing I didn't explain that deserves a comment. Probably the >>fastest >>way I know to generate moves is the "move table" approach, which lets >>you >>almost reduce a move generator to a memory-to-memory copy. which on a >>PC >>is not a great thing. IE on the pentium pro/200mhz machine, a memory >>read takes about 20 clock cycles (100ns). Ditto for each memory write. >>So a memory copy runs at a rate of one word every 200ns, or about 5 >>million words per second, max. > >>When we take programs like darkthought or crafty (or other bitmap >>programs) to a 64 bit architecture, that has a FirstOne() or LeadZ() >>type instruction, then the bitmap generators get a big boost.. because >>we only have to do 1/2 the memory traffic (stores only) since the moves >>come from the 64bit register rather than memory. And while the CPU is >>waiting on the slow memory write, it can cycle back thru the loop, and >>be ready for the next store as soon as the current one finishes >>(actually, >>good architectures can handle a backlog of delayed writes, but they >>eventually fill the delayed write queue and stall. But the point is, >>a bitmap program will be able to drive memory at 10M words per second >>(assuming P6/200 memory speeds) which is 10M moves per second, rather >>than the read/write copy scheme that will be only 1/2 the speed. >> >>Once we get the firstone in hardware so we don't have to waddle off to >>a procedure for every move... > >Note that firstone is already in the hardware instruction at 386. >Called BSF and BSR. > >Speed of BSF/R at pentium pro: 2/3 micro-op and non pairable, where >normal >instructions like MOV are 1 micro-op and pairable. > >But supposing that this cache-to-register-copy gets as fast as you >want that firstone to be, then this cache approach always beats you, and >the facts are there. Please don't tell stories about why i could not >get my generation table out of cache but need to get it out of slow >memory, >caches get bigger and i have done EFFORT to get my tables within L2 >cache, and in the future that'll be in L1 cache too, where in the future >that firstone of your still is that slow at intel (and suppose BSF and >BSR >get faster then you still need the same slow loop to get the squares out >AFTER you have way more overhead than i did, where i get the squares >directly). > >Problems of bitboards are not the BSF and BSR instructions. >problem is that you still need that loop to get the squares out and that >you need to program for EVERY piece, where i with 1 general purpose >routine can generate all pieces. > wrong! pawn moves are generated differently: one shift, one and, and you have all to-squares at once. So we skip blocked pawns. Pawn captures in the similar way. Here bitboards are best. >If i would program for generation speed and store it in 1 int and >program >everything for every piece then i'm of course way more faster than the >current speed diff. > >The second problem of bitboards is that you can do only a limited number >of things with it. > >Suppose i have a table with mobility values for it for every square, >where >values are generally different, as controlling a1 is not as important >like controlling d4 is. Quite logical. > >Can you show some sample code how you count this with bitboards? >That's pure horror in my opinion. very simple (using my board representation: no rotated bitboards): int MobKnight[64]; int MobBishopDown[64][256]; int MobBishopUp[64][256]; int MobRookHorizontal[64][256]; int MobRookVertical[64][256]; this tables are precomputed and contains the mobility score. evaluating: mobility of a knight on square s: MobKnight[s] mobility of a bishop on square s: MobBishopDown[s][SlideIndexDD(s)]+ MobBishopUp[s][SlideIndexDU(s)] mobility of a rook on square s: MobRookHorizontal[s][SlideIndexH(s)]+ MobRookVertical[s][SlideIndexV(s)] no loops! the cost is O(1)! > >Further i use in my evaluation attacktables which consists out of a >number >of things, like what piece attacks it and the number of attackers. > >The number of attackers is something i use a lot in my eval. >I have this number in an array int A[64], which definitely is in the >cache. >So i lookup this with A[square]. > >Now i think this is also a major disadvantage of bitboards. To look it >up with bitboards you need to add all kinds of different things with >masses >of 'if'. why 'if's? to get #attackers to a square the cost is O(1). > >I'm not asking to show how you do this in bitboards, that would give you >nightmares. If you call a function then show the audience how much >things that function does, JUST TO CALCULATE IT FOR 1 SQUARE, >where i have all 64 ready. > >Where you are wating till you get an alpha for free, i'm waiting for a >processor >where branch prediction doesn't take that much systemtime when it >mispredicts. > >I have so many compares in my eval which takes 90% of systemtime, that >i'm totally depending on the speed of the processor after misprediction. > >That would speed up my program several times, where bitboards would >slow it down another 4 times or so, and take 2 times more code too, >which >causes again some extra delay. > >Greetings, >Vincent
This page took 0.01 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.