Author: Johan de Koning
Date: 01:21:35 09/05/03
On September 04, 2003 at 10:36:18, Robert Hyatt wrote:

>On September 04, 2003 at 03:48:31, Johan de Koning wrote:
>
>>On September 03, 2003 at 14:53:23, Robert Hyatt wrote:
>>
>>>I finally found time to run a few tests.
>>>
>>>First, the set-up: I took all my bitmap stuff, added it up, and put it in
>>>a structure that was then set up as an array. IE tdata[128] was an array
>>>of these structures, each element being just enough bytes to hold all of
>>>what I call a "position" (including bitmap boards, hash signature, 50-move
>>>counter, etc.)
>>>
>>>At the top of MakeMove() I added tdata[ply+1]=tdata[ply]; to do the copy
>>>stuff. That's all.
>>>
>>>Some results:
>>>
>>>fine 70 searched to 39 plies deep:
>>>
>>>copy/make: 15.5 seconds, make/unmake: 14.0 seconds (11% overhead)
>>>
>>>kopec 22 searched to 12 plies deep:
>>>
>>>copy/make: 32.4 seconds, make/unmake: 28.3 seconds (14.5% overhead)
>>>
>>>mate position searched 9 plies deep (mate in 10):
>>>
>>>copy/make: 8.9 seconds, make/unmake: 7.5 seconds (18.7% overhead)
>>
>>Thanks for finding the time.
>>
>>I do however have some suggestions to create even more valuable data.
>>(Would you have expected anything else? :-)
>>
>>(a) Name hardware!
>>    Preferably also name the system, software, and compiler.
>
>Sorry. I realized that was omitted last night.
>
>Hardware: Xeon 2.8GHz, 1 GB RAM, running Linux (Red Hat 9). The compiler
>was the latest Intel C compiler, version 7.something.
>
>This is a Dell PowerEdge 2600 dual Xeon box, although I only used one
>processor for the test to avoid SMP variability.
>
>>(b) List the exact positions.
>>    Include some normalish positions.
>>    List N/s.
>>    To avoid the suspicion that fast eval and lazy eval cause these
>>    particular positions to have a particularly high N/s (hence a high
>>    copy cost).
>
>I gave two of the positions, fine 70 and kopec22. The third was a mate
>position from ACC4. Here are the FENs for all three:
>
>fine70:  /k/3p/p2P1p/P2P1P///K/ w
>kopec22: 2r2rk1/1bqnbpp1/1p1ppn1p/pP6/N1P1P3/P2B1N1P/1B2QPP1/R2R2K1 b
>ACC4:    rnbq1b1r/p1pp1p1p/4k3/1p1NP1p1/2QP1p2/5N2/PP1B1KPP/n6R w

Sorry, I couldn't remember if and where I ever archived the Bratko-Kopec set.
I do however recognize the famous mate-in-10 position. :-)

But I was hoping for more common positions that spawn more commonly shaped
searches, with more typical N/s as well. E.g. the most recently archived game
by Crafty after 0, 5, 10, 15, 20, 25 and 30 moves. If that was not an
exceptionally tactical game, of course. :-)

>>(c) Make the copying dependent on a global flag.
>>    E.g. int DoCopyIf12 = 13, to make sure the memory layout is exactly
>>    the same for both runs.
>
>I did that already. That's how the tests were run.
>
>>(d) Run several times.
>>    Just to get an impression of the variance in the timing.
>
>Did that too, as it is something my normal "test" script does due to the
>oddball variability of SMP stuff. The NPS varies _very_ little. Most of the
>runs show no variability at all; every now and then an unfavorable layout in
>memory can cause some minor cache issues to show up, but way under 1%
>variance.
>
>>(e) Check the code (or list it).
>>    To make sure the compiler doesn't do anything stupid (it happens).
>
>I always do that. However, I was sure there would be no tricks here as the
>array of structs is global and I compile each module of the program
>separately, so that no global optimization can find that the copy was not
>needed. I can cut/paste the front of MakeMove() to show that the copy was
>actually done _every_ time MakeMove() was called.
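Just so we are talking about the same thing, this is roughly how I picture
your test setup (only my sketch, not your actual code; tdata, ply and
DoCopyIf12 are the names from your posts, the struct members and everything
else are guessed):

  typedef struct {
      unsigned long long boards[16];  /* bitmap boards (layout guessed)     */
      unsigned long long hash_key;    /* hash signature                     */
      int rule50;                     /* 50-move counter                    */
      /* ... castling rights, en passant square, and so on ...              */
  } POSITION;

  POSITION tdata[128];                /* one full position per ply          */
  int DoCopyIf12 = 13;                /* 12 = copy run, 13 = normal run;
                                         memory layout identical either way */

  void MakeMove(int ply, int move)
  {
      if (DoCopyIf12 == 12)
          tdata[ply+1] = tdata[ply];  /* the whole-struct copy being timed  */
      /* ... the usual incremental make stuff, unchanged ... */
  }

The flag trick keeps the memory layout identical for both runs, so the only
difference being timed is the copy itself (plus whatever code the compiler
turns it into).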
I wasn't worrying about the compiler doing smart things, because that would
only help my side of the dispute. :-) And besides, I think compilers that try
to outsmart the programmer should at least issue a non-maskable warning:

  test.c (35): Warning X9999: futile code annihilated!

I was worrying about stupid things, like generating a huge unrolled loop,
generating a byte-by-byte loop, or calling memcpy(). So yes, please list the
top of MakeMove().

>>>That's not all the story, however. The copy/make approach requires an
>>>extra register everywhere since the data structures have to be accessed
>>>through a pointer (or via an array subscript, same thing). My test case
>>>does not take care of that. But if you were to mark one register as
>>>"unusable" for the compiler, the result would be worse, for certain. Since
>>>these data values are accessed all over the place, a register has to be
>>>used everywhere, which is going to add to the above, significantly.
>>
>>Indeed, "all the story" is always YMMV.
>>On a RISCish machine registers are cheap while absolute addressing is
>>expensive. Hence addressing data through struct pointers is usually faster.
>>On the 386 registers are bloody frikking expensive. But pointers (in
>>general) save you a lot of code space. :-)
>
>Actually, on RISC-type machines direct addressing is not expensive. It is
>simply not possible. IE the SPARC. You have to first put the address in a
>register, then do a load using that register (plus others if needed for
>subscripting, etc.)

That's what I meant by expensive. :-) On a RISC this

  /**/ x += GlobalAtAwkwardAddress[i];

may take 4 or 5 instructions and an extra (cheap) register. On a 386 it takes
only one (long) instruction, provided x and i are registers. But this is of
course a worst-case / best-case comparison.

>I'm always concerned on the X86 since registers are almost missing in action.
>
>>>If it only adds 10% then the above numbers are back to what I originally
>>>saw when I speeded things up by 25% by getting rid of copy/make.
>>
>>A fictional 10% won't suffice to even the balance, since you added
>>prepare_undo to make() and unmake() to search(). A fictional 20% would
>>probably do the job, but I won't buy that.
>
>I'm not sure I follow. The copy added about 15% overhead. If I take out the
>cost of unmaking, then that overhead is going up. At present, with make and
>unmake, the copy _still_ added 15% more, without even considering the issue
>of losing one of four registers for the array accesses all over the place.

OK, let's stick to 15% for the sake of Pete. That would be an upper bound for
the cost of the copying itself and the L1 thrashing it causes. That is,
however, when *added* to a make/unmake architecture. With pure copy/make some
of the 15% will be returned because of no partial state saves (data cache),
less work (clocks and code cache), and reduced complexity (branch prediction
and future bugs). All in all (not counting future bugs) I think it adds up to
about zero (see the PS below for the search shape I have in mind). Of course
this depends on the CPU, sizeof(position), N/s, and tree shape.

Adding a fictional 10% for register pressure is not going to approach your
original 25%, at least not on hardware of 1997 or later. And I'll skip a
possibly lengthy discussion about register pressure by stating that it
matters only in hot spots, where pointers (in general) are beneficial.

>Again, all I can say about this is that for Crafty, on the P6/200, removing
>the copy and adding Unmake() speeded me up 25%.
>This will almost certainly vary program by program, and probably even with
>different hardware. But it seems pretty consistent from the P6 to the PIV,
>based on the simple copy test I ran so far...

You're allowed to repeat that it was a that-and-then-and-there thing, but it
won't add anything to the discussion. For one thing, we will never know, hence
the discussion will not get beyond a level similar to the DBcool/DBsucks
debates. For another thing, it does *not* seem to be consistent with the PIV
(yes, I like to repeat things as well :-), while the PIV is likely the worst
target for copy/make.

... Johan
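PS: to make the "some of the 15% will be returned" point concrete, this is
the search shape I have in mind for pure copy/make. It is only a sketch under
my own assumptions, not code from any real program: FirstMove, NextMove,
ApplyMove and Evaluate are placeholders, and tdata/ply are the names from
your test.

  int Search(int ply, int depth, int alpha, int beta)
  {
      int move, score;

      if (depth == 0)
          return Evaluate(&tdata[ply]);      /* stand-in for the real leaf code */

      for (move = FirstMove(ply); move != 0; move = NextMove(ply)) {
          tdata[ply+1] = tdata[ply];         /* the copy we are arguing about   */
          ApplyMove(&tdata[ply+1], move);    /* no partial state saves needed   */
          score = -Search(ply+1, depth-1, -beta, -alpha);
          /* no UnmakeMove(), no undo stack: tdata[ply] was never touched       */
          if (score >= beta)
              return beta;
          if (score > alpha)
              alpha = score;
      }
      return alpha;
  }

The copy costs the 11-19% you measured, but it also removes the partial state
saves inside make() and the unmake() call from the inner loop, which is why I
claim the total comes out at roughly zero.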