Author: Robert Hyatt
Date: 13:15:54 05/30/04
On May 30, 2004 at 15:41:30, Vincent Diepeveen wrote:

>On May 29, 2004 at 11:30:27, Robert Hyatt wrote:
>
>[snip]
>>See above. _no_ improvement. Raw latency on opteron is 1/2 the raw latency on
>>the K7 and Intel boxes. But mapping adds 2 extra memory accesses on the opteron
>>which does away with any actual advantage...
>>
>>>
>>>Softwarebenches like linbench and such pumping sequential a few gigabytes
>>>through the machine and then divide that by the search time. Then you have
>>>bandwidth. 1/bandwidth = latency they claim.
>>
>>But that is the latency _you_ are quoting when you say opteron is 1/2 the
>>latency of the K7. In your worst-case it is _not_ 1/2. It is the same.
>
>Let's show you the tested facts K7 versus A64:
>Opteron single cpu 2.5 cas versus k7 cas 2.5. Note the k7 has all memory banks
>filled the opteron does *not* it just has a single dimm and is single channel
>and not even dual channel. So actually the latency is better than shown here.
>Quad opteron tested at 120 ns latency for a single cpu in fact when i tried a
>while ago.
>
>E:\dblat>dblat 300000000
>Setting up a random access pattern, may take a while
>Finished
>Random access: 13.156 s, 131.560 ns/access
>Testing same pattern again
>Random access: 13.374 s, 133.740 ns/access
>Setting up a different random access pattern, may take a while
>Finished
>Random access: 13.343 s, 133.430 ns/access
>Testing same pattern again
>Random access: 13.265 s, 132.650 ns/access
>Sequential access offset 1: 0.250 s, 2.500 ns/access
>Sequential access offset 2: 0.484 s, 4.840 ns/access
>Sequential access offset 4: 0.875 s, 8.750 ns/access
>Sequential access offset 8: 1.781 s, 17.810 ns/access
>Sequential access offset 16: 3.375 s, 33.750 ns/access
>Sequential access offset 32: 6.265 s, 62.650 ns/access
>Sequential access offset 64: 6.516 s, 65.160 ns/access
>Sequential access offset 128: 7.000 s, 70.000 ns/access
>Sequential access offset 256: 7.938 s, 79.380 ns/access
>Sequential access offset 512: 9.188 s, 91.880 ns/access
>Sequential access offset 1024: 9.875 s, 98.750 ns/access
>
>Now the dual k7. all banks filled. a-brand memory.
>C:\tries>dblat 300000000
>Setting up a random access pattern, may take a while
>Finished
>Random access: 36.266 s, 362.660 ns/access
>Testing same pattern again
>Random access: 36.406 s, 364.060 ns/access
>Setting up a different random access pattern, may take a while
>Finished
>Random access: 36.250 s, 362.500 ns/access
>Testing same pattern again
>Random access: 36.484 s, 364.840 ns/access
>Sequential access offset 1: 0.906 s, 9.060 ns/access
>Sequential access offset 2: 1.766 s, 17.660 ns/access
>Sequential access offset 4: 3.437 s, 34.370 ns/access
>Sequential access offset 8: 6.891 s, 68.910 ns/access
>Sequential access offset 16: 13.875 s, 138.750 ns/access
>Sequential access offset 32: 19.093 s, 190.930 ns/access
>Sequential access offset 64: 19.156 s, 191.560 ns/access
>Sequential access offset 128: 19.328 s, 193.280 ns/access
>Sequential access offset 256: 19.719 s, 197.190 ns/access
>Sequential access offset 512: 20.437 s, 204.370 ns/access
>Sequential access offset 1024: 21.860 s, 218.600 ns/access
>
>So practical difference for computerchess :
>
>363 / 132 = 2.75 times faster latency for the opteron
>
>On die memory controller isn't that stupid nah?

Never said it was. I _did_ say that if you blow out the TLB on the K7 and on
the Opteron, the average access times are close.

Raw latency on opteron is about 70ns to do _one_ memory read. To read a random
access word, where the TLB fails, requires 5 memory reads (4 page-table lookups
plus the data itself). No way to avoid it, and it is going to cost 350ns.
_period_. On the K7, average latency is about 125ns to do _one_ memory read. To
read a random access word, where the TLB fails, requires 3 memory reads. Or
about 375ns. Those are _real_ numbers, reported by _many_ people including AMD.

I have no idea what your program above does, and really don't care. But the
opteron has a much bigger TLB; if you don't blow it out by referring to at
least 2048 different pages, then you are not comparing apples to apples.
Opteron has 1024 TLB entries.
Enough to efficiently address 4 megs of RAM (1024 * 4kb pages). Or if your O/S
is smart enough, 2 gigs of ram with 1024 entries * 2M page size. But for true
non-TLB assisted random accesses, it is 350ns period. There is absolutely no
way to avoid the 4-level page translation lookup stuff. Opteron ends up doing 5
memory accesses where the K7 does 3. Of course it can address 2^48 virtual
addresses, and 2^40 real addresses in its present form so it has some
advantages...

>
>>>I would prefer calling that 'streaming latency'. It's full name officially is
>>>though 'cross bandwidth latency'.
>>>
>>>For chesssoftware that cross bandwidth latency is completely irrelevant.
>>Not if you need to move blocks of data...
>
>That would make a funny chessprogram moving blocks of a few megabyte memory for
>each node :)

Don't have to move blocks of a few megabytes. Just generating moves is enough
to take advantage of sequential reads...

>
>>>>>>>>The IID principle can also apply to some additional situations:
>>>>>>>
>>>>>>>>1) You have a hash move, but it's at depth-2 rather than depth-1. You can do
>>>>>>>>another IID layer in this case.
>>>>>>>
>>>>>>>In that case hashmoves works better of course.
>>>>>>>
>>>>>>>>2) Your fail-high hash move (for some engines the only possible kind of hash
>>>>>>>>move) fails low. Here you can do IID to get an alternative move.
>>>>>>>
>>>>>>>This is highly unlikely as your IID is at depth-i where i > 0.
>>>>>>>
>>>>>>>So most likely that hashmove is already from a position j >= depth - i, which
>>>>>>>makes IID a complete waste of your time.
>>>>>>
>>>>>>I meant an IID where the move that already failed low is thrown out. You want
>>>>>>the second-best move at the reduced depth.
>>>>>
>>>>>Use double nullmove. works better than IID and the first move you already get
>>>>>the best move :)
>>>>
>>>>The depth reduction is too high. More experiments are needed - but it would be
>>>>quite a coincidence if the best IID depth reduction just happened to be exactly
>>>>twice the best null move depth reduction.
>>>>>
>>>>>>Usually, you will waste a few nodes this way of course. The idea is to avoid the
>>>>>>worst case scenario - of doing a full search through a bunch of other moves,
>>>>>>before finding the fail-high move.
>>>>>
>>>>>You can add 1000 conditions, but if something doesn't work in general, it won't
>>>>>work with 1000 conditions either. It just is harder to test in a way that
>>>>>objective and statistical significant conclusions are possible to statistical
>>>>>significant conclude whether it works or doesn't.
>>>>>
>>>>
>>>>In Rybka, IID works. Further, I haven't found any conditions which make it work
>>>>better, although I didn't try anything really fancy - just some comparisons
>>>>between current eval and the bound. Anyway, I read your reply to Tord, and will
>>>>keep retesting as the engine evolves.
>>>
>>>I didn't find a single condition under which it works for DIEP. It's just a
>>>waste of system time IMHO.
>>
>>Too bad. It works for me too. Used very selectively.
>>
>>>>>>>>And - as Tord mentioned - an IID search can be turned into the final
>>>>>>>>reduced-depth search, based on its result.
>>>>>>>>Vas
>>>>>>>
>>>>>>>Depth reducing the current search?
>>>>>>>
>>>>>>>Sounds like a rather bad idea to me.
>>>>>>
>>>>>>Well that's the million dollar question, isn't it?
>>>>>
>>>>>Seems there is 2 camps.
>>>>>
>>>>>I'm currently in the camp that i tried both worlds and concluded that depth
>>>>>reducing with nullmove is already enough.
>>>>>
>>>>>I can imagine last few plies some types of forward pruning somehow work. So far
>>>>>i could not prove that last though.
>>>>>
>>>>>I have a hard time believing that forward pruning in the entire tree is going to
>>>>>beat the nullmove pruning.
>>>>>
>>>>>We both are titled chessplayers, and i see simply that the few mistakes todays
>>>>>engines make, usually it is a dubious move caused by bugs in the forward
>>>>>pruning.
>>>>>
>>>>>Shredder is clearest example.
>>>>
>>>>Yes Shredder has some blind spots, but it can also search really deep,
>>>>especially when it's attacking. It's always nice to search deeper in the
>>>>critical lines. Anyway - I'm still checking out both camps.
>>>
>>>Well it's not so hard to add 7 plies to your search depth because your
>>>'selective search' might see 7 more (which in fact it does in diep).
>>>
>>>I prefer a 14 ply search depth with just nullmove above 18 with the chance that
>>>all your search lines are depth reduced and last few plies you supernullmove and
>>>in qsearch you lazy evaluate :)
>>>
>>>With forward pruning at every ply like shredder seems to do you only see faster
>>>what it sees anyway. What your eval doesn't see, search won't find either
>>>because you nonstop shorten such lines more than my '14 ply search depth' is
>>>doing.
>>>
>>>>The key is to think of the future - because it will soon be here. I really don't
>>>>care which search misses more tactics on some 32 bit 1 GHz machine ...
>>>>Vas
>>>
>>>Last time Shredder ran on a 1 Ghz machine at a world champs was in world champs
>>>London 2000, so those days are long gone.
>>>
>>>>>
>>>>>>Vas
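
The TLB-miss arithmetic in the reply above (5 reads at about 70ns on the Opteron, 3 reads at about 125ns on the K7, and the 4 meg / 2 gig TLB coverage figures) can be written out as a short Python sketch. The function names are made up for illustration, and the 2-level K7 walk is inferred from the 3-read count in the post rather than stated there. The model also assumes no page-table entry is cached anywhere, which is the worst case being argued, so treat it as a rough check of the numbers and not as a measurement.

```python
# Worst-case cost model for a random read that misses the TLB.
# All names here are hypothetical; the figures are the ones quoted in the post.

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def tlb_miss_cost_ns(raw_latency_ns, page_table_levels):
    """One memory read per page-table level, plus the read of the data itself.

    Assumes every page-table entry has to come from RAM (nothing cached).
    """
    return raw_latency_ns * (page_table_levels + 1)


def tlb_reach_bytes(entries, page_size_bytes):
    """Amount of memory that can be addressed without a TLB miss."""
    return entries * page_size_bytes


# Opteron: ~70ns raw latency, 4-level page translation (from the post).
opteron_ns = tlb_miss_cost_ns(70, 4)
# K7: ~125ns raw latency; 2 levels is an assumption implied by "3 memory reads".
k7_ns = tlb_miss_cost_ns(125, 2)

print("Opteron TLB-miss read:", opteron_ns, "ns")
print("K7 TLB-miss read:     ", k7_ns, "ns")

# TLB coverage with the 1024 entries claimed in the post.
print("4kb pages:", tlb_reach_bytes(1024, 4 * KB) // MB, "megs")
print("2M pages: ", tlb_reach_bytes(1024, 2 * MB) // GB, "gigs")
```

Under these assumptions the two machines come out within about 7% of each other on a true TLB-miss access, which is the point being argued; a benchmark that stays inside the TLB coverage of one machine but not the other is measuring something different.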
Last modified: Thu, 15 Apr 21 08:11:13 -0700