Author: Robert Hyatt
Date: 19:09:14 05/30/04
On May 30, 2004 at 17:15:09, Robert Hyatt wrote:

>On May 30, 2004 at 16:25:10, Vincent Diepeveen wrote:
>
>>On May 30, 2004 at 16:15:54, Robert Hyatt wrote:
>>
>>>On May 30, 2004 at 15:41:30, Vincent Diepeveen wrote:
>>>
>>>>On May 29, 2004 at 11:30:27, Robert Hyatt wrote:
>>>>
>>>>[snip]
>>>>>See above. _no_ improvement. Raw latency on the Opteron is 1/2 the raw latency on the K7 and Intel boxes. But mapping adds 2 extra memory accesses on the Opteron, which does away with any actual advantage...
>>>>>
>>>>>>Software benchmarks like linbench and such pump a few gigabytes sequentially through the machine and then divide that by the time taken. Then you have bandwidth. 1/bandwidth = latency, they claim.
>>>>>
>>>>>But that is the latency _you_ are quoting when you say the Opteron has 1/2 the latency of the K7. In your worst case it is _not_ 1/2. It is the same.
>>>>
>>>>Let's show you the tested facts, K7 versus A64:
>>>>Opteron single CPU at CAS 2.5 versus K7 at CAS 2.5. Note the K7 has all memory banks filled; the Opteron does *not*, it just has a single DIMM and is single channel, not even dual channel. So actually its latency is better than shown here. A quad Opteron in fact tested at 120 ns latency from a single CPU when I tried a while ago.
>>>>
>>>>E:\dblat>dblat 300000000
>>>>Setting up a random access pattern, may take a while
>>>>Finished
>>>>Random access: 13.156 s, 131.560 ns/access
>>>>Testing same pattern again
>>>>Random access: 13.374 s, 133.740 ns/access
>>>>Setting up a different random access pattern, may take a while
>>>>Finished
>>>>Random access: 13.343 s, 133.430 ns/access
>>>>Testing same pattern again
>>>>Random access: 13.265 s, 132.650 ns/access
>>>>Sequential access offset 1: 0.250 s, 2.500 ns/access
>>>>Sequential access offset 2: 0.484 s, 4.840 ns/access
>>>>Sequential access offset 4: 0.875 s, 8.750 ns/access
>>>>Sequential access offset 8: 1.781 s, 17.810 ns/access
>>>>Sequential access offset 16: 3.375 s, 33.750 ns/access
>>>>Sequential access offset 32: 6.265 s, 62.650 ns/access
>>>>Sequential access offset 64: 6.516 s, 65.160 ns/access
>>>>Sequential access offset 128: 7.000 s, 70.000 ns/access
>>>>Sequential access offset 256: 7.938 s, 79.380 ns/access
>>>>Sequential access offset 512: 9.188 s, 91.880 ns/access
>>>>Sequential access offset 1024: 9.875 s, 98.750 ns/access
>>>>
>>>>Now the dual K7. All banks filled, A-brand memory.
>>>>C:\tries>dblat 300000000
>>>>Setting up a random access pattern, may take a while
>>>>Finished
>>>>Random access: 36.266 s, 362.660 ns/access
>>>>Testing same pattern again
>>>>Random access: 36.406 s, 364.060 ns/access
>>>>Setting up a different random access pattern, may take a while
>>>>Finished
>>>>Random access: 36.250 s, 362.500 ns/access
>>>>Testing same pattern again
>>>>Random access: 36.484 s, 364.840 ns/access
>>>>Sequential access offset 1: 0.906 s, 9.060 ns/access
>>>>Sequential access offset 2: 1.766 s, 17.660 ns/access
>>>>Sequential access offset 4: 3.437 s, 34.370 ns/access
>>>>Sequential access offset 8: 6.891 s, 68.910 ns/access
>>>>Sequential access offset 16: 13.875 s, 138.750 ns/access
>>>>Sequential access offset 32: 19.093 s, 190.930 ns/access
>>>>Sequential access offset 64: 19.156 s, 191.560 ns/access
>>>>Sequential access offset 128: 19.328 s, 193.280 ns/access
>>>>Sequential access offset 256: 19.719 s, 197.190 ns/access
>>>>Sequential access offset 512: 20.437 s, 204.370 ns/access
>>>>Sequential access offset 1024: 21.860 s, 218.600 ns/access
>>>>
>>>>So the practical difference for computer chess:
>>>>
>>>>363 / 132 = 2.75 times lower latency for the Opteron.
>>>>
>>>>An on-die memory controller isn't that stupid, nah?
>>>
>>>Never said it was. I _did_ say that if you blow out the TLB on the K7 and on the Opteron, the average access times are close.
>>>
>>>Raw latency on the Opteron is about 70ns to do _one_ memory read. Reading a random word where the TLB misses requires 5 memory reads. No way to avoid it, and it is going to cost 350ns. _period_. On the K7, average latency is about 125ns to do _one_ memory read. Reading a random word where the TLB misses requires 3 memory reads, or about 375ns.
>>>
>>>Those are _real_ numbers, reported by _many_ people including AMD.
>>
>>For hash table probes, the latency on the Opteron is as shown; idem for the K7.
>>
>>2.75 times better on average. Period.
>
>That "period" convinces me. After all, you are _never_ wrong on this kind of stuff. At least not until someone _else_ does the test for themselves... BTW, you quoted some _really_ bad data for a 4-way Opteron. I can guarantee you that memory latency varies depending on which CPU accesses memory and how far away it is. Your test probably even uses a small enough amount of memory and allocates everything locally. That really understates Opteron access latency...
>
>>>I have no idea what your program above does, and really don't care. But the Opteron has a much bigger TLB; if you don't blow it out by referring to at least 2048 different pages, then you are not comparing apples to apples. The Opteron has 1024 TLB entries, enough to efficiently address 4 megs of RAM (1024 * 4KB pages), or, if your O/S is smart enough, 2 gigs of RAM with 1024 entries * 2MB page size.
>>>
>>>But for true non-TLB-assisted random accesses, it is 350ns, period. There is absolutely no way to avoid the 4-level page translation lookup. The Opteron ends up doing almost twice as many memory accesses as the K7. Of course it can address 2^48 virtual addresses and 2^40 real addresses in its present form, so it has some advantages...
>>>
>>>>>>I would prefer calling that 'streaming latency'. Its official full name, though, is 'cross bandwidth latency'.
>>>>>>
>>>>>>For chess software that cross bandwidth latency is completely irrelevant.
>>>>>
>>>>>Not if you need to move blocks of data...
>>>>
>>>>That would make a funny chess program, moving blocks of a few megabytes of memory for each node :)
>>>
>>>You don't have to move blocks of a few megabytes. Just generating moves is enough to take advantage of sequential reads...
>>
>>May I remind you that the processors have an L1 and L2 cache?
>>
>>An L2 cache read = 13 cycles, which is roughly 5.41 nanoseconds on a 2.4GHz Opteron.
>>
>>Note the Opteron's L2 cache is faster than any other processor's.
>
>Not true, but I won't go off on that tangent...
>
>Who cares? Are you now off onto cache latency? You _were_ talking about RAM latency.
>
>How about picking a topic and sticking with it?
>
>Whatever you say, there is _no_ way to prevent a random access read from doing 5 memory accesses on the Opteron. _no_ way at all.
>
>If your test is broken so that missing TLB entries access MMU tables that are in L2, fine. But that may or may not be probable, depending on the program used...
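For reference, the arithmetic behind the two positions above, using only the figures both posters quote (no new measurements):

TLB reach, 4KB pages:  1024 entries * 4KB = 4MB
TLB reach, 2MB pages:  1024 entries * 2MB = 2GB

Opteron TLB miss: 4-level page-table walk + data read = 5 dependent reads * ~70ns  = ~350ns
K7 TLB miss:      2-level page-table walk + data read = 3 dependent reads * ~125ns = ~375ns

The dispute thus reduces to whether the walk's table entries themselves come from RAM (Hyatt's worst case, giving ~350ns versus ~375ns) or mostly from cache (which the dblat runs above suggest, giving ~132ns versus ~363ns).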
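dblat's source is not shown in the thread, so for readers who want to reproduce the measurement, here is a minimal sketch of a pointer-chasing latency test along the same lines. It is not Vincent's program; the buffer size, access count, and xorshift generator are illustrative choices. A dependent chase over a buffer far larger than cache and TLB reach forces exactly the TLB-miss page walks argued about above.

/* latency.c - minimal sketch of a dblat-style random-latency test.
 * Not Vincent's dblat; sizes and the PRNG are illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long rng = 88172645463325252ULL;
static unsigned long long xorshift64(void)
{
    rng ^= rng << 13;
    rng ^= rng >> 7;
    rng ^= rng << 17;
    return rng;
}

int main(void)
{
    size_t bytes = 300000000;               /* ~300 MB, as in the runs above */
    size_t n = bytes / sizeof(size_t);
    size_t *buf = malloc(n * sizeof(size_t));
    if (buf == NULL) return 1;

    /* Sattolo's algorithm: one random cycle through the buffer, so every
     * load depends on the previous one and cannot be prefetched. */
    for (size_t i = 0; i < n; i++) buf[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64() % i);   /* j < i keeps it one cycle */
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    size_t accesses = 100000000, p = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < accesses; i++)
        p = buf[p];                          /* chain of dependent loads */
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Printing p keeps the compiler from optimizing the chase away. */
    printf("Random access: %.3f s, %.3f ns/access (%zu)\n",
           s, s * 1e9 / (double)accesses, p);
    free(buf);
    return 0;
}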
>>>>>>>>>>>The IID principle can also apply to some additional situations:
>>>>>>>>>>>
>>>>>>>>>>>1) You have a hash move, but it's at depth-2 rather than depth-1. You can do another IID layer in this case.
>>>>>>>>>>
>>>>>>>>>>In that case the hash move works better, of course.
>>>>>>>>>>
>>>>>>>>>>>2) Your fail-high hash move (for some engines the only possible kind of hash move) fails low. Here you can do IID to get an alternative move.
>>>>>>>>>>
>>>>>>>>>>This is highly unlikely, as your IID is at depth-i where i > 0.
>>>>>>>>>>
>>>>>>>>>>So most likely that hash move is already from a depth j >= depth - i, which makes IID a complete waste of your time.
>>>>>>>>>
>>>>>>>>>I meant an IID where the move that already failed low is thrown out. You want the second-best move at the reduced depth.
>>>>>>>>
>>>>>>>>Use double nullmove. It works better than IID, and you already get the best move as the first move :)
>>>>>>>
>>>>>>>The depth reduction is too high. More experiments are needed - but it would be quite a coincidence if the best IID depth reduction just happened to be exactly twice the best null move depth reduction.
>>>>>>>
>>>>>>>>>Usually, you will waste a few nodes this way of course. The idea is to avoid the worst case scenario - doing a full search through a bunch of other moves before finding the fail-high move.
>>>>>>>>
>>>>>>>>You can add 1000 conditions, but if something doesn't work in general, it won't work with 1000 conditions either. It just becomes harder to test in a way that allows objective, statistically significant conclusions about whether it works or not.
>>>>>>>
>>>>>>>In Rybka, IID works. Further, I haven't found any conditions which make it work better, although I didn't try anything really fancy - just some comparisons between the current eval and the bound. Anyway, I read your reply to Tord, and will keep retesting as the engine evolves.
>>>>>>
>>>>>>I didn't find a single condition under which it works for DIEP. It's just a waste of system time IMHO.
>>>>>
>>>>>Too bad. It works for me too. Used very selectively.
>>>>>
>>>>>>>>>>>And - as Tord mentioned - an IID search can be turned into the final reduced-depth search, based on its result.
>>>>>>>>>>>Vas
>>>>>>>>>>
>>>>>>>>>>Depth reducing the current search?
>>>>>>>>>>
>>>>>>>>>>Sounds like a rather bad idea to me.
>>>>>>>>>
>>>>>>>>>Well that's the million dollar question, isn't it?
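For reference, the basic IID scheme this whole exchange argues about, as a compilable C skeleton. Position, hash_probe, search_moves, and IID_REDUCTION are placeholder names, not code from Crafty, DIEP, or Rybka:

/* iid.c - skeleton of internal iterative deepening (IID).
 * hash_probe/search_moves are assumed engine hooks, declared here only
 * so the skeleton compiles; no engine from the thread is quoted. */
typedef struct Position Position;   /* engine-specific, opaque here */
typedef int Move;
enum { NO_MOVE = 0, IID_REDUCTION = 2 };  /* reduction is a tunable */

Move hash_probe(const Position *pos);               /* best move from TT */
int  search_moves(Position *pos, int alpha, int beta,
                  int depth, Move try_first);       /* normal move loop */

int search(Position *pos, int alpha, int beta, int depth)
{
    Move hash_move = hash_probe(pos);

    /* No move-ordering hint at a deep node: search the same node at
     * reduced depth purely to get one, avoiding the worst case of
     * searching many moves full-width before the fail-high move. */
    if (hash_move == NO_MOVE && depth > IID_REDUCTION + 1) {
        search(pos, alpha, beta, depth - IID_REDUCTION);
        hash_move = hash_probe(pos);  /* the IID search stored its best move */
    }
    return search_moves(pos, alpha, beta, depth, hash_move);
}

Vas's variants fit the same skeleton: when the hash move failed low, redo the reduced search with that move excluded to get the second-best move; and, as in the Tord variant he mentions, the reduced-depth result can be returned directly instead of re-searching at full depth.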
>>>>>>>>Seems there are 2 camps.
>>>>>>>>
>>>>>>>>I'm currently in the camp that has tried both worlds and concluded that depth reduction with nullmove is already enough.
>>>>>>>>
>>>>>>>>I can imagine that in the last few plies some types of forward pruning somehow work. So far I could not prove that, though.
>>>>>>>>
>>>>>>>>I have a hard time believing that forward pruning in the entire tree is going to beat nullmove pruning.
>>>>>>>>
>>>>>>>>We are both titled chess players, and I simply see that of the few mistakes today's engines make, usually it is a dubious move caused by bugs in the forward pruning.
>>>>>>>>
>>>>>>>>Shredder is the clearest example.
>>>>>>>
>>>>>>>Yes, Shredder has some blind spots, but it can also search really deep, especially when it's attacking. It's always nice to search deeper in the critical lines. Anyway - I'm still checking out both camps.
>>>>>>
>>>>>>Well, it's not so hard to add 7 plies to your search depth because your 'selective search' might see 7 more (which in fact it does in DIEP).
>>>>>>
>>>>>>I prefer a 14 ply search depth with just nullmove over an 18 ply one where all your search lines risk being depth reduced, the last few plies use supernullmove, and the qsearch uses lazy evaluation :)
>>>>>>
>>>>>>With forward pruning at every ply, like Shredder seems to do, you only see faster what your eval sees anyway. What your eval doesn't see, the search won't find either, because you nonstop shorten such lines more than my '14 ply search depth' is doing.
>>>>>>
>>>>>>>The key is to think of the future - because it will soon be here. I really don't care which search misses more tactics on some 32 bit 1 GHz machine ...
>>>>>>>Vas
>>>>>>
>>>>>>The last time Shredder ran on a 1 GHz machine at a world championship was London 2000, so those days are long gone.
>>>>>>
>>>>>>>>>Vas
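For completeness, Vincent's 'double nullmove' alternative to IID, again as a hedged sketch under my own reading of his posts: allow two consecutive null moves, but not three, so zugzwang is eventually detected by the nested null search instead of being pruned away. Hook names and R = 2 are illustrative; DIEP's actual code is not shown in the thread.

/* nullmove.c - skeleton of null-move pruning with the "double nullmove"
 * variant: two consecutive nulls allowed, a third is not, so zugzwang
 * shows up as a failed nested null search rather than a wrong cutoff.
 * All hook names are placeholders, not DIEP's code. */
typedef struct Position Position;
enum { R = 2 };                       /* null-move depth reduction, tunable */

int  evaluate(const Position *pos);
int  in_check(const Position *pos);
void make_null(Position *pos);
void unmake_null(Position *pos);
int  search_moves(Position *pos, int alpha, int beta,
                  int depth, int nulls);   /* real moves reset nulls to 0 */

int search(Position *pos, int alpha, int beta, int depth, int nulls)
{
    if (depth <= 0)
        return evaluate(pos);        /* stands in for a real qsearch */

    if (nulls < 2 && !in_check(pos)) {
        make_null(pos);              /* hand the opponent a free move */
        int score = -search(pos, -beta, -beta + 1,
                            depth - 1 - R, nulls + 1);
        unmake_null(pos);
        if (score >= beta)
            return score;            /* free move didn't help them: prune */
    }
    return search_moves(pos, alpha, beta, depth, nulls);
}

Two nested nulls cancel out, so the move this produces comes from a search reduced by roughly 2R + 2 plies, which is Vas's objection above: used as an IID substitute, the depth reduction is about twice the null-move one.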