Computer Chess Club Archives



Subject: Re: Processors have an L2 cache, Mr. Hyatt

Author: Robert Hyatt

Date: 19:09:14 05/30/04


On May 30, 2004 at 17:15:09, Robert Hyatt wrote:

>On May 30, 2004 at 16:25:10, Vincent Diepeveen wrote:
>
>>On May 30, 2004 at 16:15:54, Robert Hyatt wrote:
>>
>>>On May 30, 2004 at 15:41:30, Vincent Diepeveen wrote:
>>>
>>>>On May 29, 2004 at 11:30:27, Robert Hyatt wrote:
>>>>
>>>>[snip]
>>>>>See above.  _no_ improvement.  Raw latency on the Opteron is 1/2 the raw
>>>>>latency on the K7 and Intel boxes.  But address mapping adds 2 extra memory
>>>>>accesses on the Opteron, which does away with any actual advantage...
>>>>>>
>>>>>>Software benchmarks like linbench and such pump a few gigabytes sequentially
>>>>>>through the machine and then divide that by the elapsed time.  Then you have
>>>>>>bandwidth.  1/bandwidth = latency, they claim.
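
(For scale, assuming the benchmark does 4-byte sequential accesses: a stream
running at 1.6 GB/s averages 4 / 1.6e9 = 2.5 ns per access, which is the kind of
figure such a tool then reports as "latency".  The element size here is an
assumption; the point is only that 1/bandwidth yields a per-access time, not a
true random-access latency.)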
>>>>>
>>>>>
>>>>>But that is the latency _you_ are quoting when you say the Opteron is 1/2 the
>>>>>latency of the K7.  In your worst case it is _not_ 1/2.  It is the same.
>>>>
>>>>Let me show you the tested facts, K7 versus A64:
>>>>A single-CPU Opteron at CAS 2.5 versus a K7 at CAS 2.5.  Note the K7 has all
>>>>memory banks filled; the Opteron does *not*.  It has just a single DIMM in
>>>>single-channel mode, not even dual channel, so the Opteron's latency is
>>>>actually better than shown here.  A quad Opteron in fact tested at 120 ns
>>>>latency for a single CPU when I tried it a while ago.
>>>>
>>>>E:\dblat>dblat 300000000
>>>>Setting up a random access pattern, may take a while
>>>>Finished
>>>>Random access:  13.156 s, 131.560 ns/access
>>>>Testing same pattern again
>>>>Random access:  13.374 s, 133.740 ns/access
>>>>Setting up a different random access pattern, may take a while
>>>>Finished
>>>>Random access:  13.343 s, 133.430 ns/access
>>>>Testing same pattern again
>>>>Random access:  13.265 s, 132.650 ns/access
>>>>Sequential access offset     1:   0.250 s,   2.500 ns/access
>>>>Sequential access offset     2:   0.484 s,   4.840 ns/access
>>>>Sequential access offset     4:   0.875 s,   8.750 ns/access
>>>>Sequential access offset     8:   1.781 s,  17.810 ns/access
>>>>Sequential access offset    16:   3.375 s,  33.750 ns/access
>>>>Sequential access offset    32:   6.265 s,  62.650 ns/access
>>>>Sequential access offset    64:   6.516 s,  65.160 ns/access
>>>>Sequential access offset   128:   7.000 s,  70.000 ns/access
>>>>Sequential access offset   256:   7.938 s,  79.380 ns/access
>>>>Sequential access offset   512:   9.188 s,  91.880 ns/access
>>>>Sequential access offset  1024:   9.875 s,  98.750 ns/access
>>>>
>>>>Now the dual K7: all banks filled, A-brand memory.
>>>>C:\tries>dblat 300000000
>>>>Setting up a random access pattern, may take a while
>>>>Finished
>>>>Random access:  36.266 s, 362.660 ns/access
>>>>Testing same pattern again
>>>>Random access:  36.406 s, 364.060 ns/access
>>>>Setting up a different random access pattern, may take a while
>>>>Finished
>>>>Random access:  36.250 s, 362.500 ns/access
>>>>Testing same pattern again
>>>>Random access:  36.484 s, 364.840 ns/access
>>>>Sequential access offset     1:   0.906 s,   9.060 ns/access
>>>>Sequential access offset     2:   1.766 s,  17.660 ns/access
>>>>Sequential access offset     4:   3.437 s,  34.370 ns/access
>>>>Sequential access offset     8:   6.891 s,  68.910 ns/access
>>>>Sequential access offset    16:  13.875 s, 138.750 ns/access
>>>>Sequential access offset    32:  19.093 s, 190.930 ns/access
>>>>Sequential access offset    64:  19.156 s, 191.560 ns/access
>>>>Sequential access offset   128:  19.328 s, 193.280 ns/access
>>>>Sequential access offset   256:  19.719 s, 197.190 ns/access
>>>>Sequential access offset   512:  20.437 s, 204.370 ns/access
>>>>Sequential access offset  1024:  21.860 s, 218.600 ns/access
>>>>
>>>>So the practical difference for computer chess:
>>>>
>>>>363 / 132 = 2.75 times lower latency for the Opteron.
>>>>
>>>>An on-die memory controller isn't so stupid after all, is it?
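
The random-access figure a tool like this reports is typically obtained by
chasing a randomized pointer chain, so that every load depends on the one
before it and the CPU cannot overlap or prefetch the misses.  A minimal sketch
of that technique in C follows; this is an assumption about how dblat works,
not its actual source, and the buffer size and seed are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Measure average random-access latency by walking a single-cycle
       random permutation: each load depends on the previous one. */
    static size_t rnd(size_t bound) {
        /* combine two rand() calls so indices may exceed RAND_MAX */
        size_t r = ((size_t)rand() << 15) ^ (size_t)rand();
        return r % bound;
    }

    int main(void) {
        size_t n = 32u * 1024 * 1024;        /* 256 MB of size_t on 64-bit */
        size_t *chain = malloc(n * sizeof *chain);
        if (chain == NULL) return 1;

        for (size_t i = 0; i < n; i++) chain[i] = i;
        srand(12345);
        for (size_t i = n - 1; i > 0; i--) { /* Sattolo: j strictly below i */
            size_t j = rnd(i);
            size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
        }

        clock_t start = clock();
        size_t p = 0;
        for (size_t i = 0; i < n; i++) p = chain[p];  /* dependent loads */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("Random access: %.3f s, %.3f ns/access (sink %zu)\n",
               secs, secs * 1e9 / (double)n, p);
        free(chain);
        return 0;
    }

Sattolo's shuffle guarantees one big cycle, so the chase cannot get stuck in a
short, cache-resident loop.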
>>>
>>>Never said it was.  I _did_ say that if you blow out the TLB on both the K7 and
>>>the Opteron, the average access times are close.
>>>
>>>Raw latency on the Opteron is about 70 ns to do _one_ memory read.  Reading a
>>>random word where the TLB misses requires 5 memory reads.  There is no way to
>>>avoid it, and it is going to cost 350 ns, _period_.  On the K7, average latency
>>>is about 125 ns to do _one_ memory read, and a random read where the TLB misses
>>>requires 3 memory reads, or about 375 ns.
>>>
>>>Those are _real_ numbers, reported by _many_ people including AMD.
>>
>>For hashtable probes, the latency on the Opteron is as shown; idem for the K7.
>>
>>2.75 times faster on average.  Period.
>
>That "period" convinces me.  After all, you are _never_ wrong on this kind of
>stuff.  At least not until someone _else_ does the test for themselves..

BTW, you quoted some _really_ bad data for a 4-way Opteron.  I can guarantee you
that memory latency varies depending on which CPU accesses memory and how far
away that memory is.  Your test probably also uses a small enough buffer that
everything is allocated locally.  That really understates Opteron access
latency...


>>>
>>>I have no idea what your program above does, and really don't care.  But the
>>>Opteron has a much bigger TLB; if you don't blow it out by referring to at
>>>least 2048 different pages, then you are not comparing apples to apples.  The
>>>Opteron has 1024 TLB entries, enough to efficiently address 4 megs of RAM
>>>(1024 * 4 KB pages), or, if your O/S is smart enough, 2 gigs of RAM
>>>(1024 entries * 2 MB page size).
>>
>>>But for true non-TLB-assisted random accesses, it is 350 ns, period.  There is
>>>absolutely no way to avoid the 4-level page-translation lookup.  The Opteron
>>>ends up doing almost twice as many memory accesses as the K7.  Of course it
>>>can address 2^48 virtual addresses and 2^40 real addresses in its present
>>>form, so it has some advantages...
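
Spelling out the arithmetic behind those figures: a TLB miss on the Opteron
costs 4 page-table reads plus the data read itself, 5 x 70 ns = 350 ns, while on
the K7 it costs 2 page-table reads plus the data read, 3 x 125 ns = 375 ns.  The
TLB-reach numbers work the same way: 1024 entries x 4 KB pages = 4 MB, and
1024 entries x 2 MB pages = 2 GB.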
>>>>>>
>>>>>>I would prefer calling that 'streaming latency'.  Its official name, though,
>>>>>>is 'cross-bandwidth latency'.
>>>>>>
>>>>>>For chess software that cross-bandwidth latency is completely irrelevant.
>>>>>Not if you need to move blocks of data...
>>>>
>>>>That would make a funny chess program, moving blocks of a few megabytes of
>>>>memory for each node :)
>>>
>>>You don't have to move blocks of a few megabytes.  Just generating moves is
>>>enough to take advantage of sequential reads...
>>
>>May I remind you that the processors have an L1 and an L2 cache?
>>
>>An L2 cache read = 13 cycles, which is roughly 5.4 nanoseconds (13 / 2.4 GHz)
>>on a 2.4 GHz Opteron.
>>
>>Note the Opteron's L2 cache is faster than any other processor's.
>
>Not true, but I won't go off on that tangent...
>
>Who cares?  Are you now off onto cache latency?  You _were_ talking about RAM
>latency.
>
>How about picking a topic and sticking with it?
>
>Whatever you say, there is _no_ way to prevent a random-access read from doing
>5 memory accesses on the Opteron.  _No_ way at all.
>
>If your test is broken so that missing TLB entries access MMU tables that are in
>L2, fine.  But that may or may not be probable depending on the program used...
>
>>>>>>>>>>
>>>>>>>>>>>The IID principle can also apply to some additional situations:
>>>>>>>>>>
>>>>>>>>>>>1) You have a hash move, but it's at depth-2 rather than depth-1. You can do
>>>>>>>>>>>another IID layer in this case.
>>>>>>>>>>
>>>>>>>>>>In that case the hash move works better, of course.
>>>>>>>>>>
>>>>>>>>>>>2) Your fail-high hash move (for some engines the only possible kind of hash
>>>>>>>>>>>move) fails low. Here you can do IID to get an alternative move.
>>>>>>>>>>
>>>>>>>>>>This is highly unlikely, as your IID is at depth-i where i > 0.
>>>>>>>>>>
>>>>>>>>>>So most likely that hash move is already from a depth j >= depth - i, which
>>>>>>>>>>makes IID a complete waste of your time.
>>>>>>>>>
>>>>>>>>>I meant an IID where the move that already failed low is thrown out. You want
>>>>>>>>>the second-best move at the reduced depth.
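
For concreteness, IID at a node with no hash move usually looks something like
the sketch below: a generic negamax outline with assumed helper names
(hash_probe) and illustrative constants, not code from Rybka or any other
engine in this thread.

    /* Generic IID sketch.  Position, hash_probe() and the IID_* constants
       are assumed/illustrative. */
    typedef struct Position Position;      /* engine-specific board state */
    typedef int Move;
    #define NO_MOVE        0
    #define IID_MIN_DEPTH  5               /* only bother at deep nodes */
    #define IID_REDUCTION  2               /* illustrative reduction */

    Move hash_probe(const Position *pos);  /* best move from earlier searches */

    int search(Position *pos, int alpha, int beta, int depth) {
        Move best = hash_probe(pos);

        /* No move to try first, and the subtree is big enough that bad
           ordering would hurt: run a cheap reduced-depth search purely to
           leave a best move in the hash table. */
        if (best == NO_MOVE && depth >= IID_MIN_DEPTH) {
            search(pos, alpha, beta, depth - IID_REDUCTION);
            best = hash_probe(pos);
        }

        /* ... the normal move loop follows, searching 'best' first ... */
        (void)best;
        return alpha;                      /* placeholder for the move loop */
    }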
>>>>>>>>
>>>>>>>>Use double nullmove.  It works better than IID, and from the first move you
>>>>>>>>already get the best move :)
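
For readers who haven't seen it: double nullmove lets a null move be answered
by a second null move, but never a third.  A rough sketch under the same kind
of assumptions as above (the helpers and the reduction R are illustrative, not
DIEP's actual code):

    /* Double-nullmove sketch.  make_null_move(), unmake_null_move() and R
       are assumed/illustrative.  'nulls' counts consecutive null moves on
       the current path; a regular move resets it to 0. */
    typedef struct Position Position;
    #define R 3                            /* illustrative null-move reduction */

    void make_null_move(Position *pos);
    void unmake_null_move(Position *pos);

    int search(Position *pos, int alpha, int beta, int depth, int nulls) {
        if (depth > R && nulls < 2) {
            make_null_move(pos);
            int score = -search(pos, -beta, -beta + 1, depth - 1 - R, nulls + 1);
            unmake_null_move(pos);
            if (score >= beta)
                return score;              /* fail high without making a move */
        }

        /* ... regular move loop follows, passing nulls = 0 to child nodes ... */
        return alpha;                      /* placeholder for the move loop */
    }

Because the second null is allowed, a side in zugzwang can simply pass back, so
positions where doing nothing is best can be detected instead of silently
producing false cutoffs.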
>>>>>>>
>>>>>>>The depth reduction is too high. More experiments are needed - but it would be
>>>>>>>quite a coincidence if the best IID depth reduction just happened to be exactly
>>>>>>>twice the best null move depth reduction.
>>>>>>>>
>>>>>>>>>Usually, you will waste a few nodes this way of course. The idea is to avoid
>>>>>>>>>the worst-case scenario of doing a full search through a bunch of other
>>>>>>>>>moves before finding the fail-high move.
>>>>>>>>
>>>>>>>>You can add 1000 conditions, but if something doesn't work in general, it
>>>>>>>>won't work with 1000 conditions either.  It just becomes harder to test in a
>>>>>>>>way that allows objective, statistically significant conclusions about
>>>>>>>>whether it works or doesn't.
>>>>>>>>
>>>>>>>
>>>>>>>In Rybka, IID works. Further, I haven't found any conditions which make it work
>>>>>>>better, although I didn't try anything really fancy - just some comparisons
>>>>>>>between current eval and the bound. Anyway, I read your reply to Tord, and will
>>>>>>>keep retesting as the engine evolves.
>>>>>>
>>>>>>I didn't find a single condition under which it works for DIEP. It's just a
>>>>>>waste of system time IMHO.
>>>>>
>>>>>Too bad.  It works for me too.  Used very selectively.
>>>>>>>>>>
>>>>>>>>>>>And - as Tord mentioned - an IID search can be turned into the final
>>>>>>>>>>>reduced-depth search, based on its result.
>>>>>>>>>>>Vas
>>>>>>>>>>
>>>>>>>>>>Depth-reducing the current search?
>>>>>>>>>>
>>>>>>>>>>Sounds like a rather bad idea to me.
>>>>>>>>>
>>>>>>>>>Well that's the million dollar question, isn't it?
>>>>>>>>
>>>>>>>>It seems there are two camps.
>>>>>>>>
>>>>>>>>I'm currently in the camp that has tried both worlds and concluded that
>>>>>>>>depth reduction with nullmove is already enough.
>>>>>>>>
>>>>>>>>I can imagine that in the last few plies some types of forward pruning
>>>>>>>>somehow work, though so far I could not prove that.
>>>>>>>>
>>>>>>>>I have a hard time believing that forward pruning throughout the entire tree
>>>>>>>>is going to beat nullmove pruning.
>>>>>>>>
>>>>>>>>We are both titled chessplayers, and I simply see that of the few mistakes
>>>>>>>>today's engines make, usually it is a dubious move caused by bugs in the
>>>>>>>>forward pruning.
>>>>>>>>
>>>>>>>>Shredder is the clearest example.
>>>>>>>
>>>>>>>Yes, Shredder has some blind spots, but it can also search really deep,
>>>>>>>especially when it's attacking. It's always nice to search deeper in the
>>>>>>>critical lines. Anyway - I'm still checking out both camps.
>>>>>>
>>>>>>Well, it's not so hard to add 7 plies to your search depth, because your
>>>>>>'selective search' might see 7 more (which in fact it does in DIEP).
>>>>>>
>>>>>>I prefer a 14-ply search depth with just nullmove over an 18-ply one where
>>>>>>all your search lines may be depth-reduced, the last few plies use
>>>>>>supernullmove, and the qsearch evaluates lazily :)
>>>>>>
>>>>>>With forward pruning at every ply, like Shredder seems to do, you only see
>>>>>>faster what the engine sees anyway.  What your eval doesn't see, the search
>>>>>>won't find either, because you constantly shorten such lines more than my
>>>>>>'14-ply search depth' does.
>>>>>>
>>>>>>>The key is to think of the future - because it will soon be here. I really don't
>>>>>>>care which search misses more tactics on some 32-bit 1 GHz machine ...
>>>>>>>Vas
>>>>>>
>>>>>>The last time Shredder ran on a 1 GHz machine at a world championship was
>>>>>>London 2000, so those days are long gone.
>>>>>>
>>>>>>>>
>>>>>>>>>Vas


