Author: Anthony Cozzie
Date: 07:54:05 04/18/04
On April 18, 2004 at 07:25:58, Vasik Rajlich wrote:

>On April 17, 2004 at 09:19:02, Anthony Cozzie wrote:
>
>>On April 17, 2004 at 05:27:23, Vasik Rajlich wrote:
>>
>>>On April 16, 2004 at 23:15:34, Robert Hyatt wrote:
>>>
>>>>On April 16, 2004 at 22:57:41, Robert Hyatt wrote:
>>>>
>>>>>On April 16, 2004 at 20:05:47, Vincent Diepeveen wrote:
>>>>>
>>>>>>On April 16, 2004 at 14:32:32, Anthony Cozzie wrote:
>>>>>>
>>>>>>>On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:
>>>>>>>
>>>>>>>>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
>>>>>>>><snip>
>>>>>>>>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>>>>>>>>
>>>>>>>>>>>>Evaltable would be my guess, though. Doing a random lookup into a big hash table on a 400MHz dual Xeon costs around 400ns when it is not in the cache.
>>>>>>>>>>>
>>>>>>>>>>>Depends. Use big memory pages and it costs 150ns. No TLB thrashing then.
>>>>>>>>>>
>>>>>>>>>>We all know that when things are in L2 cache it's faster. However, when using hash tables you are by definition busy thrashing the TLB.
>>>>>>>>>
>>>>>>>>>Wrong.
>>>>>>>>>
>>>>>>>>>Do you know what the TLB does? Do you know how big it is? Do you know what going from 4KB to 2MB/4MB page sizes does to that?
>>>>>>>>>
>>>>>>>>>Didn't think so...
>>>>>>>>>
>>>>>>>>>>Use big hash tables and start doing lookups into them, and it costs on average 400ns on a dual K7 and on your dual Xeon. Saying that under conditions A and B, which hardly ever occur, it is 150ns makes no sense. When it's in L2 cache on the Opteron it costs just 13 cycles. Same type of comparison.
>>>>>>>>>
>>>>>>>>>Using 4MB pages, my hash probes do _not_ take 400ns. You can say it all you want, but it will never be true. The proof is intuitive for anyone who understands the relationship between the number of virtual pages and the TLB size.
>>>>>>>>
>>>>>>>>Hi Bob,
>>>>>>>>
>>>>>>>>I guess you are talking about the P4/Xeon.
>>>>>>>>
>>>>>>>>From what I have read so far about the Opteron, there is a two-level data TLB as "part" of the L1 data cache (1024 64-byte cache lines), which maps the most-recently-used virtual addresses to their physical addresses. The primary TLB has 40 entries: 32 for 4KB pages, and only eight for 2MB pages. The secondary TLB contains 512 entries, for 4KB pages only. De Vries explains the "expensive" table walk very instructively.
>>>>>>>>
>>>>>>>>Do you believe, with huge randomly accessed tables, say >= 512MB, that eight 2MB page entries help avoid TLB thrashing?
>>>>>>>>
>>>>>>>>Are there special OS-dependent mallocs to get those huge pages? What about using one 2M page for some combined CONST or DATA segments? It would be nice to guide the linker that way.
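On the "special OS-dependent mallocs" question: on Linux the usual route is a hugetlb-backed mmap() rather than malloc(). A minimal sketch, assuming a kernel with MAP_HUGETLB support and huge pages reserved beforehand via /proc/sys/vm/nr_hugepages (alloc_hash is an invented name, and the 4KB fallback is just one possible policy):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  /* Try to back the table with 2MB huge pages; fall back to ordinary
     4KB pages if none are reserved.  bytes should be a multiple of
     the huge page size. */
  static void *alloc_hash(size_t bytes)
  {
      void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED)                 /* no huge pages reserved */
          p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          exit(1);
      }
      return p;
  }

Windows exposes the same idea through VirtualAlloc() with MEM_LARGE_PAGES, and older Linux kernels through shmget() with SHM_HUGETLB.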
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Gerd
>>>>>>>>
>>>>>>>>Some references:
>>>>>>>>
>>>>>>>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>>>>Appendix A: Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>>>>A.9 Translation-Lookaside Buffer
>>>>>>>>
>>>>>>>>Understanding the Detailed Architecture of AMD's 64 bit Core, by Hans de Vries
>>>>>>>>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>>>>>>>>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLBs
>>>>>>>>3.4 The 512 entry second level TLB
>>>>>>>>3.16 The TLB Flush Filter CAM
>>>>>>>
>>>>>>>The Opteron blows in this regard, but I believe that all 64(?) of the P4's TLB entries can be toggled to large pages.
>>>>>>>
>>>>>>>Also, it is worth noting that the Opteron's current TLB covers only 4MB of memory using 4KB pages, so 8MB is still an improvement :) Hopefully when AMD transitions to the 90nm process the new core will fix this.
>>>>>>>
>>>>>>>anthony
>>>>>>
>>>>>>The Opteron doesn't blow at all. Just test the speed at which you can get data to you on the Opteron.
>>>>>>
>>>>>>It's 2.5 times faster than the Xeon in that respect.
>>>>>
>>>>>Wrong. Just do the math.
>>>>>
>>>>>4-way Opteron: basic latency will be about 70ns for local memory, 140ns for the two close blocks of non-local memory, and 210ns for that last block of non-local memory.
>>>>>
>>>>>Intel = 150ns, period. Thrash the TLB and it can go to 450ns for the two-level map. Shrink to 4MB pages and this becomes a one-level map with a max latency of 300ns with a thrashed TLB, or 150ns if the TLB is pretty effective.
>>>>>
>>>>>Now thrash the TLB on the Opteron and every memory reference will require four memory references (if I read right, since the Opteron uses a 48-bit virtual address space, the map is broken into three parts).
>>>>
>>>>The above is wrong. The memory mapping table is broken into _four_ parts, not three. Add another 70ns to 210ns to all my estimates... When I originally looked at this a couple of months back, I read "four" as three map lookups plus the reference to fetch the data after the address is mapped. It really is four map lookups plus another to fetch the actual data.
>>>>
>>>>The benefit is that the Opteron has (at present) a 40-bit physical address space and a 48-bit virtual address space. Intel is 32-bit virtual per process, with a 36-bit physical address space...
>>>>
>>>>>Best case is 280ns for four probes to local memory. Worst case is 840ns for four probes to the most remote memory. You could make the O/S replicate the memory map into local memory for all the processors (4x memory waste) and limit this to 210ns for the mapping, plus the 70-210ns cost to actually fetch the data.
>>>>>
>>>>>Intel's big pages help quite a bit. The only problem is that the Intel TLB is not _that_ big. But 4MB pages reduce the requirement. AMD will fix this soon...
>>>>>
>>>>>The Opteron advantage evaporates quickly when you use big pages on Intel...
>>>>>
>>>>>The processor is still very good, of course. But it will get even better when this is cleaned up.
>>>
>>>I wonder if there could be some way to start this fetching process earlier.
>>>
>>>For example, as soon as you generate a move, you instruct/fool the processor into starting the hash entry fetch right there, while normal execution continues in parallel until it needs this data.
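One way to do that "instruct/fool" step is an explicit prefetch issued from makemove. A sketch using GCC's __builtin_prefetch; hash_table, hash_mask, and the HashEntry layout are invented for illustration, not anyone's actual engine code:

  /* Assumed engine globals - names invented for illustration. */
  typedef struct { unsigned long long key; short depth, flags; int score; } HashEntry;
  extern HashEntry *hash_table;      /* transposition table            */
  extern unsigned long hash_mask;    /* entry count - 1 (power of two) */

  /* Called at the end of make_move(), once the new Zobrist key is
     known.  The prefetch starts the cache-line fill immediately, so
     with luck the entry is already in cache by the time the search
     calls probe_hash(). */
  static inline void prefetch_hash(unsigned long long new_key)
  {
      __builtin_prefetch(&hash_table[new_key & hash_mask]);
  }

The catch, per Anthony below, is that a prefetch whose address misses in the TLB is simply dropped, which is why this only pays off together with big pages.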
>>>
>>>Also - it wasn't clear to me - if the data you are looking for is in a 2M page, are you skipping steps 1-3 of the table walk, or only step 3 (leaving three steps)?
>>>
>>>If you skip steps 1-3, I can't imagine why AMD would have only 8 of these entries.
>>>
>>>Vas
>>
>>I have looked into this extensively. The problem is that if the prefetch misses in the TLB, it gives up. So if you use big pages, you can cut your transposition table latency to almost zero with some prefetching in makemove.
>>
>>anthony
>
>How did you conclude this? By testing, or reading?
>
>I looked around a little; it wasn't mentioned anywhere.
>
>http://www.amd.com/us-en/assets/content_type/DownloadableAssets/Optimization_-_Tim_Wilkens.pdf
>pages 6 & 17
>
>http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_AMD_GDC_2004_MW.pdf
>page 18
>
>etc ..
>
>But now I can see how Fritz ended up being coded in assembly :-)
>
>Vas

I'm not finding it either :( I distinctly remember reading this somewhere, but you are welcome to try it yourself. Correct me if I am wrong, but if the page is not in the TLB, it might be out on the hard disk and require a system call to bring in. That might explain it. Vincent has challenged me to prove my theory that hugetlb is better, so after I do that I may try prefetching again. I should be able to prove something one way or another.

anthony
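PS. One way to prove something one way or another: time random probes into a big table, once with normal 4KB pages and once with the table in huge pages (swap malloc for a hugetlb allocation). A rough sketch - the sizes and probe count are arbitrary:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define TABLE_BYTES (512UL * 1024 * 1024)  /* 512MB, as in Gerd's example */
  #define PROBES      10000000UL

  int main(void)
  {
      size_t n = TABLE_BYTES / sizeof(unsigned long);
      unsigned long *table = malloc(TABLE_BYTES);
      unsigned long long i, x = 88172645463325252ULL, sum = 0;
      if (!table) return 1;
      memset(table, 1, TABLE_BYTES);   /* fault every page in, so reads
                                          hit real memory, not the shared
                                          zero page */
      clock_t t0 = clock();
      for (i = 0; i < PROBES; i++) {
          x ^= x << 13; x ^= x >> 7; x ^= x << 17;  /* cheap xorshift PRNG */
          sum += table[x % n];         /* the random probe being timed */
      }
      double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
      printf("%.1f ns/probe (checksum %llu)\n", secs * 1e9 / PROBES, sum);
      return 0;
  }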