Author: Robert Hyatt
Date: 10:18:36 04/18/04
On April 17, 2004 at 16:47:22, Vincent Diepeveen wrote:

>On April 16, 2004 at 22:57:41, Robert Hyatt wrote:
>
>the only interesting thing for us is how fast it is when doing hashtable
>lookups.
>
>all the bandwidth latencies you mention here is not workable data.
>
>you need to know as a programmer what the time is to get your hashtable entries.
>
>This you can test easily with more than 2 different testprograms. One from me,
>one from dieter.
>
>both show the same result.
>
>Opteron has a 2.5 faster latency there.

And exactly what opteron did you test on? I gave numbers for a 4 cpu system.
random access latency on my 4-way 700mhz xeon is 125ns to 375ns. 125 ns if not
thrashing the TLB, 375 if the TLB can't keep up. 4-way 2.2ghz opteron had
70-210ns latency with no TLB thrashing. 280-840ns latency if TLB is blasted.
That is _not_ 2.5X. And those are actual numbers from a real machine.

>
>The rest is just irrelevant.

Are you talking about your data? Again, what opteron did you run on? I can give
you the serial number of the one I tested, as well as for the quad 700 xeon box
I still have...

>
>>On April 16, 2004 at 20:05:47, Vincent Diepeveen wrote:
>>
>>>On April 16, 2004 at 14:32:32, Anthony Cozzie wrote:
>>>
>>>>On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:
>>>>
>>>>>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
>>>>><snip>
>>>>>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>>>>>
>>>>>>>>>Evaltable would be my guess though. Doing a random lookup to a big hashtable at
>>>>>>>>>a 400Mhz dual Xeon costs when it is not in the cache around 400ns.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>Depends. Use big memory pages and it costs 150 ns. No TLB thrashing then.
>>>>>>>
>>>>>>>We all know that when things are in L2 cache it's faster. However when using
>>>>>>>hashtables by definition you are busy with TLB trashing.
>>>>>>
>>>>>>Wrong.
>>>>>>
>>>>>>Do you know what the TLB does? Do you know how big it is? Do you know what
>>>>>>going from 4KB to 2MB/4MB page sizes does to that?
>>>>>>
>>>>>>Didn't think so...
>>>>>>
>>>>>>>
>>>>>>>Use big hashtables start doing lookups to your hashtable and it costs on average
>>>>>>>400 ns on a dual k7 and your dual Xeon. Saying that under conditions A and B,
>>>>>>>which hardly happen, that it is 150ns makes no sense. When it's in L2 cache at
>>>>>>>opteron it just costs 13 cycles. Same type of comparision.
>>>>>>
>>>>>>Using 4mb pages, my hash probes do _not_ take 400ns. You can say it all you
>>>>>>want, but it will never be true. The proof is intuitive for anyone
>>>>>>understanding the relationship between number of virtual pages and TLB size.
>>>>>>
>>>>>
>>>>>Hi Bob,
>>>>>
>>>>>I guess you are talking about P4/Xeon.
>>>>>
>>>>>What i read so far about opteron, there is a two level Data-TLB as "part" of the
>>>>>L1-Data Cache (1024 - 64 Byte cache lines), which maps the most-recently-used
>>>>>virtual addresses to their physical addresses. The primary TLB has 40 entries,
>>>>>32 for 4KB pages, only eight for 2MB pages. The secondary contains 512 entries
>>>>>for 4KB pages only. De Vries explains the "expensive" table-walk very
>>>>>instructive.
>>>>>
>>>>>Do you believe, with huge random access tables, let say >= 512MB,
>>>>>that eight 2MB pages helps to avoid TLB trashing?
>>>>>
>>>>>Are there special OS-dependent mallocs to get those huge pages?
>>>>>What about using one 2M page for some combined CONST or DATA-segments?
>>>>>It would be nice to guide the linker that way.
>>>>>
>>>>>Thanks,
>>>>>Gerd
>>>>>
>>>>>
>>>>>some references:
>>>>>
>>>>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
>>>>>Processors
>>>>>Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>A.9 Translation-Lookaside Buffer
>>>>>
>>>>>
>>>>>Understanding the detailed Architecture of AMD's 64 bit Core
>>>>> by Hans de Vries
>>>>>
>>>>>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>>>>>
>>>>>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primairy TLB's
>>>>>3.4 The 512 entry second level TLB
>>>>>3.16 The TLB Flush Filter CAM
>>>>
>>>>
>>>>Opteron blows in this regard, but I believe that all 64(?) of the P4's TLB
>>>>entries can be toggled to large pages.
>>>>
>>>>Also, it is worth nothing that the opterons current TLB supports only 4MB of
>>>>memory using the 4KB pages, so 8 MB is still an improvement :) Hopefully when
>>>>AMD transisitions to the 90nm process the new core will fix this.
>>>>
>>>>anthony
>>>
>>>Opteron doesn't blow at all. Just test the speed you can get data to you at
>>>opteron.
>>>
>>>It's 2.5 times faster than Xeon in that respect.
>>
>>
>>Wrong. Just do the math.
>>
>>4-way opteron. basic latency will be about 70ns for local memory, 140 for two
>>close blocks of non-local memory, and 210 for that last block of non-local
>>memory.
>>
>>Intel = 150ns period. Thrash the TLB and it can go to 450ns for the two-level
>>map. Shrink to 4mb pages and this becomes a 1-level map with a max latency of
>>300ns with thrashed TLB, or 150 if the TLB is pretty effective.
>>
>>Now thrash the TLB on the opteron and every memory reference will require 4
>>memory references (If I read right, since opteron uses 48 bit virtual address
>>space, the map is broken into three parts). Best case is 280ns for four probes
>>to local memory. Worst case is 840ns for four probes to most remote memory.
>>You could make the O/S replicate the memory map into local memory for all the
>>processors (4x memory waste) and limit this to 210ns for mapping plus the
>>70-210ns cost to actually fetch the data.
>>
>>Intel's big pages help quite a bit. Only problem is that the intel TLB is not
>>_that_ big. But 4mb pages reduce the requirement. AMD will fix this soon...
>>
>>The opteron advantage evaporates quickly, when you use big pages on Intel...
>>
>>Processor is still very good of course. But it will get even better when this
>>is cleaned up.
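
For anyone who wants to check the arithmetic in the quoted "do the math" paragraphs, here is a minimal sketch in Python. It is a back-of-the-envelope model, not a benchmark: it assumes every page-table level costs one uncached memory reference on a TLB miss, and that the data fetch costs one more. The latency figures are the ones quoted above; the names (OPTERON_NS, XEON_NS, access_cost_ns) are made up for illustration.

```python
# Figures quoted in the post above, in nanoseconds per memory reference.
OPTERON_NS = {"local": 70, "near": 140, "far": 210}  # 4-way Opteron, by distance
XEON_NS = 150                                        # Xeon, one flat latency


def access_cost_ns(ns_per_ref, table_levels):
    """Cost of one data access when the TLB misses.

    Assumption: each page-table level costs one uncached memory
    reference, and the data fetch itself costs one more.
    """
    return ns_per_ref * (table_levels + 1)


# Xeon: two-level map with 4KB pages, one-level map with 4MB pages.
xeon_4k_miss = access_cost_ns(XEON_NS, table_levels=2)
xeon_4m_miss = access_cost_ns(XEON_NS, table_levels=1)

# Opteron: the post counts four memory references per thrashed access,
# i.e. three table reads plus the data fetch.
opteron_miss_best = access_cost_ns(OPTERON_NS["local"], table_levels=3)
opteron_miss_worst = access_cost_ns(OPTERON_NS["far"], table_levels=3)

# Page tables replicated into local memory: three local table reads,
# then the data fetch at whatever distance the data lives.
replicated_map = 3 * OPTERON_NS["local"]
opteron_replicated_best = replicated_map + OPTERON_NS["local"]
opteron_replicated_worst = replicated_map + OPTERON_NS["far"]

if __name__ == "__main__":
    print("Xeon, 4KB pages, TLB miss:", xeon_4k_miss, "ns")
    print("Xeon, 4MB pages, TLB miss:", xeon_4m_miss, "ns")
    print("Opteron, TLB miss:", opteron_miss_best, "to", opteron_miss_worst, "ns")
    print("Opteron, replicated map:", opteron_replicated_best, "to",
          opteron_replicated_worst, "ns")
```

The model deliberately ignores page-table entries that happen to sit in cache, so real numbers on a real machine can come out lower than these worst cases.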
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.