Computer Chess Club Archives


Subject: Re: Some new hyper-threading info. (correction again)

Author: Robert Hyatt

Date: 10:35:42 04/18/04


On April 18, 2004 at 13:18:36, Robert Hyatt wrote:

>On April 17, 2004 at 16:47:22, Vincent Diepeveen wrote:
>
>>On April 16, 2004 at 22:57:41, Robert Hyatt wrote:
>>
>>the only interesting thing for us is how fast it is when doing hashtable
>>lookups.
>>
>>all the bandwidth latencies you mention here are not workable data.
>>
>>you need to know as a programmer what the time is to get your hashtable entries.
>>
>>This you can test easily with more than two different test programs: one from me,
>>one from Dieter.
>>
>>both show the same result.
>>
>>Opteron has a 2.5x faster latency there.
>
>And exactly what opteron did you test on?  I gave numbers for a 4 cpu system.
>
>random access latency on my 4-way 700mhz xeon is 125ns to  375ns.  125 ns if not
>thrashing the TLB, 375 if the TLB can't keep up.
>
>4-way 2.2ghz opteron had 70-210ns latency with no TLB thrashing.  280-840ns
>latency if TLB is blasted.

The above is wrong.  70-210 is correct; 280 to 840 is not.  It should be
350-1050.  The Opteron requires _four_ extra memory reads to map a virtual
address to a real address if the TLB fails to provide a quick result.  Intel
only needs _two_ extra reads.  Those two extra reads offset the Opteron's
faster base latency.

Single processor Intel is 375ns latency for random access where the TLB is blown,
125ns where it is not.

Single processor Opteron is 350ns latency for random access where TLB is blown,
70ns where it is not.

Your 2.5x is simply wrong for all cases except where your virtual address space
is small enough to not blow the Opteron's TLB.

Best case (both machines, no TLB problems) is 125ns for Intel vs 70ns for AMD.

Worst case (both machines, TLB thrashed) is 375ns for Intel vs 350ns for AMD.

Only when you compare Intel's worst case against AMD's best case do you get a big
difference: 375ns with the Intel TLB thrashing vs 70ns with AMD not thrashing.

You need to re-compute your numbers _correctly_, on a real 84x Opteron.
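
Worked out, the arithmetic behind those figures looks like this.  It is only a
rough sketch of the model implied above (my assumption: a TLB miss adds one full
memory read per page-table level on top of the base latency), not a benchmark:

  #include <stdio.h>

  int main(void) {
      /* base random-access latencies in ns, from the measurements above */
      int xeon_base    = 125;   /* 4-way 700MHz Xeon, TLB hit                 */
      int opteron_near =  70;   /* 4-way 2.2GHz Opteron, local memory         */
      int opteron_far  = 210;   /* same Opteron, most remote memory           */

      int xeon_extra    = 2;    /* Intel: two extra reads on a TLB miss       */
      int opteron_extra = 4;    /* Opteron: four extra reads (48-bit mapping) */

      printf("Xeon,    TLB hit : %d ns\n", xeon_base);
      printf("Xeon,    TLB miss: %d ns\n", xeon_base * (1 + xeon_extra));     /* 375  */
      printf("Opteron, TLB hit : %d-%d ns\n", opteron_near, opteron_far);
      printf("Opteron, TLB miss: %d-%d ns\n",
             opteron_near * (1 + opteron_extra),                              /* 350  */
             opteron_far  * (1 + opteron_extra));                             /* 1050 */
      return 0;
  }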



>
>That is _not_ 2.5X.
>
>And those are actual numbers from a real machine.
>
>
>
>
>>
>>The rest is just irrelevant.
>
>Are you talking about your data?  Again, what opteron did you run on?  I can
>give you the serial number of the one I tested, as well as for the quad 700 xeon
>box I still have...
>
>
>
>
>
>>
>>>On April 16, 2004 at 20:05:47, Vincent Diepeveen wrote:
>>>
>>>>On April 16, 2004 at 14:32:32, Anthony Cozzie wrote:
>>>>
>>>>>On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:
>>>>>
>>>>>>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
>>>>>><snip>
>>>>>>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>>>>>>
>>>>>>>>>>Evaltable would be my guess though. Doing a random lookup in a big hashtable on
>>>>>>>>>>a 400Mhz dual Xeon costs, when it is not in the cache, around 400ns.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>>>>>>>>
>>>>>>>>We all know that when things are in L2 cache it's faster. However when using
>>>>>>>>hashtables by definition you are busy with TLB thrashing.
>>>>>>>
>>>>>>>Wrong.
>>>>>>>
>>>>>>>Do you know what the TLB does?  Do you know how big it is?  Do you know what
>>>>>>>going from 4KB to 2MB/4MB page sizes does to that?
>>>>>>>
>>>>>>>Didn't think so...
>>>>>>>
>>>>>>>>
>>>>>>>>Use big hashtables, start doing lookups to your hashtable, and it costs on average
>>>>>>>>400 ns on a dual K7 and your dual Xeon. Saying that under conditions A and B,
>>>>>>>>which hardly happen, it is 150ns makes no sense. When it's in L2 cache on the
>>>>>>>>Opteron it just costs 13 cycles. Same type of comparison.
>>>>>>>
>>>>>>>Using 4mb pages, my hash probes do _not_ take 400ns.  You can say it all you
>>>>>>>want, but it will never be true.  The proof is intuitive for anyone
>>>>>>>understanding the relationship between number of virtual pages and TLB size.
>>>>>>>
>>>>>>
>>>>>>Hi Bob,
>>>>>>
>>>>>>I guess you are talking about P4/Xeon.
>>>>>>
>>>>>>From what I have read so far about the Opteron, there is a two-level Data-TLB as
>>>>>>"part" of the L1 Data Cache (1024 64-byte cache lines), which maps the
>>>>>>most-recently-used virtual addresses to their physical addresses. The primary TLB
>>>>>>has 40 entries, 32 for 4KB pages and only eight for 2MB pages. The secondary
>>>>>>contains 512 entries for 4KB pages only. De Vries explains the "expensive"
>>>>>>table-walk very instructively.
>>>>>>
>>>>>>Do you believe, with huge random access tables, let's say >= 512MB,
>>>>>>that eight 2MB pages help to avoid TLB thrashing?
>>>>>>
>>>>>>Are there special OS-dependent mallocs to get those huge pages?
>>>>>>What about using one 2M page for some combined CONST or DATA-segments?
>>>>>>It would be nice to guide the linker that way.
>>>>>>
>>>>>>Thanks,
>>>>>>Gerd
>>>>>>
>>>>>>
>>>>>>some references:
>>>>>>
>>>>>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
>>>>>>Processors
>>>>>>Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>>A.9 Translation-Lookaside Buffer
>>>>>>
>>>>>>
>>>>>>Understanding the detailed Architecture of AMD's 64 bit Core, by Hans de Vries
>>>>>>
>>>>>>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>>>>>>
>>>>>>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLB's
>>>>>>3.4 The 512 entry second level TLB
>>>>>>3.16 The TLB Flush Filter CAM
>>>>>
>>>>>
>>>>>Opteron blows in this regard, but I believe that all 64(?) of the P4's TLB
>>>>>entries can be toggled to large pages.
>>>>>
>>>>>Also, it is worth noting that the Opteron's current TLB supports only 4MB of
>>>>>memory using the 4KB pages, so 8 MB is still an improvement :)  Hopefully when
>>>>>AMD transitions to the 90nm process the new core will fix this.
>>>>>
>>>>>anthony
>>>>
>>>>Opteron doesn't blow at all. Just test the speed at which you can get data on
>>>>the Opteron.
>>>>
>>>>It's 2.5 times faster than Xeon in that respect.
>>>
>>>
>>>Wrong.  Just do the math.
>>>
>>>4-way opteron.  basic latency will be about 70ns for local memory, 140 for two
>>>close blocks of non-local memory, and 210 for that last block of non-local
>>>memory.
>>>
>>>Intel = 150ns period.  Thrash the TLB and it can go to 450ns for the two-level
>>>map.  Shrink to 4mb pages and this becomes a 1-level map with a max latency of
>>>300ns with thrashed TLB, or 150 if the TLB is pretty effective.
>>>
>>>Now thrash the TLB on the opteron and every memory reference will require 4
>>>memory references (If I read right, since opteron uses 48 bit virtual address
>>>space, the map is broken into three parts).  Best case is 280ns for four probes
>>>to local memory.  Worst case is 840ns for four probes to most remote memory.
>>>You could make the O/S replicate the memory map into local memory for all the
>>>processors (4x memory waste) and limit this to 210ns for mapping plus the
>>>70-210ns cost to actually fetch the data.
>>>
>>>Intel's big pages help quite a bit.  Only problem is that the intel TLB is not
>>>_that_ big.  But 4mb pages reduce the requirement.  AMD will fix this soon...
>>>
>>>The opteron advantage evaporates quickly, when you use big pages on Intel...
>>>
>>>Processor is still very good of course.  But it will get even better when this
>>>is cleaned up.


