Computer Chess Club Archives


Subject: Re: Some new hyper-threading info.

Author: Robert Hyatt

Date: 19:47:01 04/16/04


On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:

>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
><snip>
>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>
>>>>>Evaltable would be my guess though. A random lookup into a big hashtable on
>>>>>a 400MHz dual Xeon costs around 400ns when the entry is not in the cache.
>>>>>
>>>>
>>>>
>>>>
>>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>>>
>>>We all know that when things are in L2 cache it's faster. However, when using
>>>hashtables you are by definition busy with TLB thrashing.
>>
>>Wrong.
>>
>>Do you know what the TLB does?  Do you know how big it is?  Do you know what
>>going from 4KB to 2MB/4MB page sizes does to that?
>>
>>Didn't think so...
>>
>>>
>>>Use big hashtables, start doing lookups into them, and each lookup costs on
>>>average 400 ns on a dual K7 and on your dual Xeon. Saying that it is 150ns
>>>under conditions A and B, which hardly ever happen, makes no sense. When it's
>>>in L2 cache on Opteron it costs just 13 cycles. Same type of comparison.
>>
>>Using 4MB pages, my hash probes do _not_ take 400ns.  You can say it all you
>>want, but it will never be true.  The proof is intuitive for anyone
>>understanding the relationship between number of virtual pages and TLB size.
>>
>
>Hi Bob,
>
>I guess you are talking about P4/Xeon.

Yes and No.  Xeon is what I have tested the "big pages" on of course.
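
A rough sketch of the page-count argument quoted above, not a measurement. The 512MB table size and the 64-entry TLB are hypothetical numbers chosen only for illustration (they are not Xeon specifications), and `pages_needed` and `tlb_coverage` are made-up helper names. The point is only how the number of virtual pages, and so the share of them the TLB can map at once, changes with page size:

```python
KB = 1024
MB = 1024 * KB


def pages_needed(table_bytes, page_bytes):
    """Number of virtual pages spanned by a table, rounded up."""
    return -(-table_bytes // page_bytes)


def tlb_coverage(table_bytes, page_bytes, tlb_entries):
    """Fraction of the table's pages that can hold a TLB entry at once."""
    return min(1.0, tlb_entries / pages_needed(table_bytes, page_bytes))


TABLE = 512 * MB   # hypothetical hash table size
TLB_ENTRIES = 64   # hypothetical TLB size, NOT a measured Xeon figure

for page in (4 * KB, 4 * MB):
    n = pages_needed(TABLE, page)
    cov = tlb_coverage(TABLE, page, TLB_ENTRIES)
    print(f"{page // KB:>5} KB pages: {n:>6} pages, TLB covers {cov:.4%}")
```

Under those assumptions the table spans 131072 pages at 4KB but only 128 pages at 4MB, so the same TLB covers a vastly larger share of random probes with big pages.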

>
>What I have read so far about Opteron: there is a two-level data TLB as "part" of
>the L1 data cache (1024 cache lines of 64 bytes), which maps the most recently
>used virtual addresses to their physical addresses. The primary TLB has 40
>entries, 32 for 4KB pages and only eight for 2MB pages. The secondary contains
>512 entries, for 4KB pages only. De Vries explains the "expensive" table walk
>very instructively.
>
>Do you believe that, with huge random-access tables, let's say >= 512MB,
>eight 2MB-page entries help to avoid TLB thrashing?

No, but then the question is whether that is how it works when you _only_ use 2MB
pages.  I don't have an Opteron handy yet.  We have a 32-node dual Opteron
cluster on order and are expecting it within a month or so.  Perhaps then I can
run some real tests to see what they are doing...
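
Until real tests are possible, the arithmetic behind that "No" can be sketched from the entry counts Gerd quotes above, taken at face value and not verified here. `tlb_reach` and `reach_fraction` are hypothetical helper names, and the 512MB table is Gerd's example size:

```python
KB = 1024
MB = 1024 * KB

# Entry counts as quoted by Gerd above; not independently verified here.
OPTERON_DTLB = {
    "L1, 4KB pages": (32, 4 * KB),
    "L1, 2MB pages": (8, 2 * MB),
    "L2, 4KB pages": (512, 4 * KB),
}


def tlb_reach(entries, page_bytes):
    """Bytes of address space the given TLB entries can map at once."""
    return entries * page_bytes


def reach_fraction(entries, page_bytes, table_bytes):
    """Share of a table that those TLB entries can map at once."""
    return min(1.0, tlb_reach(entries, page_bytes) / table_bytes)


TABLE = 512 * MB  # table size from Gerd's question
for name, (entries, page) in OPTERON_DTLB.items():
    reach = tlb_reach(entries, page)
    frac = reach_fraction(entries, page, TABLE)
    print(f"{name}: reach {reach // KB} KB, {frac:.3%} of a 512MB table")
```

If those counts are right, eight 2MB entries reach 16MB, which is 1/32 of a 512MB table, so most random probes would still miss the TLB unless the hardware handles an all-large-page setup differently.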

But Vincent was specifically quoting 400ns on Intel boxes.  And that may or may
not be correct depending on page size...



>
>Are there special OS-dependent mallocs to get those huge pages?

Not in Linux.  I used a patched kernel when I tested this last year.  There was
talk of putting part of that into stock Linux but I didn't follow the discussion
much...


>What about using one 2M page for some combined CONST or DATA-segments?
>It would be nice to guide the linker that way.


Yes, although it would kill portability as you'd have to know the exact platform
you run on...


>
>Thanks,
>Gerd
>
>
>some references:
>
>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
>Processors
>Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>A.9 Translation-Lookaside Buffer
>
>
>Understanding the detailed Architecture of AMD's 64 bit Core
>                 by Hans de Vries
>
>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>
>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primairy TLB's
>3.4 The 512 entry second level TLB
>3.16 The TLB Flush Filter CAM


