Computer Chess Club Archives


Subject: Re: Some new hyper-threading info.

Author: Robert Hyatt
Date: 09:39:57 04/17/04
On April 17, 2004 at 09:22:31, Anthony Cozzie wrote:

>On April 16, 2004 at 22:57:41, Robert Hyatt wrote:
>
>>On April 16, 2004 at 20:05:47, Vincent Diepeveen wrote:
>>
>>>On April 16, 2004 at 14:32:32, Anthony Cozzie wrote:
>>>
>>>>On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:
>>>>
>>>>>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
>>>>><snip>
>>>>>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>>>>>
>>>>>>>>>Evaltable would be my guess though. A random lookup into a big hashtable on
>>>>>>>>>a 400MHz dual Xeon costs around 400ns when the entry is not in the cache.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>>>>>>>
>>>>>>>We all know that when things are in L2 cache it's faster. However, when using
>>>>>>>hashtables you are by definition busy with TLB thrashing.
>>>>>>
>>>>>>Wrong.
>>>>>>
>>>>>>Do you know what the TLB does?  Do you know how big it is?  Do you know what
>>>>>>going from 4KB to 2MB/4MB page sizes does to that?
>>>>>>
>>>>>>Didn't think so...
>>>>>>
>>>>>>>
>>>>>>>Use big hashtables, start doing lookups into them, and each lookup costs on
>>>>>>>average 400 ns on a dual K7 and on your dual Xeon. Saying that it is 150ns under
>>>>>>>conditions A and B, which hardly ever happen, makes no sense. When the entry is in
>>>>>>>L2 cache on an Opteron it costs just 13 cycles. Same type of comparison.
>>>>>>
>>>>>>Using 4mb pages, my hash probes do _not_ take 400ns.  You can say it all you
>>>>>>want, but it will never be true.  The proof is intuitive for anyone
>>>>>>understanding the relationship between number of virtual pages and TLB size.
>>>>>>
>>>>>
>>>>>Hi Bob,
>>>>>
>>>>>I guess you are talking about P4/Xeon.
>>>>>
>>>>>From what I have read so far about Opteron, there is a two-level Data-TLB as
>>>>>"part" of the L1 Data Cache (1024 64-byte cache lines), which maps the
>>>>>most-recently-used virtual addresses to their physical addresses. The primary
>>>>>TLB has 40 entries: 32 for 4KB pages and only eight for 2MB pages. The secondary
>>>>>TLB contains 512 entries, for 4KB pages only. De Vries explains the "expensive"
>>>>>table-walk very instructively.
>>>>>
>>>>>Do you believe that, with huge random-access tables, let's say >= 512MB,
>>>>>eight 2MB pages help to avoid TLB thrashing?
>>>>>
>>>>>Are there special OS-dependent mallocs to get those huge pages?
>>>>>What about using one 2M page for some combined CONST or DATA-segments?
>>>>>It would be nice to guide the linker that way.
>>>>>
>>>>>Thanks,
>>>>>Gerd
>>>>>
>>>>>
>>>>>some references:
>>>>>
>>>>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
>>>>>Processors
>>>>>Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>A.9 Translation-Lookaside Buffer
>>>>>
>>>>>
>>>>>Understanding the detailed Architecture of AMD's 64 bit Core
>>>>>                 by Hans de Vries
>>>>>
>>>>>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>>>>>
>>>>>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primairy TLB's
>>>>>3.4 The 512 entry second level TLB
>>>>>3.16 The TLB Flush Filter CAM
>>>>
>>>>
>>>>Opteron blows in this regard, but I believe that all 64(?) of the P4's TLB
>>>>entries can be toggled to large pages.
>>>>
>>>>Also, it is worth noting that the Opteron's current TLB supports only 4MB of
>>>>memory using the 4KB pages, so 8 MB is still an improvement :)  Hopefully when
>>>>AMD transitions to the 90nm process the new core will fix this.
>>>>
>>>>anthony
>>>
>>>Opteron doesn't blow at all. Just test how fast you can get data to the
>>>processor on an Opteron.
>>>
>>>It's 2.5 times faster than Xeon in that respect.
>>
>>
>>Wrong.  Just do the math.
>>
>>4-way opteron.  basic latency will be about 70ns for local memory, 140 for two
>>close blocks of non-local memory, and 210 for that last block of non-local
>>memory.
>>
>>Intel = 150ns, period.  Thrash the TLB and it can go to 450ns for the two-level
>>map.  Shrink to 4mb pages and this becomes a 1-level map with a max latency of
>>300ns with thrashed TLB, or 150 if the TLB is pretty effective.
>>
>>Now thrash the TLB on the opteron and every memory reference will require 4
>>memory references (If I read right, since opteron uses 48 bit virtual address
>>space, the map is broken into three parts).  Best case is 280ns for four probes
>>to local memory.  Worst case is 840ns for four probes to most remote memory.
>>You could make the O/S replicate the memory map into local memory for all the
>>processors (4x memory waste) and limit this to 210ns for mapping plus the
>>70-210ns cost to actually fetch the data.
>>
>>Intel's big pages help quite a bit.  Only problem is that the intel TLB is not
>>_that_ big.  But 4mb pages reduce the requirement.  AMD will fix this soon...
>>
>>The opteron advantage evaporates quickly, when you use big pages on Intel...
>>
>
>On most things the Opteron blows the P4 out of the water, but you have to wonder
>why they didn't fix this.  It's intended to be a server CPU -> lots of memory.
>
>anthony

I would agree.  Even with this memory weakness it is far beyond any PIV for any
test I have seen.  But it will probably be even better when this gets fixed...
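
The latency arithmetic in the quoted post above can be reproduced with a small
sketch. It only uses the figures given there (150ns for Intel; 70/140/210ns for
local, near and far memory on a 4-way Opteron) and the post's reading that a
thrashed TLB adds one memory read per level of the page map (two levels for
Intel 4KB pages, one for 4MB pages, three for Opteron). The function name and
parameters are made up for illustration; this is a back-of-envelope model, not
a measurement.

```python
def probe_cost_ns(data_ns, table_levels=0, table_ns=None):
    """Cost of one hash probe in ns.

    data_ns      -- latency of the memory read that fetches the hash entry
    table_levels -- page-map reads needed first (0 if the TLB hits)
    table_ns     -- latency of each page-map read (defaults to data_ns)
    """
    if table_ns is None:
        table_ns = data_ns
    return table_levels * table_ns + data_ns


# Intel, per the post: 150ns per memory read.
intel_tlb_hit = probe_cost_ns(150)                   # TLB effective
intel_4k_thrashed = probe_cost_ns(150, 2)            # two-level map
intel_4m_thrashed = probe_cost_ns(150, 1)            # one-level map

# 4-way Opteron, per the post: three map reads plus the data read.
opteron_best = probe_cost_ns(70, 3)                  # everything local
opteron_worst = probe_cost_ns(210, 3)                # everything most remote
# Page map replicated into local memory, data still most remote.
opteron_replicated_far = probe_cost_ns(210, 3, table_ns=70)

# Average raw latency if data is spread evenly over the four nodes.
opteron_avg_raw = (70 + 140 + 140 + 210) / 4

print(intel_tlb_hit, intel_4k_thrashed, intel_4m_thrashed)
print(opteron_best, opteron_worst, opteron_replicated_far, opteron_avg_raw)
```

With the TLB hitting, the model gives the raw latencies only, which is where
the Opteron's advantage comes from; the gap closes or reverses only once the
map reads are added in.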



>
>>Processor is still very good of course.  But it will get even better when this
>>is cleaned up.
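
For the question of how much a handful of large-page entries can help, a rough
"TLB reach" calculation (entries times page size) is a useful sanity check. The
entry counts below are the ones mentioned in the thread (Opteron: 512 secondary
entries for 4KB pages, eight primary entries for 2MB pages; P4: 64 entries,
which the thread itself marks with a question mark), and the 512MB table size
is the example from the thread. The helper names are hypothetical, and the hit
fraction assumes uniformly random probes, so treat the output as an estimate
only.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_bytes


def hit_fraction(reach_bytes, table_bytes):
    """Share of uniformly random probes that land on a mapped page."""
    return min(1.0, reach_bytes / table_bytes)


table = 512 * MB  # hash table size used as the example in the thread

opteron_4k = tlb_reach_bytes(512, 4 * KB)  # secondary TLB, 4KB pages
opteron_2m = tlb_reach_bytes(8, 2 * MB)    # primary TLB, 2MB pages
p4_4m = tlb_reach_bytes(64, 4 * MB)        # 64 entries is unconfirmed in the thread

for name, reach in [("Opteron 4KB", opteron_4k),
                    ("Opteron 2MB", opteron_2m),
                    ("P4 4MB", p4_4m)]:
    print(f"{name}: reach {reach // MB} MB, "
          f"hit fraction {hit_fraction(reach, table):.4f}")
```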
Last modified: Thu, 15 Apr 21 08:11:13 -0700