Computer Chess Club Archives


Subject: Re: Some new hyper-threading info. (correction)

Author: Anthony Cozzie

Date: 06:19:02 04/17/04


On April 17, 2004 at 05:27:23, Vasik Rajlich wrote:

>On April 16, 2004 at 23:15:34, Robert Hyatt wrote:
>
>>On April 16, 2004 at 22:57:41, Robert Hyatt wrote:
>>
>>>On April 16, 2004 at 20:05:47, Vincent Diepeveen wrote:
>>>
>>>>On April 16, 2004 at 14:32:32, Anthony Cozzie wrote:
>>>>
>>>>>On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:
>>>>>
>>>>>>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
>>>>>><snip>
>>>>>>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>>>>>>
>>>>>>>>>>Evaltable would be my guess though. Doing a random lookup into a big
>>>>>>>>>>hashtable on a 400Mhz dual Xeon costs around 400ns when it is not in
>>>>>>>>>>the cache.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>>>>>>>>
>>>>>>>>We all know that when things are in L2 cache it's faster. However, when
>>>>>>>>using hashtables you are by definition busy with TLB thrashing.
>>>>>>>
>>>>>>>Wrong.
>>>>>>>
>>>>>>>Do you know what the TLB does?  Do you know how big it is?  Do you know what
>>>>>>>going from 4KB to 2MB/4MB page sizes does to that?
>>>>>>>
>>>>>>>Didn't think so...
>>>>>>>
>>>>>>>>
>>>>>>>>Use big hashtables and start doing lookups and it costs on average 400 ns
>>>>>>>>on a dual K7 and your dual Xeon. Saying that it is 150ns under conditions
>>>>>>>>A and B, which hardly ever hold, makes no sense. When it's in L2 cache on
>>>>>>>>the Opteron it costs just 13 cycles. Same type of comparison.
>>>>>>>
>>>>>>>Using 4mb pages, my hash probes do _not_ take 400ns.  You can say it all you
>>>>>>>want, but it will never be true.  The proof is intuitive for anyone
>>>>>>>understanding the relationship between number of virtual pages and TLB size.
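To make that page-count arithmetic concrete (illustrative numbers, not
measurements from any particular machine):

    512MB table / 4KB pages = 131,072 pages -> far more than the few dozen
                                               entries any TLB holds, so a
                                               random probe almost always
                                               misses the TLB
    512MB table / 4MB pages =     128 pages -> a modest TLB now covers a
                                               useful fraction of the table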
>>>>>>>
>>>>>>
>>>>>>Hi Bob,
>>>>>>
>>>>>>I guess you are talking about P4/Xeon.
>>>>>>
>>>>>>From what I have read so far about the Opteron, there is a two-level
>>>>>>Data-TLB as "part" of the L1 data cache (1024 64-byte cache lines), which
>>>>>>maps the most-recently-used virtual addresses to their physical addresses.
>>>>>>The primary TLB has 40 entries: 32 for 4KB pages, only eight for 2MB pages.
>>>>>>The secondary contains 512 entries, for 4KB pages only. De Vries explains
>>>>>>the "expensive" table-walk very instructively.
>>>>>>
>>>>>>With huge random-access tables, say >= 512MB, do you believe that eight
>>>>>>2MB pages help to avoid TLB thrashing?
>>>>>>
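(A quick sanity check on those eight entries, with illustrative arithmetic:
8 x 2MB = 16MB of coverage. Against a 512MB table probed at random, only
16/512 = ~3% of probes can hit one of those eight entries, so the entries
themselves barely help; the larger win is that 2MB pages make the page-table
walk one level shorter when a miss does occur.)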
>>>>>>Are there special OS-dependent mallocs to get those huge pages?
>>>>>>What about using one 2M page for some combined CONST or DATA-segments?
>>>>>>It would be nice to guide the linker that way.
>>>>>>
>>>>>>Thanks,
>>>>>>Gerd
>>>>>>
>>>>>>
>>>>>>some references:
>>>>>>
>>>>>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
>>>>>>Processors
>>>>>>Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>>A.9 Translation-Lookaside Buffer
>>>>>>
>>>>>>
>>>>>>Understanding the detailed Architecture of AMD's 64 bit Core
>>>>>>                 by Hans de Vries
>>>>>>
>>>>>>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>>>>>>
>>>>>>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLB's
>>>>>>3.4 The 512 entry second level TLB
>>>>>>3.16 The TLB Flush Filter CAM
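On the OS-dependent allocation question: there is no portable malloc for huge
pages, but each OS exposes a call. Below is a minimal sketch for a current
Linux kernel (MAP_HUGETLB postdates this thread; huge pages must first be
reserved, e.g. via /proc/sys/vm/nr_hugepages). On Windows the analogue is
VirtualAlloc with MEM_LARGE_PAGES. The 512MB size is taken from the example
above; everything else is illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define TABLE_BYTES (512UL * 1024 * 1024)   /* 512MB hash table */

    void *alloc_hash(void)
    {
        /* Ask for 2MB pages: one TLB entry then maps 2MB instead of 4KB. */
        void *p = mmap(NULL, TABLE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* No huge pages reserved: fall back to ordinary 4KB pages. */
            p = mmap(NULL, TABLE_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); exit(1); }
        }
        return p;
    }

    int main(void)
    {
        void *table = alloc_hash();
        /* ... use table as the transposition table ... */
        munmap(table, TABLE_BYTES);
        return 0;
    }

As for guiding the linker to put CONST/DATA segments into a 2MB page, the
standard toolchains do not expose that directly; on Linux the libhugetlbfs
library can remap a program's segments into huge pages.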
>>>>>
>>>>>
>>>>>Opteron blows in this regard, but I believe that all 64(?) of the P4's TLB
>>>>>entries can be toggled to large pages.
>>>>>
>>>>>Also, it is worth noting that the Opteron's current TLB covers only about
>>>>>2MB of memory using the 4KB pages (544 entries x 4KB), so eight 2MB pages
>>>>>(16MB) is still an improvement :)  Hopefully when AMD transitions to the
>>>>>90nm process the new core will fix this.
>>>>>
>>>>>anthony
>>>>
>>>>Opteron doesn't blow at all. Just test the speed at which you can get data
>>>>on the Opteron.
>>>>
>>>>It's 2.5 times faster than the Xeon in that respect.
>>>
>>>
>>>Wrong.  Just do the math.
>>>
>>>4-way opteron.  basic latency will be about 70ns for local memory, 140 for two
>>>close blocks of non-local memory, and 210 for that last block of non-local
>>>memory.
>>>
>>>Intel = 150ns period.  Thrash the TLB and it can go to 450ns for the two-level
>>>map.  Shrink to 4mb pages and this becomes a 1-level map with a max latency of
>>>300ns with thrashed TLB, or 150 if the TLB is pretty effective.
>>>
>>>Now thrash the TLB on the opteron and every memory reference will require 4
>>>memory references (If I read right, since opteron uses 48 bit virtual address
>>>space, the map is broken into three parts).
>>
>>The above is wrong.  The memory mapping table is broken into _four_ parts, not
>>three.  Add another 70ns to 210ns to all my estimates...  When I originally
>>looked at this a couple of months back, I read "four" to be 3 map lookups plus
>>the reference to fetch the data after the address is mapped.  It really is 4 map
>>lookups plus another to fetch the actual data.
>>
>>Benefit is that opteron has (at present) a 40 bit physical address space, and a
>>48 bit virtual address space.  Intel is 32 bit virtual per process, 36 bit
>>physical address space...
>>
>>
>>> Best case is 280ns for four probes
>>>to local memory.  Worst case is 840ns for four probes to most remote memory.
>>>You could make the O/S replicate the memory map into local memory for all the
>>>processors (4x memory waste) and limit this to 210ns for mapping plus the
>>>70-210ns cost to actually fetch the data.
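Putting the corrected numbers together (arithmetic from the figures above,
not new measurements): with a thrashed TLB each probe becomes 4 page-map
reads plus the data read itself, so

    all 5 references local:        4 x 70ns  +  70ns =  350ns
    all 5 to most remote memory:   4 x 210ns + 210ns = 1050ns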
>>>
>>>Intel's big pages help quite a bit.  Only problem is that the intel TLB is not
>>>_that_ big.  But 4mb pages reduce the requirement.  AMD will fix this soon...
>>>
>>>The opteron advantage evaporates quickly, when you use big pages on Intel...
>>>
>>>Processor is still very good of course.  But it will get even better when this
>>>is cleaned up.
>
>I wonder if there could be some way to start this fetching process earlier.
>
>For example, as soon as you generate a move, you instruct/fool the processor
>into starting the hash entry fetch right there, while normal execution continues
>in parallel until it needs this data.
>
>Also - it wasn't clear to me - if the data you are looking for is in a 2MB
>page, are you skipping steps 1-3 of the table walk, or only step 3 (leaving
>three steps)?
>
>If you skip steps 1-3, I can't imagine why AMD would only have 8 of these
>entries.
>
>Vas

I have looked into this extensively.  The problem is that if a prefetch misses
in the TLB, the processor simply drops it.  So if you use big pages (so the
prefetch's address translation stays in the TLB) you can cut your
transposition table latency to almost zero with some prefetching in makemove.

anthony
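A minimal sketch of that prefetch-in-makemove idea, assuming GCC's
__builtin_prefetch and a simple Zobrist-keyed table; the names (HashEntry,
hash_table, HASH_MASK, new_key) are illustrative, not from any particular
engine:

    #include <stdint.h>

    typedef struct {
        uint64_t key;            /* Zobrist signature                 */
        int32_t  score;
        int16_t  move;
        int8_t   depth, flags;
    } HashEntry;

    #define HASH_MASK 0x3FFFFUL  /* number of entries - 1 (assumed)   */
    static HashEntry hash_table[HASH_MASK + 1];

    void make_move(uint64_t new_key /* , ... move arguments ... */)
    {
        /* ... update the board; new_key is the child's Zobrist key ... */

        /* Start pulling the entry's cache line in now, so the probe at
           the top of search() hits cache instead of stalling 150-400ns
           on main memory.  The hardware silently drops a prefetch whose
           address translation misses the TLB, which is why big pages
           matter for this trick. */
        __builtin_prefetch(&hash_table[new_key & HASH_MASK], 0, 0);
    }

By the time the recursive search call actually probes the table, the line is
(with luck) already in cache, which is what "latency to almost zero" means
here.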


