Computer Chess Club Archives



Subject: Re: Some new hyper-threading info. (correction)

Author: Anthony Cozzie

Date: 07:54:05 04/18/04



On April 18, 2004 at 07:25:58, Vasik Rajlich wrote:

>On April 17, 2004 at 09:19:02, Anthony Cozzie wrote:
>
>>On April 17, 2004 at 05:27:23, Vasik Rajlich wrote:
>>
>>>On April 16, 2004 at 23:15:34, Robert Hyatt wrote:
>>>
>>>>On April 16, 2004 at 22:57:41, Robert Hyatt wrote:
>>>>
>>>>>On April 16, 2004 at 20:05:47, Vincent Diepeveen wrote:
>>>>>
>>>>>>On April 16, 2004 at 14:32:32, Anthony Cozzie wrote:
>>>>>>
>>>>>>>On April 16, 2004 at 14:27:55, Gerd Isenberg wrote:
>>>>>>>
>>>>>>>>On April 16, 2004 at 12:40:21, Robert Hyatt wrote:
>>>>>>>><snip>
>>>>>>>>>>>>So a possible alternative to evaltable would be hashing in qsearch.
>>>>>>>>>>>>
>>>>>>>>>>>>Evaltable would be my guess though. Doing a random lookup into a big hashtable on
>>>>>>>>>>>>a 400MHz dual Xeon costs around 400ns when it is not in the cache.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>Depends.  Use big memory pages and it costs 150 ns.  No TLB thrashing then.
>>>>>>>>>>
>>>>>>>>>>We all know that when things are in L2 cache it's faster. However, when using
>>>>>>>>>>hashtables you are by definition busy with TLB thrashing.
>>>>>>>>>
>>>>>>>>>Wrong.
>>>>>>>>>
>>>>>>>>>Do you know what the TLB does?  Do you know how big it is?  Do you know what
>>>>>>>>>going from 4KB to 2MB/4MB page sizes does to that?
>>>>>>>>>
>>>>>>>>>Didn't think so...
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Use big hashtables, start doing lookups, and it costs on average
>>>>>>>>>>400 ns on a dual K7 and your dual Xeon. Saying that under conditions A and B,
>>>>>>>>>>which hardly happen, it is 150ns makes no sense. When it's in L2 cache on
>>>>>>>>>>Opteron it just costs 13 cycles. Same type of comparison.
>>>>>>>>>
>>>>>>>>>Using 4mb pages, my hash probes do _not_ take 400ns.  You can say it all you
>>>>>>>>>want, but it will never be true.  The proof is intuitive for anyone
>>>>>>>>>understanding the relationship between number of virtual pages and TLB size.
>>>>>>>>>
>>>>>>>>
>>>>>>>>Hi Bob,
>>>>>>>>
>>>>>>>>I guess you are talking about P4/Xeon.
>>>>>>>>
>>>>>>>>What I have read so far about Opteron: there is a two-level data TLB as part of the
>>>>>>>>L1 data cache (1024 64-byte cache lines), which maps the most recently used
>>>>>>>>virtual addresses to their physical addresses. The primary TLB has 40 entries,
>>>>>>>>32 for 4KB pages and only eight for 2MB pages. The secondary contains 512 entries
>>>>>>>>for 4KB pages only. De Vries explains the expensive table walk very
>>>>>>>>instructively.
>>>>>>>>
>>>>>>>>Do you believe, with huge random-access tables, say >= 512MB,
>>>>>>>>that eight 2MB pages help to avoid TLB thrashing?
>>>>>>>>
>>>>>>>>Are there special OS-dependent mallocs to get those huge pages?
>>>>>>>>What about using one 2M page for some combined CONST or DATA-segments?
>>>>>>>>It would be nice to guide the linker that way.
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Gerd
>>>>>>>>
>>>>>>>>
>>>>>>>>some references:
>>>>>>>>
>>>>>>>>Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
>>>>>>>>Processors
>>>>>>>>Appendix A Microarchitecture for AMD Athlon™ 64 and AMD Opteron™ Processors
>>>>>>>>A.9 Translation-Lookaside Buffer
>>>>>>>>
>>>>>>>>
>>>>>>>>Understanding the detailed Architecture of AMD's 64 bit Core
>>>>>>>>                 by Hans de Vries
>>>>>>>>
>>>>>>>>http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
>>>>>>>>
>>>>>>>>3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLBs
>>>>>>>>3.4 The 512 entry second level TLB
>>>>>>>>3.16 The TLB Flush Filter CAM
>>>>>>>
>>>>>>>
>>>>>>>Opteron blows in this regard, but I believe that all 64(?) of the P4's TLB
>>>>>>>entries can be toggled to large pages.
>>>>>>>
>>>>>>>Also, it is worth noting that the Opteron's current TLB covers only 4MB of
>>>>>>>memory using 4KB pages, so the 16MB reachable through the eight 2MB entries is
>>>>>>>still an improvement :)  Hopefully when AMD transitions to the 90nm process the
>>>>>>>new core will fix this.
>>>>>>>
>>>>>>>anthony
>>>>>>
>>>>>>Opteron doesn't blow at all. Just test the speed at which you can get data
>>>>>>out of memory on the Opteron.
>>>>>>
>>>>>>It's 2.5 times faster than Xeon in that respect.
>>>>>
>>>>>
>>>>>Wrong.  Just do the math.
>>>>>
>>>>>4-way opteron.  basic latency will be about 70ns for local memory, 140 for two
>>>>>close blocks of non-local memory, and 210 for that last block of non-local
>>>>>memory.
>>>>>
>>>>>Intel = 150ns, period.  Thrash the TLB and it can go to 450ns for the two-level
>>>>>map.  Switch to 4MB pages and this becomes a one-level map with a max latency of
>>>>>300ns with a thrashed TLB, or 150ns if the TLB is pretty effective.
>>>>>
>>>>>Now thrash the TLB on the opteron and every memory reference will require 4
>>>>>memory references (If I read right, since opteron uses 48 bit virtual address
>>>>>space, the map is broken into three parts).
>>>>
>>>>The above is wrong.  The memory mapping table is broken into _four_ parts, not
>>>>three.  Add another 70ns to 210ns to all my estimates...  When I originally
>>>>looked at this a couple of months back, I read "four" to be 3 map lookups plus
>>>>the reference to fetch the data after the address is mapped.  It really is 4 map
>>>>lookups plus another to fetch the actual data.
>>>>
>>>>Benefit is that opteron has (at present) a 40 bit physical address space, and a
>>>>48 bit virtual address space.  Intel is 32 bit virtual per process, 36 bit
>>>>physical address space...
>>>>
>>>>
>>>>> Best case is 280ns for four probes
>>>>>to local memory.  Worst case is 840ns for four probes to most remote memory.
>>>>>You could make the O/S replicate the memory map into local memory for all the
>>>>>processors (4x memory waste) and limit this to 210ns for mapping plus the
>>>>>70-210ns cost to actually fetch the data.
>>>>>
>>>>>Intel's big pages help quite a bit.  Only problem is that the Intel TLB is not
>>>>>_that_ big.  But 4MB pages reduce the requirement.  AMD will fix this soon...
>>>>>
>>>>>The opteron advantage evaporates quickly, when you use big pages on Intel...
>>>>>
>>>>>Processor is still very good of course.  But it will get even better when this
>>>>>is cleaned up.
>>>
>>>I wonder if there could be some way to start this fetching process earlier.
>>>
>>>For example, as soon as you generate a move, you instruct/fool the processor
>>>into starting the hash entry fetch right there, while normal execution continues
>>>in parallel until it needs this data.
>>>
>>>Also - it wasn't clear to me - if the data you are looking for is in a 2M page,
>>>are you skipping steps 1-3 of the table walk, or only step 3 (leaving three
>>>steps)?
>>>
>>>If you skip steps 1-3, I can't imagine why AMD would only have 8 of these
>>>entries.
>>>
>>>Vas
>>
>>I have looked into this extensively.  The problem is that if the prefetch misses
>>in the TLB, the hardware simply gives up.  So if you use big pages, you can cut
>>your transposition table latency to almost zero with some prefetching in makemove.
>>
>>anthony
>
>How did you conclude this? By testing, or reading?
>
>I looked around a little, it wasn't mentioned anywhere.
>
>http://www.amd.com/us-en/assets/content_type/DownloadableAssets/Optimization_-_Tim_Wilkens.pdf
>
>pages 6 & 17
>
>
>http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_AMD_GDC_2004_MW.pdf
>
>page 18
>
>
>etc ..
>
>But now I can see how Fritz ended up being coded in assembly :-)
>
>Vas

I'm not finding it either :(

I distinctly remember reading this somewhere, but you are welcome to try it
yourself.  Correct me if I am wrong, but if the page is not in the TLB, it might
even be paged out to disk, which would require a trip into the OS.  That might
explain it.

Vincent has challenged me to prove my theory that hugetlb is better, so after I
do that I may try prefetching again.  I should be able to prove something one
way or another.

anthony




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.