Computer Chess Club Archives


Subject: Re: Move ordering ?

Author: Robert Hyatt

Date: 10:28:05 10/25/98


On October 25, 1998 at 11:37:46, James Robertson wrote:

>On October 25, 1998 at 11:00:13, Robert Hyatt wrote:
>
>>On October 24, 1998 at 12:56:41, James Robertson wrote:
>>
>>>On October 24, 1998 at 09:59:19, Robert Hyatt wrote:
>>>
>>>>On October 24, 1998 at 01:14:12, Roberto Waldteufel wrote:
>>>>
>>>>>
>>>>>On October 23, 1998 at 20:52:23, Robert Hyatt wrote:
>>>>>
>>>>>>On October 23, 1998 at 10:39:40, James Robertson wrote:
>>>>>>
>>>>>>>On October 23, 1998 at 02:43:41, Roberto Waldteufel wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>On October 22, 1998 at 16:34:11, Robert Hyatt wrote:
>>>>>>>>
>>>>>>>>>On October 22, 1998 at 14:03:01, Roberto Waldteufel wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>I know this may seem like a dumb question, but could you explain exactly how the
>>>>>>>>>>cache works? I know the basic idea is to speed up memory access by holding
>>>>>>>>>>frequently accessed data in a special place, the "cache", from where it may be
>>>>>>>>>>accessed more quickly, but I don't know much more than that. Is the cache part
>>>>>>>>>>of RAM, or is it held on the CPU chip, or on a separate chip? What is the
>>>>>>>>>>significance of L1 and L2 cache? I have heard that sometimes the cache is
>>>>>>>>>>accessed at different speeds depending on the machine. Is it possible to add
>>>>>>>>>>more cache, and if so would this be likely to improve performance? The most
>>>>>>>>>>important thing I would like to understand is how I can organise my programs so
>>>>>>>>>>as to extract maximum benefit from the cache available (I use a P II 333MHz).
>>>>>>>>>>
>>>>>>>>>>In the Spanish Championships I heard that some programs (including mine),
>>>>>>>>>>especially the bigger ones, were at a disadvantage due to the absence of L2
>>>>>>>>>>cache on the Celeron machines. I don't know how big an effect this may have had.
>>>>>>>>>>If I better understood how caching works, maybe I could improve the way I code
>>>>>>>>>>things. I'm afraid I'm rather a novice regarding hardware details like this.
>>>>>>>>>>
>>>>>>>>>>Thanks in advance,
>>>>>>>>>>Roberto
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>The idea is simple.  On the pentium/pentiumII/etc machines, cache is broken
>>>>>>>>>up into "lines" of 32 bytes.  When the CPU fetches a word, an instruction,
>>>>>>>>>a byte, etc., the entire block of 32 bytes (one line) is copied to the cache
>>>>>>>>>in a very fast burst.  Then, from this point forward, rather than waiting
>>>>>>>>>for 20+ cpu cycles to fetch data, it is fetched from the L2 cache in 2
>>>>>>>>>cycles on the Pentium II, or 1 cycle on the pentium pro or xeons.  It can
>>>>>>>>>make the program run much faster as you might guess, in that when you
>>>>>>>>>modify memory, you really only modify the cache, and memory is updated way
>>>>>>>>>later if possible.  So a loop might increment a value 100 times, but memory
>>>>>>>>>is only fetched once and written to once, which saves substantial amounts
>>>>>>>>>of time.
>>>>>>>>>
>>>>>>>>>The plain celeron 300 doesn't have any L2 cache, while the newer celeronA
>>>>>>>>>or any that are 333MHz and faster have 128KB of L2 cache that is as fast
>>>>>>>>>as the pentium pro/xeon (core cpu speed rather than core cpu speed / 2).
>>>>>>>>>
>>>>>>>>>But it is nothing more than a high-speed buffer between the CPU and
>>>>>>>>>memory, but it is obviously much smaller than memory...
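>>>>>>>>>
>>>>>>>>>For example (just a sketch; the counter variable is made up):
>>>>>>>>>
>>>>>>>>>   int counter = 0;          /* first access loads its 32-byte line */
>>>>>>>>>   for (int i = 0; i < 100; i++)
>>>>>>>>>      counter++;             /* each increment hits the cached copy */
>>>>>>>>>   /* the modified line is written back to memory once, later */
>>>>>>>>>
>>>>>>>>>The loop touches real memory twice (one fetch, one write-back), not
>>>>>>>>>100 times.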
>>>>>>>>
>>>>>>>>Hi Bob,
>>>>>>>>
>>>>>>>>Thanks for the explanation. Would I be right in thinking that data structures of
>>>>>>>>exactly 32 bytes in length would be the most efficient on PII machines, so that
>>>>>>>>the structure fits exactly into one "line" of cache?
>>>>>>>
>>>>>>>I think that the computer uses the same method it does with the registers; i.e.
>>>>>>>the computer reads and writes in 32 bit chunks, regardless of the size of the
>>>>>>>data being transferred. It will take 2 machine cycles for a 64 bit structure,
>>>>>>>but an 8 bit and a 32 bit structure will both take 1 machine cycle. I think this
>>>>>>>is how it works; if I am wrong please correct me!!
>>>>>>>
>>>>>>>James
>>>>>>
>>>>>>
>>>>>>It's a complicated issue, but basically the pentium *always* reads 32 bytes at
>>>>>>a time using 4 consecutive cycles, so it can fill a complete cache line.  The
>>>>>>main thing is to put data close together that will be used at about the same
>>>>>>time to take advantage of this "pre-fetch"...
>>>>>>
>>>>>>
>>>>>I think I start to get the idea now. Say I am using a hash table of a million
>>>>>entries, and I have, say, 4 pieces of information to store in each entry I use.
>>>>>Then I do better to have one array of a million single, multi-purpose records
>>>>>than to have 4 separate arrays of a million entries each, one for each type of
>>>>>information. That way, when I access the first piece of information, I
>>>>>"pre-fetch" the others at the same time, whereas with four separate arrays, the
>>>>>information resides in 4 distant locations, right? I never really considered
>>>>>this sort of thing before, but maybe I could redesign a lot of my arrays in the
>>>>>light of the cache usage.
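>>>>>
>>>>>Something like this, I suppose (a sketch; the field names are invented):
>>>>>
>>>>>   struct entry { int a, b, c, d; };   /* 4 pieces side by side: 16 bytes, */
>>>>>   struct entry table[1000000];        /* half of one 32-byte cache line   */
>>>>>   /* rather than: int a[1000000], b[1000000], c[1000000], d[1000000];     */
>>>>>   /* where the 4 pieces of entry i sit in 4 distant cache lines           */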
>>>>>
>>>>>Best wishes,
>>>>>Roberto
>>>>>
>>>>
>>>>correct... also if you have a structure with something like this:
>>>>
>>>>   int a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,...;
>>>>
>>>>make sure that the fields you use in the same small piece of
>>>>code are adjacent in the structure, so that if you use "a" at line 10, it
>>>>will load b-h at the same time.  If you use those while they are still
>>>>in cache, you save a bunch of cycles waiting on another cache line fill.
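>>>>
>>>>Something like this (a sketch; the grouping is the point, not the names):
>>>>
>>>>   struct data {
>>>>      int a, b, c, d, e, f, g, h;   /* used together: 32 bytes, so one  */
>>>>                                    /* line fill (if the struct is      */
>>>>                                    /* 32-byte aligned)                 */
>>>>      int i, j, k, l, m, n, o, p;   /* used elsewhere: a second line    */
>>>>   };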
>>>
>>>I do not understand; say I have four 8-bit values in a row and I fetch
>>>the first; will the other three come along too? In C++, how would I pick up all
>>>four values at once? Once I do get them, in C++, how would I be sure they are
>>>still in cache when I want to use them (assuming I want to use them in the same
>>>small piece of code, or maybe the same function)?
>>>
>>>James
>>>
>>
>>
>>Here's the way to think about this:  Imagine that memory is broken up into
>>32-byte chunks, starting at address 0.  So you have chunk 0 = bytes 0-31,
>>then chunk 1 = bytes 32-63, and so forth for *all* of memory.  When you
>>reference *any* byte in a chunk, the entire chunk is loaded into L2 cache
>>in one "burst" transmission (4 bus cycles).  Since this happens, you should
>>group things in "clumps" of 32 bytes so that once you use any of those
>>bytes, you use the others quickly (while they are in cache) as well, as there
>>won't be any delays for any but the first byte in a "clump" (clump = cache
>>line, to use the technically correct term)...
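>>
>>In C++ you don't have to do anything special to "pick up" all four at
>>once; just declare them adjacent and use them soon (a sketch, with
>>made-up names):
>>
>>   unsigned char v[4] = {1, 2, 3, 4};  /* contiguous: one "clump"     */
>>   int sum = v[0];                     /* miss: loads the whole line  */
>>   sum += v[1] + v[2] + v[3];          /* hits: still in cache        */
>>
>>There is no way to *guarantee* the line stays cached; you just use the
>>values quickly, before other accesses push the line out.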
>>
>
>I still have problems understanding. Suppose you have, starting at address 0,
>four 8-bit values. If you access the first, then the whole "clump" of 32 bytes
>is loaded into the L2 cache. Ok, that makes sense.
>
>Suppose you access the second byte in chunk 0. Does the computer load chunk 0
>(bytes 0-31) into the cache, or does it load bytes 1-32? Thanks for the
>help!
>
>James
>

cache lines are *always* aligned on 32-byte boundaries...  so if you access
*any* byte between 0 and 31, all the bytes in the line get loaded from bytes
0-31 in memory...  non-boundary-aligned caches were done many years ago, but
cause problems with today's 32-bit "banks"...
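
To make the rule concrete (a sketch; the address is made up): since
32 = 2^5, masking off the low 5 bits of any address gives the start of
its cache line:

   unsigned long addr = 13;            /* access byte 13              */
   unsigned long line = addr & ~31UL;  /* -> 0: bytes 0-31 get loaded */
   /* accessing byte 33 would give line = 32: bytes 32-63 get loaded  */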





