Computer Chess Club Archives



Subject: Re: Move ordering ?

Author: James Robertson

Date: 08:37:46 10/25/98


On October 25, 1998 at 11:00:13, Robert Hyatt wrote:

>On October 24, 1998 at 12:56:41, James Robertson wrote:
>
>>On October 24, 1998 at 09:59:19, Robert Hyatt wrote:
>>
>>>On October 24, 1998 at 01:14:12, Roberto Waldteufel wrote:
>>>
>>>>
>>>>On October 23, 1998 at 20:52:23, Robert Hyatt wrote:
>>>>
>>>>>On October 23, 1998 at 10:39:40, James Robertson wrote:
>>>>>
>>>>>>On October 23, 1998 at 02:43:41, Roberto Waldteufel wrote:
>>>>>>
>>>>>>>
>>>>>>>On October 22, 1998 at 16:34:11, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On October 22, 1998 at 14:03:01, Roberto Waldteufel wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>I know this may seem like a dumb question, but could you explain exactly how the
>>>>>>>>>cache works? I know the basic idea is to speed up memory access by holding
>>>>>>>>>frequently accessed data in a special place, the "cache", from where it may be
>>>>>>>>>accessed more quickly, but I don't know much more than that. Is the cache part
>>>>>>>>>of RAM, or is it held on the CPU chip, or on a separate chip? What is the
>>>>>>>>>significance of L1 and L2 cache? I have heard that sometimes the cache is
>>>>>>>>>accessed at different speeds depending on the machine. Is it possible to add
>>>>>>>>>more cache, and if so would this be likely to improve performance? The most
>>>>>>>>>important thing I would like to understand is how I can organise my programs so
>>>>>>>>>as to extract maximum benefit from the cache available (I use a P II 333MHz).
>>>>>>>>>
>>>>>>>>>In the Spanish Championships I heard that some programs (including mine),
>>>>>>>>>especially the bigger ones, were at a disadvantage due to the absence of L2
>>>>>>>>>cache on the Celeron machines. I don't know how big an effect this may have had.
>>>>>>>>>If I better understood how caching works, maybe I could improve the way I code
>>>>>>>>>things. I'm afraid I'm rather a novice regarding hardware details like this.
>>>>>>>>>
>>>>>>>>>Thanks in advance,
>>>>>>>>>Roberto
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>The idea is simple.  On the pentium/pentiumII/etc machines, cache is broken
>>>>>>>>up into "lines" of 32 bytes.  When the CPU fetches a word, instruction,
>>>>>>>>a byte, etc, the entire block of 32 bytes (one line) is copied to the cache
>>>>>>>>in a very fast burst.  Then, from this point forward, rather than waiting
>>>>>>>>for 20+ cpu cycles to fetch data, it is fetched from the L2 cache in 2
>>>>>>>>cycles on the Pentium II, or 1 cycle on the pentium pro or xeons.  It can
>>>>>>>>make the program run much faster as you might guess, in that when you
>>>>>>>>modify memory, you really only modify the cache, and memory is updated way
>>>>>>>>later if possible.  So a loop might increment a value 100 times, but memory
>>>>>>>>is only fetched once and written to once, which saves substantial amounts
>>>>>>>>of time.
>>>>>>>>
>>>>>>>>The plain celeron 300 doesn't have any L2 cache, while the newer celeronA
>>>>>>>>or any that are 333mhz and faster have 128KB of L2 cache that is as fast
>>>>>>>>as the pentium pro/xeon (core cpu speed rather than core cpu speed / 2)
>>>>>>>>
>>>>>>>>But it is nothing more than a high-speed buffer between the CPU and
>>>>>>>>memory, but it is obviously much smaller than memory...
>>>>>>>
>>>>>>>Hi Bob,
>>>>>>>
>>>>>>>Thanks for the explanation. Would I be right in thinking that data structures of
>>>>>>>exactly 32 bytes length would be the most efficient on PII machines, so that
>>>>>>>the structure fits exactly into one "line" of cache?
>>>>>>
>>>>>>I think that the computer uses the same method it does with the registers; i.e.
>>>>>>the computer reads and writes in 32 bit chunks, regardless of the size of the
>>>>>>data being transferred. It will take 2 machine cycles for a 64 bit structure,
>>>>>>but an 8 bit and a 32 bit structure will both take 1 machine cycle. I think this
>>>>>>is how it works; if I am wrong please correct me!!
>>>>>>
>>>>>>James
>>>>>
>>>>>
>>>>>It's a complicated issue, but basically the pentium *always* reads 32 bytes at
>>>>>a time using 4 consecutive cycles, so it can fill a complete cache line.  The
>>>>>main thing is to put data close together that will be used at about the same
>>>>>time to take advantage of this "pre-fetch"...
>>>>>
>>>>>
>>>>I think I start to get the idea now. Say I am using a hash table of a million
>>>>entries, and I have, say, 4 pieces of information to store in each entry I use.
>>>>Then I do better to have one array of a million single, multi-purpose records
>>>>than to have 4 separate arrays of a million entries each, one for each type of
>>>>information. That way, when I access the first piece of information, I
>>>>"pre-fetch" the others at the same time, whereas with four separate arrays, the
>>>>information resides in 4 distant locations, right? I never really considered
>>>>this sort of thing before, but maybe I could redesign a lot of my arrays in the
>>>>light of the cache usage.
>>>>
>>>>Best wishes,
>>>>Roberto
>>>>
>>>
>>>correct... also if you have a structure with something like this:
>>>
>>>   int a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,...;
>>>
>>>make sure that you put the ones you use in the same small piece of
>>>code adjacent in the structure, so that if you use "a" at line 10, it
>>>will load b-h at the same time.  If you use those while they are still
>>>in cache, you save a bunch of cycles waiting on another cache line fill.
>>
>>I do not understand; say I have four 8-bit values in a row, and I want to fetch
>>the first, will the other three come along too? In C++, how would I pick up all
>>four values at once? Once I do get them, in C++, how would I be sure they are
>>still in cache when I want to use them (assuming I want to use them in the same
>>small piece of code, or maybe the same function)?
>>
>>James
>>
>
>
>Here's the way to think about this:  Imagine that memory is broken up into
>32-byte chunks, starting at address 0.  So you have chunk 0 = bytes 0-31,
>then chunk 1 = bytes 32-63, and so forth for *all* of memory.  When you
>reference *any* byte in a chunk, the entire chunk is loaded into L2 cache
>in one "burst" transmission (4 bus cycles).  Since this happens, you should
>group things in "clumps" of 32 bytes so that once you use any of those
>bytes, you use the others quickly (while they are in cache) as well, as there
>won't be any delays for any but the first byte in a "clump" (clump = cache
>line for the technically correct term)...
>

I still have problems understanding. Suppose you have four 8-bit values starting
at address 0. If you access the first, then the whole "clump" of 32 bytes is
loaded into the L2 cache. Ok, that makes sense.

Suppose you access the second byte in chunk 0. Does the computer load all of
chunk 0 (bytes 0-31) into the cache, or does it load the 32 bytes starting at
the accessed byte (bytes 1-32)? Thanks for the help!

James




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.