Computer Chess Club Archives




Subject: Re: Matt Taylor's magic de Bruijn Constant

Author: Robert Hyatt

Date: 06:30:36 07/15/03


On July 15, 2003 at 06:20:28, Vincent Diepeveen wrote:

>On July 14, 2003 at 15:33:37, Gerd Isenberg wrote:
>>On July 14, 2003 at 10:54:49, Vincent Diepeveen wrote:
>>>On July 13, 2003 at 17:10:10, Russell Reagan wrote:
>>>>On July 13, 2003 at 13:17:56, Bas Hamstra wrote:
>>>>>It is used *extremely* intensively. Therefore I assumed that most of the time the
>>>>>table sits in cache. But apparently not... Makes you wonder about other simple
>>>>>lookups. A lot of 10-cycle penalties, it seems.
>>>>Hi Bas,
>>>>Why do you say "10 cycles"? I thought memory latency was many more cycles (~75 -
>>>A random read from memory on a dual P4 or dual K7 takes nearly 400 nanoseconds.
>>>At 2 GHz that's around 800 cycles.
>>>Best regards,
>>Hi Vincent,
>>puhh... that's about 1/2 microsecond. I remember the days of the
>>2 MHz 8085 or Z80 CPUs - I can't believe it. A few questions...
>>I'm not familiar with dual architectures. Is it a kind of shared memory via
>>the PCI bus? How do you access such RAM - are there some alloc-like API functions?
>>What happens if one processor writes this memory through its cache - what about
>>possible cached copies of that address in the other processor? In general, how
>>do the several processor caches synchronize?
>>I guess each processor has its own local main memory.
>>Do you know the read latencies of a single-processor P4 or K7 with state-of-the-art
>>chipsets?
>Yes, single-CPU P4s were tested at around 280 ns here.
>For random reads. Mark my words. I will ship you source code to test it.
>You can see for yourself then.
>It really is that bad.
>This is why for future Itanium processors Intel planned to create huge L3 caches,
>up to 24MB or something, in a few years.
>>1.) if data is already in 1. level cache
>I have to quote from memory now: the P4 needs 2 cycles and the K7 3.
>>2.) if data is in 2. level cache but not in 1.
>Depends upon the processor. For example, the diagram shown by Priestly at the
>conference (a marketing manager from Intel) shows the following for the Itanium2
>Madison with 6MB L3 cache (which is the fastest processor they have; the 1.3
>budget processors are worse than this):
>L3 cache  3/6MB  128B CL  24-way  14-17 clks  (arrows from/to L2 cache)
>L2 cache  256KB  128B CL   8-way    5-7 clks  (arrows from/to L1 cache)
>L1I        16KB   64B CL              1 clk
>L1D        16KB   64B CL              1 clk
>I have his presentation sheets in PDF format here. If you want, I can ship them
>to you.
>>3.) in worst case, if data is only in main memory but in no cache
>That depends upon whether the line to memory is already open, so sequential
>reads have *considerably* lower latency than random reads.
>For a single-CPU P4, a random lookup in memory is around 300 nanoseconds.
>For a dual it's nearly 400 nanoseconds, as I measured.
>If you want, you can have this source code to measure the average lookup
>speed. It gets influenced so much by random lookups into memory that case 3,
>random lookups, dominates the results there.
>I am talking about measured values. Not guessed values like the ones Hyatt talks
>about.

You are talking about utter crap, not measured values.  lm-bench is the
industry-accepted measure of random-access memory latency, memory bandwidth,
cache bandwidth, cache latency, and everything else.

Ramble all you want.  _anybody_ can look up lm-bench on the web and see
that what you are writing is utter nonsense.

All they have to do is _look_.

I wouldn't trust a program you wrote to add 2+2 correctly, much less try to
measure any sort of latency.  To do that requires hardware understanding
_first_.  Which is sadly lacking on your part.

But rather than continue this, all anybody has to do is to download the
program _everybody_ uses to measure memory performance, and then run it.

>>Thanks in advance,


Last modified: Thu, 07 Jul 11 08:48:38 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.