Author: Robert Hyatt
Date: 14:05:00 07/17/03
On July 17, 2003 at 09:16:10, Dieter Buerssner wrote:

>I use an inner loop that just translates to a stream of move-memory-to-register
>instructions (one for each access). Here are some results (source at the end of
>the posting, not well tested, please report the errors/flaws ...)

The main flaw is that you are not testing "memory latency" here. If you look at how the x86 does virtual memory, it is a two-level memory lookup. To avoid this penalty, the TLB holds recent virtual-to-real address mappings, but the TLB is not huge. On my dual Xeon, lmbench reports that the TLB holds the most recent 62 virtual-to-real translations.

What you are measuring is at least _two_ memory latency cycles: one or two to do the virtual-to-real address translation, then another to actually fetch the data. To compute a _real_ raw memory latency number, you have to avoid overwriting the TLB too badly. Otherwise the latency is inflated by MMU overhead that doesn't hit "normal applications" that badly.
>C:\yace\vincent>dblat 300000000
>Setting up a random access pattern, may take a while
>Finished
>Random access: 30.864 s, 308.640 ns/access
>Testing same pattern again
>Random access: 30.744 s, 307.440 ns/access
>Setting up a different random access pattern, may take a while
>Finished
>Random access: 30.735 s, 307.350 ns/access
>Testing same pattern again
>Random access: 30.724 s, 307.240 ns/access
>Sequential access offset 1: 0.310 s, 3.100 ns/access
>Sequential access offset 2: 0.601 s, 6.010 ns/access
>Sequential access offset 4: 1.182 s, 11.820 ns/access
>Sequential access offset 8: 2.263 s, 22.630 ns/access
>Sequential access offset 16: 4.847 s, 48.470 ns/access
>Sequential access offset 32: 17.315 s, 173.150 ns/access
>Sequential access offset 64: 16.984 s, 169.840 ns/access
>Sequential access offset 128: 17.475 s, 174.750 ns/access
>Sequential access offset 256: 18.296 s, 182.960 ns/access
>Sequential access offset 512: 19.688 s, 196.880 ns/access
>Sequential access offset 1024: 22.513 s, 225.130 ns/access
>Sequential access offset 2048: 23.013 s, 230.130 ns/access
>Sequential access offset 4096: 22.883 s, 228.830 ns/access
>Sequential access offset 8192: 23.603 s, 236.030 ns/access
>Sequential access offset -1: 0.330 s, 3.300 ns/access
>Sequential access offset -2: 0.620 s, 6.200 ns/access
>Sequential access offset -4: 1.222 s, 12.220 ns/access
>Sequential access offset -8: 2.453 s, 24.530 ns/access
>Sequential access offset -16: 4.847 s, 48.470 ns/access
>Sequential access offset -32: 17.355 s, 173.550 ns/access
>Sequential access offset -64: 16.944 s, 169.440 ns/access
>Sequential access offset -128: 17.455 s, 174.550 ns/access
>Sequential access offset -256: 18.206 s, 182.060 ns/access
>Sequential access offset -512: 19.538 s, 195.380 ns/access
>Sequential access offset -1024: 22.282 s, 222.820 ns/access
>Sequential access offset -2048: 22.663 s, 226.630 ns/access
>Sequential access offset -4096: 22.653 s, 226.530 ns/access
>Sequential access offset -8192: 23.444 s, 234.440 ns/access
>
>Vincent's program reports 325 ns, which is not too far off from the
>random access number.
>
>C:\yace\vincent>dblat 100000000
>Setting up a random access pattern, may take a while
>Finished
>Random access: 24.315 s, 243.150 ns/access
>Testing same pattern again
>Random access: 24.175 s, 241.750 ns/access
>Setting up a different random access pattern, may take a while
>Finished
>Random access: 24.165 s, 241.650 ns/access
>Testing same pattern again
>Random access: 24.174 s, 241.740 ns/access
>Sequential access offset 1: 0.320 s, 3.200 ns/access
>Sequential access offset 2: 0.601 s, 6.010 ns/access
>Sequential access offset 4: 1.162 s, 11.620 ns/access
>Sequential access offset 8: 2.263 s, 22.630 ns/access
>Sequential access offset 16: 4.857 s, 48.570 ns/access
>Sequential access offset 32: 17.345 s, 173.450 ns/access
>Sequential access offset 64: 16.974 s, 169.740 ns/access
>Sequential access offset 128: 17.456 s, 174.560 ns/access
>Sequential access offset 256: 18.126 s, 181.260 ns/access
>Sequential access offset 512: 19.509 s, 195.090 ns/access
>Sequential access offset 1024: 22.252 s, 222.520 ns/access
>Sequential access offset 2048: 22.753 s, 227.530 ns/access
>Sequential access offset 4096: 22.772 s, 227.720 ns/access
>Sequential access offset 8192: 23.353 s, 233.530 ns/access
>Sequential access offset -1: 0.330 s, 3.300 ns/access
>Sequential access offset -2: 0.631 s, 6.310 ns/access
>Sequential access offset -4: 1.242 s, 12.420 ns/access
>Sequential access offset -8: 2.454 s, 24.540 ns/access
>Sequential access offset -16: 4.827 s, 48.270 ns/access
>Sequential access offset -32: 17.364 s, 173.640 ns/access
>Sequential access offset -64: 16.964 s, 169.640 ns/access
>Sequential access offset -128: 17.465 s, 174.650 ns/access
>Sequential access offset -256: 18.106 s, 181.060 ns/access
>Sequential access offset -512: 19.498 s, 194.980 ns/access
>Sequential access offset -1024: 22.262 s, 222.620 ns/access
>Sequential access offset -2048: 22.682 s, 226.820 ns/access
>Sequential access offset -4096: 22.652 s, 226.520 ns/access
>Sequential access offset -8192: 23.433 s, 234.330 ns/access
>
>Vincent: 256 ns
>Note, random access is faster than before.
>I get similar numbers for smaller sizes bigger than the cache.
>
>One final example, for everything in L2 cache:
>
>C:\yace\vincent>dblat 250000
>Setting up a random access pattern, may take a while
>Finished
>Random access: 0.751 s, 7.510 ns/access
>Testing same pattern again
>Random access: 0.751 s, 7.510 ns/access
>Setting up a different random access pattern, may take a while
>Finished
>Random access: 0.751 s, 7.510 ns/access
>Testing same pattern again
>Random access: 0.751 s, 7.510 ns/access
>Sequential access offset 1: 0.100 s, 1.000 ns/access
>Sequential access offset 2: 0.120 s, 1.200 ns/access
>Sequential access offset 4: 0.180 s, 1.800 ns/access
>Sequential access offset 8: 0.481 s, 4.810 ns/access
>Sequential access offset 16: 0.751 s, 7.510 ns/access
>Sequential access offset 32: 0.751 s, 7.510 ns/access
>Sequential access offset 64: 0.771 s, 7.710 ns/access
>Sequential access offset 128: 0.751 s, 7.510 ns/access
>Sequential access offset 256: 0.762 s, 7.620 ns/access
>Sequential access offset 512: 0.751 s, 7.510 ns/access
>Sequential access offset 1024: 0.751 s, 7.510 ns/access
>Sequential access offset 2048: 0.761 s, 7.610 ns/access
>Sequential access offset 4096: 0.811 s, 8.110 ns/access
>Sequential access offset 8192: 0.751 s, 7.510 ns/access
>Sequential access offset -1: 0.120 s, 1.200 ns/access
>Sequential access offset -2: 0.150 s, 1.500 ns/access
>Sequential access offset -4: 0.381 s, 3.810 ns/access
>Sequential access offset -8: 0.751 s, 7.510 ns/access
>Sequential access offset -16: 0.741 s, 7.410 ns/access
>Sequential access offset -32: 0.741 s, 7.410 ns/access
>Sequential access offset -64: 0.741 s, 7.410 ns/access
>Sequential access offset -128: 0.771 s, 7.710 ns/access
>Sequential access offset -256: 0.741 s, 7.410 ns/access
>Sequential access offset -512: 0.752 s, 7.520 ns/access
>Sequential access offset -1024: 0.751 s, 7.510 ns/access
>Sequential access offset -2048: 0.751 s, 7.510 ns/access
>Sequential access offset -4096: 0.861 s, 8.610 ns/access
>Sequential access offset -8192: 0.811 s, 8.110 ns/access
>
>Summary: There is more than just an lmbench number. Actually,
>the comment in the lmbench source suggests that they
>wanted to measure the random access times.
>
>I don't want to argue about the definition of the "real" memory
>latency. But for chess programs/hash, the Vincent-type number
>is the most interesting.
>
>Regards,
>Dieter
>
>/* dblat.c
> * In its current form, it will not work with memory sizes
> * of 4 GB or bigger,
> * but that can easily be fixed by changing the PRNG.
> * Run as "dblat memory_size_in_bytes"
> */
>#include <stdio.h>
>#include <stdlib.h>
>#include <time.h>
>
>#define N_LOOPS 100000000UL
>
>double time_stamp(void)
>{
>  /* Use whatever timing method you like */
>  return (double)clock()/CLOCKS_PER_SEC;
>}
>
>void *access_loop(void **buf)
>{
>  size_t n;
>  void **p = buf;
>
>  /* Unrolled by 10, change to your liking */
>  n = N_LOOPS/10; /* We don't care about a possible remainder */
>  do
>  {
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>    p = (void **)*p;
>  }
>  while (--n != 0);
>  return (void *)p;
>}
>
>void time_access(const char *prompt, void **buf)
>{
>  double ts = time_stamp();
>  access_loop(buf);
>  ts = time_stamp()-ts;
>  printf("%s: %7.3f s, %7.3f ns/access\n", prompt, ts, ts/N_LOOPS*1e9);
>}
>
>void setup_seq(void **buf, size_t n, int offset)
>{
>  size_t i, off;
>  /* Map a possibly negative offset into [0, n) first, so the
>     unsigned modulo below is well defined */
>  off = (offset >= 0) ? (size_t)offset % n : n - (size_t)-offset % n;
>  for (i=0; i<n; i++)
>    buf[i] = buf+((i+off)%n); /* No need to opt. away the % for init */
>}
>
>/* Vinc 325, 254 */
>/* PRNG only for 32-bit pointers */
>
>#define MY_RAND_MAX 0xffffffffUL
>#define MY_RAND() mwc1616()
>static unsigned long zseed = 0x12345678UL;
>static unsigned long wseed = 0x87654321UL;
>
>/* Combination of 2 multiply-with-carry generators,
>   just because it does not need much source code */
>static unsigned long mwc1616(void)
>{
>  unsigned long t = zseed;
>  zseed = 30903*(t&0xffff)+(t>>16);
>  t = wseed;
>  wseed = 18000*(t&0xffff)+(t>>16);
>  return ((wseed<<16)&0xffffffffUL) + (zseed&0xffff);
>}
>
>/* Do it as carefully as possible */
>/* If you have 64-bit pointers, and unsigned long is
>   smaller than 64 bits, and you want to test memory sizes >=
>   4 GB, this has to be changed */
>static unsigned long rand_range(unsigned long range)
>{
>  unsigned long rmax, r, d;
>  /* Find the largest number rmax <= MY_RAND_MAX for which
>     (rmax+1) % range == 0.
>     All returns from rand() > rmax will be skipped, to guarantee
>     equal probability for all return values. */
>  d = (MY_RAND_MAX+1U-range) / range + 1; /* Note, the overflow is ok */
>  rmax = d * range - 1; /* -1 to avoid "overflow to zero" */
>  do
>    r = MY_RAND();
>  while (r > rmax);
>  return r/d;
>}
>
>void setup_random(void **buf, size_t n)
>{
>  size_t i, r;
>  void *tmp;
>  setup_seq(buf, n, 1);
>  for (i=n-1; i>0; i--)
>  {
>    do
>    {
>      r = rand_range(i+1);
>      tmp = buf[r];
>    }
>    while (tmp == buf+i); /* Can this happen? */
>    buf[r] = buf[i];
>    buf[i] = tmp;
>  }
>}
>
>int main(int argc, char *argv[])
>{
>  int offset;
>  size_t memsiz, n;
>  void **buf;
>  char prompt[256];
>  if (argc != 2)
>    return EXIT_FAILURE;
>  memsiz = atol(argv[1]);
>  n = memsiz/sizeof *buf;
>  buf = malloc(memsiz);
>  if (buf == NULL)
>    return EXIT_FAILURE;
>
>  printf("Setting up a random access pattern, may take a while\n");
>  setup_random(buf, n);
>  printf("Finished\n");
>  sprintf(prompt, "Random access");
>  time_access(prompt, buf);
>  printf("Testing same pattern again\n");
>  time_access(prompt, buf);
>  printf("Setting up a different random access pattern, may take a while\n");
>  setup_random(buf, n);
>  printf("Finished\n");
>  time_access(prompt, buf);
>  printf("Testing same pattern again\n");
>  time_access(prompt, buf);
>
>  for (offset=1; offset <= 8192 && offset < n; offset*=2)
>  {
>    setup_seq(buf, n, offset);
>    sprintf(prompt, "Sequential access offset %5d", offset);
>    time_access(prompt, buf);
>  }
>  for (offset=-1; offset >= -8192 && -offset < n; offset*=2)
>  {
>    setup_seq(buf, n, offset);
>    sprintf(prompt, "Sequential access offset %5d", offset);
>    time_access(prompt, buf);
>  }
>
>  free(buf);
>  return EXIT_SUCCESS;
>}
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.