Computer Chess Club Archives


Subject: Re: DIEP NUMA SMP at P4 3.06GHz with Hyperthreading

Author: Matt Taylor

Date: 21:51:29 12/14/02



On December 14, 2002 at 13:44:00, Robert Hyatt wrote:

>On December 14, 2002 at 02:05:05, Matt Taylor wrote:
>
>>On December 13, 2002 at 22:56:17, Robert Hyatt wrote:
>>
>>>On December 13, 2002 at 16:41:07, Vincent Diepeveen wrote:
>>>
>>>>On December 13, 2002 at 16:03:47, Robert Hyatt wrote:
>>>>
>>>>your math is wrong for many reasons.
>>>>
>>>>0) it isn't 32 bytes but 64 bytes that you get at once
>>>>   guaranteed.
>>>
>>>Depends on the processor.  For PIII and earlier, it is _32_ bytes.  For
>>>the PIV it is 128 bytes.  I think AMD is 64...
>>>
>>>
>>>>1) you can guarantee that you need just one cacheline
>>>
>>>Yes you can, by making sure you probe to a starting address that is
>>>divisible by the cache line size exactly.  Are you doing that?  Are
>>>you sure your table is initially aligned on a multiple of cache line
>>>size?  Didn't think so.  You can't control malloc() that well yet...
>>>And it isn't smart enough to know it should do that, particularly when
>>>the alignment is processor dependent.
>>
>>You can detect that alignment. As for aligning with malloc, it is an easy trick.
>>
>>malloc(x) => (malloc(x + align - 1) + align - 1) & ~(align - 1)
>
>If you look at the crafty source you will see that I already do this.  But it
>takes a specific bit of programming to make it happen.  I'd bet that Vincent
>didn't do this previously since he didn't mention it.
>
>(utility.c, look for first malloc).
>
>Note that you do need to save the original (unmolested) address so that you can
>later free() it if you need to.
>
>>
>>>>2) even if you need 2, then those aren't 400 clocks each
>>>>   cache line but the first one is 400 and the second
>>>>   one is only a very small part of that (consider the
>>>>   bandwidth the memory delivers!)
>>>
>>>Try again.  You burst load one cache line and that is all.  The first 8
>>>bytes come across after a long latency. The rest of the line bursts in
>>>quickly.  For the next cache miss, back you go to the long latency.
>>
>>Actually, you probably won't incur much latency at all. The latency is based on
>>the assumption that RAS and CAS will have to be re-latched into the memory.
>>Locality of data is more efficient.
>
>
>Perhaps, but beyond 128 bytes?  It wasn't so in the memory specs I looked at,
>as they don't just ramp things out of the DRAM forever...

No, what I mean is that you have a RAS latency and a CAS latency. RAS latency is
something like 7 bus clocks. CAS latency is 2-2.5 (effectively 3) bus clocks.
The latency figure of 10 bus clocks is RAS + CAS. If you don't have to re-latch
RAS, you don't incur the 7 bus clock penalty on it.

Two random probes will probably lie in two different rows. The latency overhead
will (probably) be 20 bus clocks. Even if you have to load a second cache line,
the locality would limit it to ~13 bus clocks latency. That's a theoretical 35%
speed-up.

Also, you can have the processor fetch the next cache line using the prefetch
instruction. I'm not sure how useful it would be in this situation, but you can
sometimes hide all of the latency from the next sequential probe.

-Matt




Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.