Author: Robert Hyatt
Date: 07:27:32 12/05/02
On December 05, 2002 at 03:24:02, Tony Werten wrote:

>On December 04, 2002 at 23:23:32, Robert Hyatt wrote:
>
>>On December 04, 2002 at 21:58:27, Jeremiah Penery wrote:
>>
>>>On December 04, 2002 at 21:13:40, Matt Taylor wrote:
>>>
>>>>On December 04, 2002 at 20:29:52, Bob Durrett wrote:
>>>>
>>>>>The recent threads shed some light on the issue of when one is more
>>>>>important than another, but the answer is sketchy and seems to be
>>>>>"depends."
>>>>>
>>>>>For current chess-playing programs, which is more important? Latency or
>>>>>bandwidth? Why?
>>>>>
>>>>>Is the answer different if multiple processors are used?
>>>>>
>>>>>Bob D.
>>>>
>>>>The answer is always "depends." It depends on how you access memory, how
>>>>much memory you access, and how often you access memory.
>>>>
>>>>I'm going to make the simplification here that the CPU accesses memory
>>>>directly; some of the work done here is actually part of the chipset, but
>>>>that's just a technical detail and doesn't change any of the conclusions.
>>>>
>>>>For an algorithm to be sensitive to bandwidth, it must be accessing memory
>>>>(almost) serially. When the CPU issues a read/write request to main memory,
>>>>it sends the address in two pieces: the row and the column. Sometimes the
>>>>row and column bits are mangled for performance, but for simplicity let's
>>>>assume that the row is the upper half of an address and the column is the
>>>>lower half.
>>>>
>>>>The CPU doesn't actually transmit both row -and- column every time it
>>>>accesses memory. The memory module has a row register that remembers which
>>>>row you accessed previously. This isn't just an optimization, either; it
>>>>reduces power requirements and has some other interesting effects for EE
>>>>people. Anyway, when the row changes, the module is forced to "close" the
>>>>current row and "open" the other row. The open process takes some time, as
>>>>the cells in the row must be precharged.
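[Editor's note: the row-buffer behavior described above can be sketched with a toy model. The 8-bit row/column split, address width, and access counts below are illustrative assumptions, not the geometry of any real memory module.]

```python
# Toy model of a DRAM row buffer: count how often an access pattern
# forces the module to close the open row and precharge a new one.
import random

ROW_BITS = 8  # assumed: low 8 address bits select the column, the rest the row

def count_row_opens(addresses):
    """Count accesses that change the open row (each one pays the open/precharge latency)."""
    open_row = None
    opens = 0
    for addr in addresses:
        row = addr >> ROW_BITS
        if row != open_row:   # row miss: close current row, precharge, open new one
            opens += 1
            open_row = row
    return opens

sequential = list(range(4096))                            # streams through 16 rows
rng = random.Random(42)                                   # seeded for repeatability
scattered = [rng.randrange(1 << 16) for _ in range(4096)] # random rows

print(count_row_opens(sequential))  # 16: one open per 256-column row
print(count_row_opens(scattered))   # near 4096: almost every access changes rows
```

The serial pattern pays the row-open cost 16 times in 4096 accesses and is limited mostly by bandwidth; the random pattern pays it on nearly every access and is limited by latency, which is the distinction the post is drawing.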
>>>>Avoiding a row change makes memory access faster. The column works in a
>>>>similar fashion. The CL value for RAM is the CAS (column address strobe)
>>>>latency, the latency of changing the column address.
>>>>
>>>>Now, if you're accessing memory randomly, or in some fashion that requires
>>>>the row or column to change, you will often incur the CAS latency, the RAS
>>>>latency, or both. This would make your algorithm latency-dependent.
>>>>
>>>>When multiple processors are used, the answer is a little more obscure. Now
>>>>that both processors are competing for the same memory, each has less
>>>>bandwidth. Does the algorithm spend a -lot- of time in between each memory
>>>>access? At the same time, the interleaved memory accesses from the two
>>>>processors usually change the row and column, so the latency is incurred on
>>>>many cycles.
>>>>
>>>>Notably, though, not all SMP systems are shared-bus. The upcoming x86-64
>>>>Opteron chips from AMD include a bus per CPU.
>>>
>>>Current AthlonMP chipsets also have a separate bus per CPU. They use the
>>>same EV6 bus as Alpha processors did (or still do?). The memory modules are
>>>shared, whereas Hammer will have separate memory modules for each processor.
>>
>>The problem with that is it turns into a NUMA architecture, which has its
>>_own_ set of problems. One CPU connected to one memory module means that the
>>other CPU can't get to it as efficiently...
>
>IIRC they created a new buzzword for that: HyperTransport. I haven't seen any
>tests yet of how well it really works, but it should improve the bandwidth.
>
>Tony

Yes, but if each CPU is directly connected to only part of the memory, then the
latency to the other part of memory is going to go up, which is a bad thing. In
some NUMA machines it goes up by a factor of 5-10X, which is a killer.

>
>>IE this doesn't offer one tiny bit of improvement over an SMP-type machine
>>with shared memory... unless the algorithm is specifically designed to
>>localize memory references and to duplicate data that is often needed by
>>both threads...
>>
>>This might be an improvement for running two programs at once. For one
>>program using two processors, NUMA offers additional challenges for the
>>parallel programmer...
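[Editor's note: the locality argument can be put in rough numbers with a back-of-the-envelope model. The 100 ns local latency and 5x remote penalty below are assumptions for illustration, the penalty loosely following the 5-10X NUMA figure mentioned above.]

```python
# Back-of-the-envelope NUMA cost model (hypothetical latencies, not measured).
LOCAL_NS = 100   # assumed latency to the CPU's own memory node
REMOTE_NS = 500  # assumed 5x penalty for reaching the other node's memory

def avg_latency_ns(local_fraction):
    """Average access latency given the fraction of references that stay local."""
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

print(avg_latency_ns(0.5))   # 300.0: references spread evenly across both nodes
print(avg_latency_ns(0.95))  # 120.0: hot data localized/duplicated per thread
```

Under these assumed numbers, a program that leaves its references spread evenly across both nodes averages 3x the local latency, while one that localizes and duplicates its hot data gets within 20% of a pure shared-memory SMP, which is the point being made about algorithm design for NUMA.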
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.