Computer Chess Club Archives


Subject: Re: Who can update about new 64 bits chip?

Author: Robert Hyatt

Date: 07:54:08 08/25/02


On August 25, 2002 at 09:46:01, Vincent Diepeveen wrote:

>On August 24, 2002 at 19:23:23, leonid wrote:
>
>>Hi,
>
>>It looks like an affordable 64-bit chip will come soon, but how soon? If
>>somebody can give an update on what to expect from this chip, that would be nice. Every
>>technical detail about the next 64-bit chip is welcome, like the number of registers
>>and every new and special feature of the new chip that you know about. Also your
>>impression of how the new chip will influence programming in general. Is Linux
>>ready for the new 64-bit arrival...
>
>All OSes will be ready long before it is on the market, as I don't doubt they
>have been testing both Windows and Linux on it for over a year now.
>
>Little has been released about Hammer (let's skip McKinley, as
>that price range is more than my car is worth).
>
>What was released a few days ago regarding Hammer
>is that simple operations like add, which cost about 4 cycles of latency
>on the K7, cost only 3 cycles of latency on the Hammer in 32-bit
>mode. In 64-bit mode it is 4 cycles of latency.
>
>There is a good reason to assume the 64-bit instructions are 33% slower
>than the 32-bit ones.

This is thinking about it the _wrong_ way.  If you really _need_ to do a
64-bit operation, then Hammer is not 33% slower, it is almost twice as fast,
because it needs one 4-cycle instruction, while doing the same work with two
32-bit instructions on a 32-bit machine would take roughly twice as long...

The challenge on a 64-bit machine is _not_ in running a 32-bit program on it.
That would be pointless.  The challenge is really using 64-bit words so that
the 2x data density inside the CPU actually gains you something.
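
To make that concrete, here is a minimal sketch (not Crafty's actual code; the
names are invented for illustration) of the same bitboard AND done once on a
64-bit machine versus in two halves on a 32-bit machine.  Even at 4 cycles of
latency versus 3, one instruction still beats two:

    #include <stdint.h>

    /* 64-bit mode: one AND, one instruction. */
    uint64_t attacks_and_64(uint64_t attacks, uint64_t occupied) {
        return attacks & occupied;
    }

    /* 32-bit mode: the same bitboard split into two halves, two ANDs,
       plus the bookkeeping of carrying two words around everywhere. */
    typedef struct { uint32_t lo, hi; } bitboard32;

    bitboard32 attacks_and_32(bitboard32 attacks, bitboard32 occupied) {
        bitboard32 r;
        r.lo = attacks.lo & occupied.lo;
        r.hi = attacks.hi & occupied.hi;
        return r;
    }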




>
>Obviously different versions of the Hammer will be released for different
>markets. The default Hammer is going to run dual, but a version with way
>more L2 cache will run up to 8 processors (if I understand all this well).
>
>Kind of a 'Xeon' versus 'P3' principle (the P4 only runs dual at most).
>
>I wouldn't count on the 1 to 2 MB L2 cache Hammers being within price
>range of the average poster here at CCC. Hopefully 8-CPU-capable versions
>will be there with 256 KB of L2 cache.

The reason for the larger caches on Xeons is pretty obvious.  It is a way to
offset the memory bandwidth issue by locating more data right beside the CPU
rather than hundreds of nanoseconds away in RAM.

You have two choices: a bigger cache or more memory bandwidth.  The former
is actually far cheaper than the latter.


>
>The multiprocessor versions of the Hammer will be what most people guess:
>it is more like a NUMA system. Getting data from local memory is very
>fast (faster than RAM is now), but on the other hand getting data from
>other nodes is a lot slower.

I don't think local memory is going to be faster.  DRAM still has a definite
latency that can't be reduced so long as capacitors are the choice to store
bits of info.

I also don't think that low-end dual/quad/8-way boxes are going to be NUMA.
NUMA won't keep the price low enough.  NUMA switches are non-trivial in terms
of cost, even for a small number of connections, because the switch has to be
scalable to larger numbers of processors for higher performance.

>
>So crafty needs a complete redesign in order to be faster on 8 processors
>than it is on 1 on such a system. I'm taking this example because crafty,
>in comparison to other programs, is getting a decent speedup at 4
>processors with fast memory access right now.
>



It really doesn't need a complete re-design.  The current approach is actually
not that far off from what is needed.  The real issues for _any_ program such as
crafty are as follows:

1.  Probably a good bit of data replication, i.e. the attack generation tables,
evaluation tables, and any memory that is "constant" in nature for a search.
Replicating this gets the frequently used constant data as close to the CPU as
is physically possible, so that there is no penalty whatsoever for this kind of
data, which is actually fairly large.  Things like Zobrist random numbers also
fit this.  (A rough sketch of this and of point 2 follows after this list.)

2.  Changing the way split blocks are allocated.  They will need to be allocated
scattered all over the NUMA cluster, rather than in one chunk in one processor's
local memory.  Doing this isn't too hard.

3.  Changing how splits are done.  When a processor splits, the split block is
going to be in its local memory, which is logical.  It needs to split with
processors that are "close" to it on the NUMA switch, because they will also
need to access the data, and it needs to be as close to them as is physically
possible to reduce latency.  I already have some controls for this, but I would
probably add a "processor directory" for each processor, showing each the best
group of processors to split with (also sketched below)...

4.  Doing 3 also solves a lot of issues with locks.  Locks need to be as close
to a NUMA processor as physically possible, for obvious reasons.

Doing those things is (a) not really that difficult and (b) will produce a
pretty reasonable NUMA application.  I've been working with Compaq off and on,
studying their NUMA Alpha implementation.
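
Here is a rough sketch of points 1 and 2, using the Linux libnuma calls as one
way to place memory on a specific node; the table and pool names are invented
for illustration and are not Crafty's actual data structures:

    #include <string.h>
    #include <numa.h>     /* Linux libnuma: numa_alloc_onnode() etc. */

    #define TABLE_SIZE      (64 * 1024)   /* illustrative size of a "constant" table */
    #define MAX_NODES       16
    #define BLOCKS_PER_NODE 64

    /* Point 1: replicate read-only tables (attack tables, Zobrist keys, ...)
       so every node has a copy in its own local memory. */
    static char  master_table[TABLE_SIZE];   /* the "constant" data, filled once at startup */
    static void *node_table[MAX_NODES];      /* one replica per NUMA node */

    /* Point 2: give every node its own pool of split blocks, instead of one
       big array sitting in a single processor's local memory. */
    typedef struct split_block { /* ... search state ... */ int busy; } split_block;
    static split_block *node_pool[MAX_NODES];

    int numa_setup(void) {
        if (numa_available() < 0) return -1;          /* no NUMA support in the kernel */
        int nodes = numa_max_node() + 1;
        for (int n = 0; n < nodes && n < MAX_NODES; n++) {
            node_table[n] = numa_alloc_onnode(TABLE_SIZE, n);
            node_pool[n]  = numa_alloc_onnode(BLOCKS_PER_NODE * sizeof(split_block), n);
            if (!node_table[n] || !node_pool[n]) return -1;
            memcpy(node_table[n], master_table, TABLE_SIZE);
            memset(node_pool[n], 0, BLOCKS_PER_NODE * sizeof(split_block));
        }
        return nodes;
    }

A thread running on node n then reads node_table[n] and grabs split blocks from
node_pool[n], so the constant data and the search state it touches most often
sit in its own local memory.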

The main issue for NUMA is to try to figure out _how_ to accomplish something
that will work, rather than trying to figure out why it is not going to be
efficient.  NUMA can work with some thought; I say that from experience.  Lots of
experience.
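
And a rough sketch of the "processor directory" idea from points 3 and 4; the
names and sizes are invented for illustration:

    #define MAX_CPUS 16

    /* For each processor, the other processors sorted by NUMA distance,
       nearest first (same node, then neighbouring nodes).  Built once at
       startup from the machine topology. */
    typedef struct {
        int preferred[MAX_CPUS];
        int count;
    } split_directory;

    static split_directory dir[MAX_CPUS];

    /* When CPU 'me' wants to split, offer the split point to idle CPUs in
       preference order, so the helpers end up close to the split block and
       to the lock that guards it (both live in 'me's local memory). */
    int pick_helper(int me, const int *idle) {   /* idle[cpu] != 0 means available */
        for (int i = 0; i < dir[me].count; i++) {
            int cpu = dir[me].preferred[i];
            if (idle[cpu]) return cpu;
        }
        return -1;   /* nobody close enough is idle right now */
    }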

>It is not difficult to show that one needs a complete redesign. Up to
>4 processors you can still get away with a few small changes, but
>with 8 processors you won't.



This depends.  There are plenty of 8- and 16-CPU non-NUMA machines around.
I have run on several with no notable issues.  Not _every_ machine with 8 or
more processors will be NUMA.  Many will be, but some won't.  Fortunately,
what works on NUMA can easily be tuned to work even better when the architecture
is not NUMA.



>
>>Cheers,
>>Leonid.


