Computer Chess Club Archives



Subject: Re: Who can update about new 64 bits chip?

Author: Vincent Diepeveen

Date: 09:40:54 08/25/02

On August 25, 2002 at 10:54:08, Robert Hyatt wrote:

>On August 25, 2002 at 09:46:01, Vincent Diepeveen wrote:
>
>>On August 24, 2002 at 19:23:23, leonid wrote:
>>
>>>Hi,
>>
>>>It looks like affordable 64-bit chips will come soon, but how soon? If
>>>somebody can give an update on what to expect from this chip, that would
>>>be nice. Every technical detail about the next 64-bit chips is welcome,
>>>like the number of registers and every new and special feature of the
>>>new chip that you know about, and your impression of the new chip's
>>>influence on everything in programming. Is Linux ready for the new
>>>64-bit arrival...
>>
>>All OSes will be ready long before it is on the market; I don't doubt
>>they have been testing both Windows and Linux on it for over a year now.
>>
>>There has been little released about the Hammer (let's skip McKinley, as
>>that price range is more than my car is worth).
>>
>>What was released a few days ago regarding the Hammer is that a simple
>>operation like add, which has about 4 cycles of latency on the K7, has
>>only 3 cycles of latency on the Hammer in 32-bit mode. In 64-bit mode it
>>is 4 cycles of latency.
>>
>>There is a good reason to assume the 64-bit instructions are 33% slower
>>than the 32-bit ones.
>
>This is thinking about it the _wrong_ way.  If you really _need_ to do a
>64-bit operation, then the Hammer is not 33% slower, it is almost twice as
>fast, because it needs one 4-cycle instruction, while using two 32-bit
>instructions would be almost twice as slow on a 32-bit machine...
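A minimal sketch in C of what that looks like on a 32-bit machine (names
hypothetical): one 64-bit add becomes two 32-bit adds, and the second
depends on the first through the carry, which is exactly the serial chain
a native 64-bit add avoids.

  #include <stdint.h>

  /* One 64-bit add emulated with 32-bit halves, the way a 32-bit CPU
     has to do it. */
  typedef struct { uint32_t lo, hi; } u64_pair;

  u64_pair add64_emulated(u64_pair a, u64_pair b) {
      u64_pair r;
      r.lo = a.lo + b.lo;
      r.hi = a.hi + b.hi + (r.lo < a.lo);  /* +1 if the low half carried out */
      return r;
  }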

From my viewpoint it definitely is not the wrong way. My program is 32-bit.
If I get a new chip I want to see whether I can get faster. Packing a
program's data tighter, such that the source code gets unreadable but a
single add is not only adding a 4 to the low 32 bits but also another
value to the high 32 bits, is obviously not a new trick. But it definitely
makes a program unreadable (e.g. see bitboards).
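As a minimal C sketch of that trick (hypothetical names; it assumes
neither half ever carries into the other, which real packed code has to
guard against):

  #include <stdint.h>

  /* Two independent 32-bit counters packed into one 64-bit word, so a
     single 64-bit add updates both at once. */
  uint64_t packed;   /* high 32 bits: counter B, low 32 bits: counter A */

  void add_both(uint32_t add_a, uint32_t add_b) {
      packed += ((uint64_t)add_b << 32) | add_a;
  }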

Now that chess programs are getting bigger and bigger, at least mine is,
the issue of readability is very important. If you see a chess pattern,
then the code must be trivial to read.

That doesn't take away that for me there is a win in the Hammer running
32-bit versus 64-bit, as it's 3 cycles versus 4.

If your data structure was first slowed down in order to get faster now,
I won't stop you...

>The challenge on a 64 bit machine is _not_ in running a 32 bit program on it.
>That would be pointless.  The challenge is really using 64 bit words so that
>the 2x data density inside the CPU actually gains you something.
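A minimal sketch of what that density buys for bitboards (hypothetical
function, single pawn pushes only): one 64-bit word holds a bit per
square, so one shift and one AND process the whole board, where a 32-bit
machine needs two instructions apiece on the two halves.

  #include <stdint.h>

  typedef uint64_t bitboard;   /* bit n set = something on square n */

  /* Squares white pawns can push to: one shift plus one AND covers all
     64 squares at once. */
  bitboard pawn_push_targets(bitboard white_pawns, bitboard empty) {
      return (white_pawns << 8) & empty;
  }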



>>
>>Obviously different versions of the Hammer get released for different
>>markets. The default Hammer is going to run dual, but a version with way
>>more L2 cache will run up to 8 processors (if I understand all this
>>correctly).
>>
>>Kind of the 'Xeon' versus 'P3' principle (the P4 only runs dual at most).
>>
>>I wouldn't count on the 1 to 2 MB L2 cache Hammers being within the price
>>range of the average poster here at CCC. Hopefully 8-CPU-capable versions
>>will be there with 256KB L2 cache.
>
>The reason for the larger caches on Xeons is pretty obvious.  It is a way to
>offset the memory bandwidth issue by locating more data right beside the CPU
>rather than hundreds of nanoseconds away in RAM.
>
>You have two choices.  Bigger cache or more memory bandwidth.  The former
>is actually far cheaper than the latter.
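The tradeoff can be put in numbers with the usual average-memory-access-
time formula; the figures below are illustrative assumptions, not
measurements of any real chip:

  /* AMAT = hit_time + miss_rate * miss_penalty.  A bigger cache lowers
     miss_rate; more memory bandwidth (or lower latency) lowers
     miss_penalty. */
  double amat(double hit_ns, double miss_rate, double miss_ns) {
      return hit_ns + miss_rate * miss_ns;
  }
  /* e.g. amat(3.0, 0.05, 150.0) = 10.5 ns, while halving the miss rate
     gives amat(3.0, 0.025, 150.0) = 6.75 ns. */
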
>
>
>>
>>The multiprocessor versions of the Hammer will be what most guess they
>>are: more like a NUMA system. Getting data from local memory is very
>>fast (faster than RAM is now), but on the other hand getting data from
>>other nodes is a lot slower.
>
>I don't think local memory is going to be faster.  DRAM still has a definite
>latency that can't be reduced so long as capacitors are the choice to store
>bits of info.
>
>I also don't think that low-end dual/quad/8-way boxes are going to be NUMA.
>That won't keep the price low enough.  NUMA switches are non-trivial in terms
>of cost, even for a small number of connections, because the switch has to be
>scalable to larger numbers of processors for higher performance.
>
>>
>>So Crafty needs a complete redesign in order to be faster on 8 processors
>>than it is on 1 on such a system. I'm taking this example because Crafty,
>>in comparison to other programs, is getting a decent speedup on 4
>>processors with fast memory access right now.
>>
>
>
>
>It really doesn't need a complete re-design.  The current approach is actually
>not that far off from what is needed.  The real issues for _any_ program such as
>crafty are as follows:
>
>1.  Probably a good bit of data replication.  I.e. the attack generation
>tables, evaluation tables, any memory that is "constant" in nature for a
>search.  Replicating this gets the frequently used constant data as close
>to the CPU as is physically possible, so that there is no penalty
>whatsoever for this kind of data, which is actually fairly large.  Things
>like Zobrist random numbers also fit this (see the sketch after this
>list).
>
>2.  Changing the way split blocks are allocated.  They will need to be allocated
>scattered all over the NUMA cluster, rather than in one chunk in one processor's
>local memory.  Doing this isn't too hard.
>
>3.  Changing how splits are done.  When a processor splits, the split block is
>going to be in its local memory, which is logical.  It needs to split with
>processors that are "close" to it on the NUMA switch, because they will need to
>also access the data and it needs to be as close to them as is physically
>possible to reduce latency.  I already have some controls for this, but I would
>probably add a "processor directory" for each processor, showing each the best
>group of processors to split with...
>
>4.  Doing 3 also solves a lot of issues with locks.  Locks need to be as close
>to a NUMA processor as physically possible, for obvious reasons.
>
>Doing those things is (a) not really that difficult and (b) will produce a
>pretty reasonable NUMA application.  I've been working with Compaq off and
>on, studying their NUMA alpha implementation.
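A minimal sketch of points 1 and 3 in C, using the Linux libnuma API (the
table size and array bounds are made-up placeholders, and error checks are
omitted; a real program would size these from the hardware):

  #include <numa.h>      /* libnuma; link with -lnuma */
  #include <stdlib.h>
  #include <string.h>

  #define TABLE_BYTES (64 * 1024)   /* hypothetical size of one constant table */
  #define MAX_NODES   64

  static void *local_table[MAX_NODES];             /* one replica per node */
  static int   split_order[MAX_NODES][MAX_NODES];  /* per-node split "directory" */

  /* Point 1: replicate a read-only table (Zobrist keys, attack tables,
     ...) onto every NUMA node so each CPU reads a local copy. */
  void replicate(const void *master) {
      if (numa_available() < 0) return;   /* no NUMA support on this kernel */
      for (int n = 0; n <= numa_max_node(); n++) {
          local_table[n] = numa_alloc_onnode(TABLE_BYTES, n);
          memcpy(local_table[n], master, TABLE_BYTES);
      }
  }

  /* Point 3: for each node, order all nodes by NUMA distance so a split
     prefers the closest helpers first.  numa_distance() reports the
     ACPI SLIT distance (10 means local). */
  static int home_node;                    /* node cmp() sorts for */
  static int cmp(const void *a, const void *b) {
      return numa_distance(home_node, *(const int *)a)
           - numa_distance(home_node, *(const int *)b);
  }

  void build_split_order(void) {
      int nodes = numa_max_node() + 1;
      for (home_node = 0; home_node < nodes; home_node++) {
          for (int n = 0; n < nodes; n++)
              split_order[home_node][n] = n;
          qsort(split_order[home_node], nodes, sizeof(int), cmp);
      }
  }
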
>
>The main issue for NUMA is to try to figure out _how_ to accomplish something
>that will work, rather than trying to figure out why it is not going to be
>efficient.  NUMA can work with some thought, from experience.  Lots of
>experience.
>
>>It is not difficult to show that one needs a complete redesign. Up to
>>4 processors you can still get away with a few small changes, but
>>with 8 processors you won't.
>
>
>
>This depends.  There are plenty of 8 and 16 cpu non-NUMA machines around.
>I have run on several with no notable issues.  Not _every_ machine with 8
>or more processors will be NUMA.  Many will, but some won't.  Fortunately
>what works on NUMA can easily be tuned to work even better when the
>architecture is not NUMA.
>
>
>
>>
>>>Cheers,
>>>Leonid.


