Author: Vincent Diepeveen
Date: 09:40:54 08/25/02
On August 25, 2002 at 10:54:08, Robert Hyatt wrote:

>On August 25, 2002 at 09:46:01, Vincent Diepeveen wrote:
>
>>On August 24, 2002 at 19:23:23, leonid wrote:
>>
>>>Hi,
>>>
>>>It looks like an affordable 64-bit chip will come soon, but how soon? If
>>>somebody can give an update on what to expect from this chip, that would
>>>be nice. Every technical detail about the next 64-bit chip is welcome,
>>>like the number of registers and every new and special feature of the
>>>new chip that you know about, and your impression of the new chip's
>>>influence on everything in programming. Is Linux ready for the 64-bit
>>>arrival...
>>
>>All OSes will be ready long before it is on the market; I don't doubt
>>they have been testing both Windows and Linux on it for over a year
>>already.
>>
>>Little has been released about the Hammer (let's skip the McKinley, as
>>that price range is more than my car is worth).
>>
>>What was released a few days ago regarding the Hammer is that simple
>>operations like add, which cost about 4 cycles of latency on the K7,
>>cost only 3 cycles of latency on the Hammer in 32-bit mode. In 64-bit
>>mode it is 4 cycles of latency.
>>
>>There is good reason to assume the 64-bit instructions are 33% slower
>>than the 32-bit ones.
>
>This is thinking about it the _wrong_ way. If you really _need_ to do a
>64-bit operation, then the Hammer is not 33% slower, it is almost twice
>as fast, because it will need one 4-cycle instruction, while using two
>32-bit instructions would be almost twice as slow on a 32-bit machine...

From my viewpoint it definitely is not the wrong way. I'm 32 bits. If I get
a new chip, I want to see whether I can get faster.

Obfuscating a program further, so that the source code gets unreadable but
a single add is not only adding a 4 to the low 32 bits but also another
value to the high 32 bits, is obviously not a new trick. But it definitely
makes a program unreadable (i.e., see bitboards).

Now that chess programs are getting bigger and bigger, at least mine is,
the issue of readability is very important. If you see a chess pattern,
the code must be trivial to read.

That doesn't take away that for me there is a win in the Hammer running
32-bit code versus 64-bit, as it's 3 cycles versus 4. If your data
structure was first slowed down in order to get faster now, I won't stop
you...
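For the record, written out in plain C the two alternatives in Bob's
argument look roughly like this (just a sketch, nothing here is measured,
and the names are made up):

    #include <stdint.h>

    /* How a 32-bit-only machine has to represent a 64-bit value. */
    typedef struct {
        uint32_t lo;
        uint32_t hi;
    } split64;

    /* Emulated 64-bit add: two 32-bit adds plus a carry check.
       On 32-bit x86 this is essentially an add/adc pair. */
    static split64 add64_emulated(split64 a, split64 b) {
        split64 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
        return r;
    }

    /* Native 64-bit add: a single instruction on the Hammer
       in 64-bit mode. */
    static uint64_t add64_native(uint64_t a, uint64_t b) {
        return a + b;
    }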
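And the single-add-updates-two-values trick I mentioned is just packing two
32-bit fields into one 64-bit word, roughly like this (sketch; it only
stays correct as long as the low half cannot carry into the high half):

    #include <stdint.h>

    /* Two 32-bit counters packed into one 64-bit word: a single 64-bit
       add updates both at once. Unreadable, which is exactly my point. */
    static uint64_t add_both(uint64_t packed, uint32_t lo_inc,
                             uint32_t hi_inc) {
        return packed + (((uint64_t) hi_inc << 32) | lo_inc);
    }

    /* e.g. add 4 to the low half and 7 to the high half in one add:
       packed = add_both(packed, 4, 7); */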
>The challenge on a 64-bit machine is _not_ in running a 32-bit program on
>it. That would be pointless. The challenge is really using 64-bit words
>so that the 2x data density inside the CPU actually gains you something.

>>Obviously different versions of the Hammer get released for different
>>markets. The default Hammer is going to run dual, but a version with way
>>more L2 cache will run up to 8 processors (if I understand all this
>>well).
>>
>>Kind of the 'Xeon' versus 'P3' principle (the P4 only runs dual at
>>most).
>>
>>I wouldn't count on the 1 to 2 MB L2 cache Hammers being within the
>>price range of the average poster here at CCC. Hopefully 8-CPU-capable
>>versions will be there with 256KB L2 cache.
>
>The reason for the larger caches on Xeons is pretty obvious. It is a way
>to offset the memory bandwidth issue by locating more data right beside
>the CPU rather than hundreds of nanoseconds away in RAM.
>
>You have two choices: bigger cache or more memory bandwidth. The former
>is actually far cheaper than the latter.
>
>>The multiprocessor versions of the Hammer will be what most guess it is.
>>It is more like a NUMA system. Getting data from local memory is very
>>fast (faster than RAM is now), but on the other hand getting data from
>>other nodes is a lot slower.
>
>I don't think local memory is going to be faster. DRAM still has a
>definite latency that can't be reduced so long as capacitors are the
>choice to store bits of info.
>
>I also don't think that low-end dual/quad/8-way boxes are going to be
>NUMA. That won't keep the price low enough. NUMA switches are non-trivial
>in terms of cost, even for a small number of connections, because the
>switch has to be scalable to larger numbers of processors for higher
>performance.
>
>>So Crafty needs a complete redesign in order to be faster on 8
>>processors than it is on 1 on this system. I'm taking this example
>>because Crafty, in comparison to other programs, is getting a decent
>>speedup at 4 processors with fast memory access right now.
>
>It really doesn't need a complete redesign. The current approach is
>actually not that far off from what is needed. The real issues for _any_
>program such as Crafty are as follows:
>
>1. Probably a good bit of data replication, i.e. the attack generation
>tables, evaluation tables, any memory that is "constant" in nature for a
>search. Replicating this gets the frequently used constant data as close
>to the CPU as is physically possible, so that there is no penalty
>whatsoever for this kind of data, which is actually fairly large. Things
>like Zobrist random numbers also fit this.
>
>2. Changing the way split blocks are allocated. They will need to be
>allocated scattered all over the NUMA cluster, rather than in one chunk
>in one processor's local memory. Doing this isn't too hard.
>
>3. Changing how splits are done. When a processor splits, the split block
>is going to be in its local memory, which is logical. It needs to split
>with processors that are "close" to it on the NUMA switch, because they
>will also need to access the data, and it needs to be as close to them as
>is physically possible to reduce latency. I already have some controls
>for this, but I would probably add a "processor directory" for each
>processor, showing each the best group of processors to split with...
>
>4. Doing 3 also solves a lot of issues with locks. Locks need to be as
>close to a NUMA processor as physically possible, for obvious reasons.
>
>Doing those things is (a) not really that difficult and (b) will produce
>a pretty reasonable NUMA application. I've been working with Compaq off
>and on, studying their NUMA Alpha implementation.
>
>The main issue for NUMA is to try to figure out _how_ to accomplish
>something that will work, rather than trying to figure out why it is not
>going to be efficient. NUMA can work with some thought, from experience.
>Lots of experience.
>
>>It is not difficult to show that one needs a complete redesign. Up to
>>4 processors you can still get away with a few small changes, but with
>>8 processors you won't.
>
>This depends. There are plenty of 8- and 16-CPU non-NUMA machines around.
>I have run on several with no notable issues. Not _every_ machine with 8
>or more processors will be NUMA. Many will, but some won't. Fortunately,
>what works on NUMA can easily be tuned to work even better when the
>architecture is not NUMA.
>
>>>Cheers,
>>>Leonid.
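For concreteness: what Bob describes in point 2 above, scattering the
split blocks over the nodes, could look roughly like this with a
libnuma-style interface (numa_available(), numa_max_node() and
numa_alloc_onnode() are real libnuma calls; the SplitBlock type and the
block count are made up for illustration):

    #include <numa.h>   /* libnuma-style API; link with -lnuma */

    #define MAX_BLOCKS 64

    typedef struct {
        int used;
        /* ... the rest of the shared search state at a split point ... */
    } SplitBlock;

    SplitBlock *split_blocks[MAX_BLOCKS];

    /* Scatter the split blocks round-robin over all NUMA nodes instead
       of allocating them in one chunk in one processor's local memory. */
    int allocate_split_blocks(void) {
        int i, nodes;
        if (numa_available() < 0)
            return -1;                 /* not a NUMA system */
        nodes = numa_max_node() + 1;
        for (i = 0; i < MAX_BLOCKS; i++) {
            split_blocks[i] = numa_alloc_onnode(sizeof(SplitBlock),
                                                i % nodes);
            if (split_blocks[i] == NULL)
                return -1;
            split_blocks[i]->used = 0;
        }
        return 0;
    }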
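And the "processor directory" of point 3 can be as simple as a
per-processor list of the other processors sorted by NUMA distance, built
once at startup (again a sketch; numa_distance() is real libnuma, while
home_node[] and NCPUS are made up and would come from the actual
topology):

    #include <numa.h>
    #include <stdlib.h>

    #define NCPUS 8

    static int home_node[NCPUS];     /* NUMA node of each processor;
                                        filled in from the topology */
    int directory[NCPUS][NCPUS - 1]; /* directory[p]: split partners for
                                        processor p, nearest first */

    static int sort_from;            /* processor we are sorting for */

    static int by_distance(const void *a, const void *b) {
        int pa = *(const int *) a, pb = *(const int *) b;
        return numa_distance(home_node[sort_from], home_node[pa]) -
               numa_distance(home_node[sort_from], home_node[pb]);
    }

    /* Build, for every processor, the list of preferred split partners,
       nearest on the NUMA switch first. */
    void build_processor_directory(void) {
        int p, q, n;
        for (p = 0; p < NCPUS; p++) {
            for (n = 0, q = 0; q < NCPUS; q++)
                if (q != p)
                    directory[p][n++] = q;
            sort_from = p;
            qsort(directory[p], NCPUS - 1, sizeof(int), by_distance);
        }
    }

At a split point a processor would then just walk its own directory[] row
and grab the first idle processors it finds, which also keeps the locks in
the split block as close to everybody as the topology allows (Bob's
point 4).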