Author: Robert Hyatt
Date: 18:39:55 08/25/02
On August 25, 2002 at 12:40:54, Vincent Diepeveen wrote:

>On August 25, 2002 at 10:54:08, Robert Hyatt wrote:
>
>>On August 25, 2002 at 09:46:01, Vincent Diepeveen wrote:
>>
>>>On August 24, 2002 at 19:23:23, leonid wrote:
>>>
>>>>Hi,
>>>>
>>>>It looks like affordable 64-bit chips will come soon, but how soon? If
>>>>somebody can give an update on what to expect from these chips, that
>>>>would be nice. Every technical detail about the coming 64-bit chips is
>>>>welcome: the number of registers, every new and special feature you know
>>>>about, and your impression of how the new chips will influence
>>>>programming. Is Linux ready for the 64-bit arrival...
>>>
>>>All OSes will be ready long before it is on the market; I don't doubt
>>>they have been testing both Windows and Linux on it for over a year now.
>>>
>>>Little has been released about the Hammer (let's skip McKinley, as that
>>>price range is more than my car is worth).
>>>
>>>What was released a few days ago regarding the Hammer is that a simple
>>>operation like add, which costs about 4 cycles of latency on the K7,
>>>costs only 3 cycles of latency on the Hammer in 32-bit mode. In 64-bit
>>>mode it is 4 cycles of latency.
>>>
>>>There is good reason to assume the 64-bit instructions are 33% slower
>>>than the 32-bit ones.
>>
>>This is thinking about it the _wrong_ way. If you really _need_ to do a
>>64-bit operation, then the Hammer is not 33% slower, it is almost twice
>>as fast, because it needs one 4-cycle instruction, while using two 32-bit
>>instructions would be almost twice as slow on a 32-bit machine...
>
>From my viewpoint it definitely is not the wrong way. I'm 32 bits.

Then for _you_ 64-bit machines might not make sense. But for those of us
using 64-bit stuff, they make a lot of sense... If you sit around with a
32-bit engine for 10 years, you will be left behind as the 64-bit chips
replace everything and you get no benefit. You have to design the algorithm
to maximize performance on a given architecture. If you don't, you simply
have to tolerate the performance hit...
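A minimal sketch in C of that one-instruction-versus-two point (the names
here are illustrative only, not from any actual program): on a 32-bit
machine a 64-bit add has to be emulated by two dependent 32-bit operations,
a low add plus a high add that consumes the carry, while a 64-bit machine
does the whole thing in one instruction.

  #include <stdint.h>

  typedef struct {
      uint32_t lo, hi;                  /* one 64-bit value in two 32-bit words */
  } u64pair;

  /* emulated: two dependent 32-bit operations */
  static u64pair add64_emulated(u64pair a, u64pair b)
  {
      u64pair r;
      r.lo = a.lo + b.lo;
      r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low half */
      return r;
  }

  /* native: a single instruction on a 64-bit CPU */
  static uint64_t add64_native(uint64_t a, uint64_t b)
  {
      return a + b;
  }

The second add can't start until the carry from the first is known, which
is why the single native instruction ends up nearly twice as fast when you
genuinely need 64-bit arithmetic.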
>If I get a new chip I want to see whether I can get faster. By encrypting
>a program more, such that the source code gets unreadable but a single add
>is not only adding a 4 to the low 32 bits but also another value to the
>high 32 bits, that's obviously not a new trick. But it definitely makes a
>program unreadable (i.e. see bitboards).

Bitboards don't make a program unreadable to me. I read mine every day.
Others do as well. APL might make a program unreadable until you learn APL
and become proficient in it, with enough experience to back you up so that
things don't look like gibberish...

>
>Now that chess programs are getting bigger and bigger, at least mine is,
>the issue of readability is very important. If you see a chess pattern,
>then the code must be trivial to read.
>

There is this thing called "a comment" that lets you be as clear as you
want in explaining what a piece of code is doing... :)

>Doesn't take away that for me there is a win in the Hammer running 32 bits
>versus 64, as it's 3 cycles versus 4.
>
>If your data structure was first slowed down in order to get faster now, I
>won't stop you...

It wasn't "slowed down in order to get faster now." It was designed with
the future in mind. 64 bits has been "the future" since the first Cray in
the 1970s...

>
>>The challenge on a 64-bit machine is _not_ in running a 32-bit program on
>>it. That would be pointless. The challenge is really using 64-bit words
>>so that the 2x data density inside the CPU actually gains you something.
>
>>>
>>>Obviously different versions of the Hammer get released for different
>>>markets. The default Hammer is going to run dual, but a version with way
>>>more L2 cache will run up to 8 processors (if I understand all this
>>>well).
>>>
>>>Kind of the 'Xeon' versus 'P3' principle (the P4 only runs dual at
>>>most).
>>>
>>>I wouldn't count on the 1 to 2 MB L2 cache Hammers being within the
>>>price range of the average poster here at CCC. Hopefully 8-cpu-capable
>>>versions will be there with 256KB L2 cache.
>>
>>The reason for the larger caches on Xeons is pretty obvious. It is a way
>>to offset the memory bandwidth issue by locating more data right beside
>>the CPU rather than hundreds of nanoseconds away in RAM.
>>
>>You have two choices: bigger cache or more memory bandwidth. The former
>>is actually far cheaper than the latter.
>>
>>>
>>>The multiprocessor versions of the Hammer will be what most guess they
>>>are. It is more like a NUMA system. Getting data from local memory is
>>>very fast (faster than RAM is now), but on the other hand getting data
>>>from other nodes is a lot slower.
>>
>>I don't think local memory is going to be faster. DRAM still has a
>>definite latency that can't be reduced so long as capacitors are the
>>choice to store bits of info.
>>
>>I also don't think that low-end dual/quad/8-way boxes are going to be
>>NUMA. That won't keep the price low enough. NUMA switches are
>>non-trivial in terms of cost, even for a small number of connections,
>>because the switch has to be scalable to larger numbers of processors for
>>higher performance.
>>
>>>
>>>So Crafty needs a complete redesign in order to be faster on 8
>>>processors than it is on 1 on this system. I'm taking this example
>>>because Crafty, in comparison to other programs, is getting a decent
>>>speedup at 4 processors with fast memory access right now.
>>
>>It really doesn't need a complete redesign. The current approach is
>>actually not that far off from what is needed. The real issues for _any_
>>program such as Crafty are as follows:
>>
>>1. Probably a good bit of data replication, i.e. the attack generation
>>tables, evaluation tables, any memory that is "constant" in nature for a
>>search (see the sketch after this list). Replicating this gets the
>>frequently used constant data as close to the CPU as is physically
>>possible, so that there is no penalty whatsoever for this kind of data,
>>which is actually fairly large. Things like Zobrist random numbers also
>>fit this.
>>
>>2. Changing the way split blocks are allocated. They will need to be
>>allocated scattered all over the NUMA cluster, rather than in one chunk
>>in one processor's local memory. Doing this isn't too hard.
>>
>>3. Changing how splits are done. When a processor splits, the split
>>block is going to be in its local memory, which is logical. It needs to
>>split with processors that are "close" to it on the NUMA switch, because
>>they will also need to access the data, and it needs to be as close to
>>them as is physically possible to reduce latency. I already have some
>>controls for this, but I would probably add a "processor directory" for
>>each processor, showing each the best group of processors to split
>>with...
>>
>>4. Doing 3 also solves a lot of issues with locks. Locks need to be as
>>close to a NUMA processor as physically possible, for obvious reasons.
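As one concrete way to express point 1 on Linux, here is a minimal sketch
using the libnuma calls numa_available, numa_max_node and numa_alloc_onnode
(an interface newer than this discussion); the table size and names are
hypothetical stand-ins for whatever constant data the search reads:

  /* Replicate a read-only table once per NUMA node, so a thread pinned
     to a node always reads its local copy and never pays remote-memory
     latency for constant data.  Link with -lnuma. */

  #include <numa.h>
  #include <stdlib.h>
  #include <string.h>

  #define TABLE_BYTES (64 * 12 * sizeof(unsigned long long))  /* hypothetical */

  static void **local_copy;            /* local_copy[n] = node n's copy */

  int replicate_table(const void *master)
  {
      int n, nodes;

      if (numa_available() < 0)
          return -1;                   /* not a NUMA machine */
      nodes = numa_max_node() + 1;
      local_copy = malloc(nodes * sizeof(void *));
      for (n = 0; n < nodes; n++) {
          local_copy[n] = numa_alloc_onnode(TABLE_BYTES, n);
          if (local_copy[n] == NULL)
              return -1;
          memcpy(local_copy[n], master, TABLE_BYTES);
      }
      return 0;
  }

Each search thread then looks up its own node number once at startup and
reads through local_copy[node] from then on.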
>>
>>Doing those things is (a) not really that difficult and (b) will produce
>>a pretty reasonable NUMA application. I've been working with Compaq off
>>and on, studying their NUMA alpha implementation.
>>
>>The main issue for NUMA is to try to figure out _how_ to accomplish
>>something that will work, rather than trying to figure out why it is not
>>going to be efficient. NUMA can work with some thought, from experience.
>>Lots of experience.
>>
>>>It is not difficult to show that one needs a complete redesign. Up to
>>>4 processors you can still get away with a few small changes, but with
>>>8 processors you won't.
>>
>>This depends. There are plenty of 8 and 16 cpu non-NUMA machines around.
>>I have run on several with no notable issues. Not _every_ machine with 8
>>or more processors will be NUMA. Many will, but some won't. Fortunately,
>>what works on NUMA can easily be tuned to work even better when the
>>architecture is not NUMA.
>>
>>>
>>>>Cheers,
>>>>Leonid.
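To make point 3 above concrete, a sketch of what such a processor directory
might look like; everything here is a hypothetical illustration, not code
from Crafty. Note that on a flat (non-NUMA) SMP box all the distances are
equal, so the same code degenerates to "pick any idle processor," which is
exactly the sense in which a NUMA-aware design also works on non-NUMA
hardware.

  /* A per-processor split directory: split_order[p] lists the other
     processors nearest-first by memory distance from p, filled in from
     the machine topology at startup.  When p wants to split, it offers
     the work to the nearest idle processor, so the shared split block
     stays as close as possible to everyone who will touch it. */

  #define MAX_CPUS 16

  static int          split_order[MAX_CPUS][MAX_CPUS];
  static volatile int idle[MAX_CPUS];  /* 1 if that CPU is waiting for work */

  int pick_split_partner(int self, int ncpus)
  {
      int i, cand;

      for (i = 0; i < ncpus; i++) {
          cand = split_order[self][i];
          if (cand != self && idle[cand])
              return cand;             /* nearest idle processor wins */
      }
      return -1;                       /* nobody available; don't split */
  }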