Author: Robert Hyatt
Date: 09:38:16 08/05/05
On August 05, 2005 at 08:21:23, Vincent Diepeveen wrote:

>On August 04, 2005 at 10:48:05, Robert Hyatt wrote:
>
>>On August 04, 2005 at 08:16:24, Vincent Diepeveen wrote:
>>
>>>On August 04, 2005 at 02:50:32, Mimic wrote:
>>>
>>>>On August 04, 2005 at 02:37:20, Mark Jones wrote:
>>>>
>>>>>Can you imagine how Junior, Shredder, or Fritz would have played if they were deployed on a supercomputer like this:
>>>>>http://www.top500.org/sublist/System.php?id=7605
>>>>>
>>>>>If this were possible, not only would it beat all the humans, I think it would have crushed Hydra too... What do you think about it? And has there been an attempt to deploy a PC program on a supercomputer?
>>>>
>>>>How many Rpeak (GFlops) or Rmax (GFlops) does a normal personal computer deliver?
>>>
>>>The Opteron delivers 2 flops per cycle per core.
>>>
>>>Without using a calculator: a 2.2GHz dual-core Opteron delivers 2.2 * 2 * 2 = 8.8 gflop, so a quad Opteron delivers 4 * 8.8 = 35.2 gflop.
>>>
>>>However, the comparison is not fair. IBM always quotes single precision calculations, whereas the majority of researchers use double precision floating point.
>>
>>I do not believe that is true. I'm not aware of _any_ floating point hardware
>
>Please check online sources; the 256 gflop quoted for the CELL processor is based on single precision floating point.

Who mentioned the "Cell" processor? I'm talking about production processors dating back to the IBM RS6000 and since...

>>today that does 32 bit IEEE math. Every processor I have seen does internal calculations in 64 (actually 80) bits of precision. From the early IBM RS 6000
>
>On x87 floating point processors it is 80 bits, for example. That will however not be there anymore in the future. The majority of the gflops of chips such as the Itanium2 usually do not even follow the ANSI C specifications.

What does ANSI C have to do with IEEE floating point, which everyone today _does_ follow?
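The back-of-envelope peak-flops arithmetic quoted above (flops per cycle, times cores, times clock rate) can be sketched in a few lines. This is a Python sketch of the arithmetic only; the figure of 2 flops per cycle per Opteron core is taken from the post, and the function name is just illustrative.

```python
# Theoretical peak GFLOPS = clock (GHz) * cores * flops per cycle.
# The 2 flops/cycle figure for the Opteron is the one quoted in the post.

def peak_gflops(ghz, cores, flops_per_cycle=2):
    """Theoretical peak, in GFLOPS, for one chip."""
    return ghz * cores * flops_per_cycle

dual_core = peak_gflops(2.2, cores=2)   # one 2.2GHz dual-core Opteron: 8.8
quad_box = 4 * dual_core                # four such chips in a quad box: 35.2
print(dual_core, quad_box)
```

Note that this is the *paper* peak the thread is arguing about; sustained throughput on real code is usually far below it.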
>See my postings in hardware groups that have proven this.
>
>For example, the Intel C++ compiler has a special flag which you have to give it in order to calculate floating point with more precision than the default 'shortcuts' it takes.
>
>>this has been true. I once spent a couple of days trying to understand why using REAL*4 vs REAL*8 in a FORTRAN program made _zero_ difference in how fast
>
>Yeah, you're still living in the 80s.
>
>To wake you up: nowadays we have things like SSE and SSE2, and also 3DNow!.
>
>For example, your favourite Opteron hardware has 3DNow!, which virtually splits each 64 bit register into 2 single precision floating point values.
>
>See for example the "AMD64 Architecture Programmer's Manual, Volume 1: Application Programming", page 270, section 5.3.6, "Floating point 3DNow!".
>
>It has a clear diagram to illustrate how vector operations work on the PC nowadays, and why, executing 1 instruction per cycle with 2 cycle latency, they can still reach 2 gflop single precision.
>
>>it ran, where on an older IBM /370 it made a significant difference since that box actually had 32 bit and 64 bit hardware.
>
>A 64x64 bit multiplication on the Opteron takes 2 cycles and has a 128 bit result. With SSE2 you cannot reach that accuracy, but you can do 2 flops per cycle.
>
>For some prime searching software of mine, the 64x64 bit multiplication delivering 128 bits is more useful than 2 multiplications of 64x64 bits delivering 64.
>
>So there is a problem here: more bits would be more useful, but the hardware just doesn't deliver them.
>
>>>If you want to know the exact definition of what a double precision floating point is, look in the ANSI C definitions.
>>>
>>>In reality the researchers assume a 64 bit double times a 64 bit double delivering a 64 bit double.
>>>
>>>In reality single precision is less than 32 bits times 32 bits delivering less than 32 bits worth of information.
>>
>>Why less than?
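The 64x64 -> 128 bit multiply discussed above can be illustrated with a short sketch. This is a simulation only: Python's arbitrary-precision integers stand in for the hardware registers, and the function name is just illustrative.

```python
# Simulate a full 64x64 -> 128 bit unsigned multiply, versus the
# truncating multiply that keeps only the low 64 bits of the product.
MASK64 = (1 << 64) - 1

def mul_64x64(a, b):
    """Full 128-bit product of two 64-bit values, as (high, low) halves."""
    p = (a & MASK64) * (b & MASK64)
    return (p >> 64) & MASK64, p & MASK64

a = (1 << 64) - 1                 # largest 64-bit value
hi, lo = mul_64x64(a, a)          # (2**64 - 1)**2 = 2**128 - 2**65 + 1
print(hex(hi), hex(lo))           # high half 0xfffffffffffffffe, low half 0x1

truncated = (a * a) & MASK64      # a 64-bit-result multiply loses the high half
assert truncated == lo
```

This is exactly the point being made: for big-number work (like prime searching) the high 64 bits of the product carry real information, and an instruction set that only returns the low half forces you to reconstruct them with extra operations.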
>>You lose exponent bits in either length, but the exponent is not "lost information"...
>
>For the ANSI C specifications on how precise floating point must be, see the ANSI C definitions, page 455, '7. Library, Annex E'.
>
>It just has to represent something close to 1.000000001 * 10^37 for a double, which fits in less than 64 bits: a 10 digit mantissa, and a power up to 37.
>
>I might be a few bits off, but not many:
> 10 digits = 33 bits
> -37..37 = 9 bits

You are off by more than that, as no single precision can do 10 digits. IEEE has a 23+1 bit fraction. And note that the ANSI C group does not drive the IEEE standard. The IEEE standard has been around for a _long_ time. The C folks can write what they want, but the FP guys are going to deliver IEEE.

>With 42-43 bits you already get very far.

IEEE 64 bit has a 52+1 bit fraction...

>So a double (8 byte type) has no need for 80 bits of precision. Please note that long double has the same definition.

Who cares? The IEEE internal standards use 80 bits for intermediate results, and store 64 bits for a final result. But the FP registers in the PC are 80 bits wide.

>To give one trivial example of how you can be faster with fewer bits:
>
>Most of those highend chips do not have a division instruction (for example the Itanium2), and therefore, for divisions, they try to get away with less accuracy than ANSI C requires.
>
>Of course that speeds up software bigtime, because otherwise such a division approximation eats something like 46 cycles.
>
>The Itanium2 has a hardware instruction that produces an estimate of 1/x, and it keeps the data within registers. With an algorithm they can then approximate a division. The longer this algorithm runs, the more accurate the approximation of the division is.

Just like all the Crays...
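The digit-count dispute above is easy to check empirically: a 23+1 bit fraction resolves roughly 7 decimal digits, well short of 10. A sketch using Python's `struct` module to round a value through IEEE single precision (the helper name is just illustrative):

```python
import struct

def to_float32(x):
    """Round a Python float (an IEEE double) through IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Single precision has a 23+1 bit fraction, about 7 decimal digits:
print(to_float32(1.000001) != 1.0)     # True: a change in the 7th digit survives
print(to_float32(1.00000001) == 1.0)   # True: a change in the 9th digit is lost
```

So a value like 1.000000001, with 10 significant digits, cannot round-trip through single precision; it needs a double.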
>If you realize that it can in theory execute 6 instructions per cycle (though only 2 integer instructions per cycle, which explains why the Itanium2 is not so fast for computer chess; Montecito should change that, but is of course too expensive for us to take seriously) and then lose somewhere around 50 cycles, that's a potential loss of 300 instructions. So in such codes it really matters a lot to make the division cheaper, by losing accuracy.
>
>There are many such examples.
>
>I refer for example to Professor Aad van der Steen (high performance supercomputing) and his report of 1 July 2003, where he reports loss of accuracy with default compiler settings on the Itanium2. See www.sara.nl
>
>>>Major cheating happens in those areas of course; for example with high end processors like the Itanium2, Intel forgot to put a divide instruction on it.
>
>>Design decision. Cray computers had no divide either. That never caused them to be less than the fastest floating point processors of their time...
>
>For your information, Cray is selling Opteron clusters nowadays. Their main processor therefore is the Opteron.

For your information, I am talking about the _Cray_ processor. Not the machines they sell with other processors. The "Cray" architecture, to be specific... If I had been talking about Opterons, or SPARCs, or Alphas, I would have said so. Cray has sold all of those. But they also have their own unique architecture for their "big iron".

>>>So they can do divisions in certain test programs faster by using some approximation algorithm delivering fewer decimals.
>>>
>>>So all those gflops mentioned are basically multiply-add combinations.
>>>
>>>The CELL processor is supposed to deliver 256 gflop single precision; this is however less than 30 gflop double precision.
>>>
>>>In reality software isn't optimal, so it will be less than 30 gflop.
>>>
>>>Still it is impressive for a processor that is supposed to get cheap.
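The divide-by-refinement scheme discussed above (a hardware reciprocal estimate polished in software, as on the Itanium2 and the classic Crays) is typically Newton-Raphson iteration. A minimal sketch, assuming a deliberately crude fixed seed where real hardware would use a table-lookup estimate; the function name and iteration count are just illustrative:

```python
def reciprocal(d, iterations=6, x0=0.1):
    """Approximate 1/d by Newton-Raphson: x <- x * (2 - d*x).
    Each iteration roughly doubles the number of correct bits, which is
    exactly why stopping early trades accuracy for speed.  Converges only
    when the seed satisfies 0 < x0 < 2/d (hardware seeds from a table)."""
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

# A division a / b then becomes a multiply by the refined reciprocal:
approx = 7.0 * reciprocal(3.0)
print(approx)    # very close to 7/3 = 2.333...
```

This also makes the accuracy complaint in the post concrete: a compiler that emits fewer refinement iterations gets a faster "division" that is correct in fewer bits than a true IEEE divide.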
>>>The expensive 1.5GHz Itanium2 delivers for example 7 gflop on paper. That's also just paper. SGI, when presenting results at the 1 July 2003 presentation of the 416 processor 1.3GHz Itanium2 machine, made public that effectively it is 2 times faster in gflops for most applications than the previous 500MHz MIPS R14000.
>>>
>>>On paper the MIPS delivers 1 gflop at 500MHz, and on paper the 1.3GHz Itanium2 delivers 5.2 gflop.
>>>
>>>In practice: 2 times faster, according to SGI.
>>>
>>>NASA initially had a similar report for their own software when running on a 512 processor partition.
>>>
>>>So you have to take all those gflops with some reservation. The reality is that those supercomputers usually idle 70% of the time in the first year, 50% in the second and third year, and when they are outdated in the fourth year they still idle 30%. That is with all reserved time counted as used and all 'system processors' not taken into account. In reality they idle more.
>>>
>>>So many of those supercomputers are paper heroes which the researchers literally use to "run their application faster than it would run on a PC".
>>>
>>>There are very few applications that are optimized to the utmost. Certain matrix calculation libraries are pretty good and pretty optimal for this.
>>>
>>>For those researchers those gflops *really* matter.
>>>
>>>You can count them on one hand.
>>
>>Wrong. You need to get out more. Labs like Livermore and Los Alamos have _thousands_ of carefully hand-optimized programs for specific computer architectures. They care whether a program runs in 2 weeks or 6 weeks. Or longer.
>
>A handful of applications on the planet is well optimized. 99.9% isn't.
>
>Los Alamos is basically using some matrix libraries which are well optimized; see above.

Absolutely wrong, but I'm not going to go into why. Los Alamos has many applications, written in-house, that are millions of lines of code _each_.

>Your 'wrong' doesn't apply at all.
>>>What matters is that they have the POSSIBILITY to run their application really, really fast if they want to, and that is really important.
>>>
>>>This big 12288 processor IBM 'Blue Gene' supercomputer (6 racks of 2048 processors) has a cost price of just 6 million euro.
>>>
>>>That's very little if you consider the huge calculation power it delivers for those researchers it matters to.
>>>
>>>The best usages of those supercomputers are nuclear explosions (I did not say the University of Groningen is running nuclear simulations) and calculating, for example, where the electrons in materials are.
>>>
>>>Usually the homepage promotes 'biological' supercomputing. In reality very little system time goes to medicine and biological research, about 0.5% of system time, according to the supercomputer report for Europe (covering all scientific supercomputers in all of Europe).
>>>
>>>An amazing amount of system time goes to all kinds of weather or extreme climate simulations. They have been calculating, worldwide, so much already about what the height of the sea water will become. I became really sick of that, as I could not test Diep until the world championship itself, because some weather simulation was running nonstop.
>>>
>>>After they had run on 350+ processors for months (450000 CPU hours or so, according to the official project papers), and after they had created a new discovery series from the output showing that the sea water would rise 1 meter in the coming 100 years, they discovered a small bug in the initializing data.
>>>
>>>They had initialized the sea water 1 meter too high when starting the test half a year earlier.
>>>
>>>This was the reason Diep ran buggy in the first 7 rounds of the World Championship 2003.
>>>
>>>Vincent