Author: Robert Hyatt
Date: 09:38:16 08/05/05
On August 05, 2005 at 08:21:23, Vincent Diepeveen wrote:

>On August 04, 2005 at 10:48:05, Robert Hyatt wrote:
>
>>On August 04, 2005 at 08:16:24, Vincent Diepeveen wrote:
>>
>>>On August 04, 2005 at 02:50:32, Mimic wrote:
>>>
>>>>On August 04, 2005 at 02:37:20, Mark Jones wrote:
>>>>
>>>>>Can you imagine how Junior, Shredder, or Fritz would have played if they were deployed on a supercomputer like this:
>>>>>http://www.top500.org/sublist/System.php?id=7605
>>>>>
>>>>>If this were possible, not only would it beat all the humans, I think it would have crushed Hydra too... What do you think about it? And has there been an attempt to deploy a PC program on a supercomputer?
>>>>
>>>>How many Rpeak (GFlops) or Rmax (GFlops) does a normal personal computer deliver?
>>>
>>>The Opteron delivers 2 flops per cycle per core.
>>>
>>>Without using a calculator: a 2.2GHz dual-core Opteron delivers 2.2 * 2 * 2 = 8.8 gflop, so a quad Opteron delivers 4 * 8.8 = 35.2 gflop.
>>>
>>>However, the comparison is not fair. IBM always quotes single precision calculations, whereas the majority of researchers use double precision floating point.
>>
>>I do not believe that is true. I'm not aware of _any_ floating point hardware
>
>Please check online sources; the 256 gflop quoted for the CELL processor is based on single precision floating point.

Who mentioned the "Cell" processor? I'm talking about production processors dating back to the IBM RS6000 and since...

>>today that does 32 bit IEEE math. Every processor I have seen does internal calculations in 64 (actually 80) bits of precision. From the early IBM RS 6000
>
>On x87 floating point processors it is 80 bits, for example. That will however not be there anymore in the future. The majority of the gflops of chips such as the Itanium2 usually do not even follow the ANSI C specifications.

What does ANSI C have to do with IEEE floating point, which everyone today _does_ follow?
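The back-of-envelope peak-flops arithmetic quoted above (flops per cycle, times cores, times clock rate) can be sketched in a few lines. This is a Python sketch of the arithmetic only; the figure of 2 flops per cycle per Opteron core is taken from the post, and the function name is just illustrative.

```python
# Theoretical peak GFLOPS = clock (GHz) * cores * flops per cycle.
# The 2 flops/cycle figure for the Opteron is the one quoted in the post.

def peak_gflops(ghz, cores, flops_per_cycle=2):
    """Theoretical peak, in GFLOPS, for one chip."""
    return ghz * cores * flops_per_cycle

dual_core = peak_gflops(2.2, cores=2)   # one 2.2GHz dual-core Opteron: 8.8
quad_box = 4 * dual_core                # four such chips in a quad box: 35.2
print(dual_core, quad_box)
```

Note that this is the *paper* peak the thread is arguing about; sustained throughput on real code is usually far below it.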
>See my postings in hardware groups that have proven this.
>
>For example, the Intel C++ compiler has a special flag which you have to give it in order to calculate floating point with more precision than the default 'shortcuts' it takes.
>
>>this has been true. I once spent a couple of days trying to understand why using REAL*4 vs REAL*8 in a FORTRAN program made _zero_ difference in how fast
>
>Yeah, you're still living in the 80s.
>
>To wake you up: nowadays we have things like SSE and SSE2, and also 3DNow!.
>
>For example, your favourite Opteron hardware has 3DNow!, which virtually splits each 64 bit register into 2 single precision floating point values.
>
>See for example the "AMD64 Architecture Programmer's Manual, Volume 1: Application Programming", page 270, section 5.3.6, "Floating point 3DNow!".
>
>It has a clear diagram to illustrate how vector operations work on the PC nowadays, and why, executing 1 instruction per cycle with 2 cycle latency, they can still reach 2 gflop single precision.
>
>>it ran, where on an older IBM /370 it made a significant difference since that box actually had 32 bit and 64 bit hardware.
>
>A 64x64 bit multiplication on the Opteron takes 2 cycles and has a 128 bit result. With SSE2 you cannot reach that accuracy, but you can do 2 flops per cycle.
>
>For some prime searching software of mine, the 64x64 bit multiplication delivering 128 bits is more useful than 2 multiplications of 64x64 bits delivering 64.
>
>So there is a problem here: more bits would be more useful, but the hardware just doesn't deliver them.
>
>>>If you want to know the exact definition of what a double precision floating point is, look in the ANSI C definitions.
>>>
>>>In reality the researchers assume a 64 bit double times a 64 bit double delivering a 64 bit double.
>>>
>>>In reality single precision is less than 32 bits times 32 bits delivering less than 32 bits worth of information.
>>
>>Why less than?
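The 64x64 -> 128 bit multiply discussed above can be illustrated with a short sketch. This is a simulation only: Python's arbitrary-precision integers stand in for the hardware registers, and the function name is just illustrative.

```python
# Simulate a full 64x64 -> 128 bit unsigned multiply, versus the
# truncating multiply that keeps only the low 64 bits of the product.
MASK64 = (1 << 64) - 1

def mul_64x64(a, b):
    """Full 128-bit product of two 64-bit values, as (high, low) halves."""
    p = (a & MASK64) * (b & MASK64)
    return (p >> 64) & MASK64, p & MASK64

a = (1 << 64) - 1                 # largest 64-bit value
hi, lo = mul_64x64(a, a)          # (2**64 - 1)**2 = 2**128 - 2**65 + 1
print(hex(hi), hex(lo))           # high half 0xfffffffffffffffe, low half 0x1

truncated = (a * a) & MASK64      # a 64-bit-result multiply loses the high half
assert truncated == lo
```

This is exactly the point being made: for big-number work (like prime searching) the high 64 bits of the product carry real information, and an instruction set that only returns the low half forces you to reconstruct them with extra operations.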
>>You lose exponent bits in either length, but the exponent is not "lost information"...
>
>For the ANSI C specifications on how precise floating point must be, see the ANSI C definitions, page 455, '7. Library, Annex E'.
>
>It just has to represent something close to 1.000000001 * 10^37 for a double, which fits in less than 64 bits: a 10 digit mantissa, and a power up to 37.
>
>I might be a few bits off, but not many:
> 10 digits = 33 bits
> -37..37 = 9 bits

You are off by more than that, as no single precision can do 10 digits. IEEE has a 23+1 bit fraction. And note that the ANSI C group does not drive the IEEE standard. The IEEE standard has been around for a _long_ time. The C folks can write what they want, but the FP guys are going to deliver IEEE.

>With 42-43 bits you already get very far.

IEEE 64 bit has a 52+1 bit fraction...

>So a double (8 byte type) has no need for 80 bits of precision. Please note that long double has the same definition.

Who cares? The IEEE internal standards use 80 bits for intermediate results, and store 64 bits for a final result. But the FP registers in the PC are 80 bits wide.

>To give one trivial example of how you can be faster with fewer bits:
>
>Most of those highend chips do not have a division instruction (for example the Itanium2), and therefore, for divisions, they try to get away with less accuracy than ANSI C requires.
>
>Of course that speeds up software bigtime, because otherwise such a division approximation eats something like 46 cycles.
>
>The Itanium2 has a hardware instruction that produces an estimate of 1/x, and it keeps the data within registers. With an algorithm they can then approximate a division. The longer this algorithm runs, the more accurate the approximation of the division is.

Just like all the Crays...
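The digit-count dispute above is easy to check empirically: a 23+1 bit fraction resolves roughly 7 decimal digits, well short of 10. A sketch using Python's `struct` module to round a value through IEEE single precision (the helper name is just illustrative):

```python
import struct

def to_float32(x):
    """Round a Python float (an IEEE double) through IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Single precision has a 23+1 bit fraction, about 7 decimal digits:
print(to_float32(1.000001) != 1.0)     # True: a change in the 7th digit survives
print(to_float32(1.00000001) == 1.0)   # True: a change in the 9th digit is lost
```

So a value like 1.000000001, with 10 significant digits, cannot round-trip through single precision; it needs a double.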
>If you realize that it can in theory execute 6 instructions per cycle (though only 2 integer instructions per cycle, which explains why the Itanium2 is not so fast for computer chess; Montecito should change that, but is of course too expensive for us to take seriously) and then lose somewhere around 50 cycles, that's a potential loss of 300 instructions. So in such codes it really matters a lot to make the division cheaper, by losing accuracy.
>
>There are many such examples.
>
>I refer for example to Professor Aad van der Steen (high performance supercomputing) and his report of 1 July 2003, where he reports loss of accuracy with default compiler settings on the Itanium2. See www.sara.nl
>
>>>Major cheating happens in those areas of course; for example with high end processors like the Itanium2, Intel forgot to put a divide instruction on it.
>
>>Design decision. Cray computers had no divide either. That never caused them to be less than the fastest floating point processors of their time...
>
>For your information, Cray is selling Opteron clusters nowadays. Their main processor therefore is the Opteron.

For your information, I am talking about the _Cray_ processor. Not the machines they sell with other processors. The "Cray" architecture, to be specific... If I had been talking about Opterons, or SPARCs, or Alphas, I would have said so. Cray has sold all of those. But they also have their own unique architecture for their "big iron".

>>>So they can do divisions in certain test programs faster by using some approximation algorithm delivering fewer decimals.
>>>
>>>So all those gflops mentioned are basically multiply-add combinations.
>>>
>>>The CELL processor is supposed to deliver 256 gflop single precision; this is however less than 30 gflop double precision.
>>>
>>>In reality software isn't optimal, so it will be less than 30 gflop.
>>>
>>>Still it is impressive for a processor that is supposed to get cheap.
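The divide-by-refinement scheme discussed above (a hardware reciprocal estimate polished in software, as on the Itanium2 and the classic Crays) is typically Newton-Raphson iteration. A minimal sketch, assuming a deliberately crude fixed seed where real hardware would use a table-lookup estimate; the function name and iteration count are just illustrative:

```python
def reciprocal(d, iterations=6, x0=0.1):
    """Approximate 1/d by Newton-Raphson: x <- x * (2 - d*x).
    Each iteration roughly doubles the number of correct bits, which is
    exactly why stopping early trades accuracy for speed.  Converges only
    when the seed satisfies 0 < x0 < 2/d (hardware seeds from a table)."""
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

# A division a / b then becomes a multiply by the refined reciprocal:
approx = 7.0 * reciprocal(3.0)
print(approx)    # very close to 7/3 = 2.333...
```

This also makes the accuracy complaint in the post concrete: a compiler that emits fewer refinement iterations gets a faster "division" that is correct in fewer bits than a true IEEE divide.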
>>>The expensive 1.5GHz Itanium2 delivers for example 7 gflop on paper. That's also just paper. SGI, when presenting results at the 1 July 2003 presentation of the 416 processor 1.3GHz Itanium2 machine, made public that effectively it is 2 times faster in gflops for most applications than the previous 500MHz MIPS R14000.
>>>
>>>On paper the MIPS delivers 1 gflop at 500MHz, and on paper the 1.3GHz Itanium2 delivers 5.2 gflop.
>>>
>>>In practice: 2 times faster, according to SGI.
>>>
>>>NASA initially had a similar report for their own software when running on a 512 processor partition.
>>>
>>>So you have to take all those gflops with some reservation. The reality is that those supercomputers usually idle 70% of the time in the first year, 50% in the second and third year, and when they are outdated in the fourth year they still idle 30%. That is with all reserved time counted as used and all 'system processors' not taken into account. In reality they idle more.
>>>
>>>So many of those supercomputers are paper heroes which the researchers literally use to "run their application faster than it would run on a PC".
>>>
>>>There are very few applications that are optimized to the utmost. Certain matrix calculation libraries are pretty good and pretty optimal for this.
>>>
>>>For those researchers those gflops *really* matter.
>>>
>>>You can count them on one hand.
>>
>>Wrong. You need to get out more. Labs like Livermore and Los Alamos have _thousands_ of carefully hand-optimized programs for specific computer architectures. They care whether a program runs in 2 weeks or 6 weeks. Or longer.
>
>A handful of applications on the planet is well optimized. 99.9% isn't.
>
>Los Alamos is basically using some matrix libraries which are well optimized; see above.

Absolutely wrong, but I'm not going to go into why. Los Alamos has many applications, written in-house, that are millions of lines of code _each_.

>Your 'wrong' doesn't apply at all.
>>>What matters is that they have the POSSIBILITY to run their application really, really fast if they want to, and that is really important.
>>>
>>>This big 12288 processor IBM 'Blue Gene' supercomputer (6 racks of 2048 processors) has a cost price of just 6 million euro.
>>>
>>>That's very little if you consider the huge calculation power it delivers for those researchers it matters to.
>>>
>>>The best usages of those supercomputers are nuclear explosions (I did not say the University of Groningen is running nuclear simulations) and calculating, for example, where the electrons in materials are.
>>>
>>>Usually the homepage promotes 'biological' supercomputing. In reality very little system time goes to medicine and biological research, about 0.5% of system time, according to the supercomputer report for Europe (covering all scientific supercomputers in all of Europe).
>>>
>>>An amazing amount of system time goes to all kinds of weather or extreme climate simulations. They have been calculating, worldwide, so much already about what the height of the sea water will become. I became really sick of that, as I could not test Diep until the world championship itself, because some weather simulation was running nonstop.
>>>
>>>After they had run on 350+ processors for months (450000 CPU hours or so, according to the official project papers), and after they had created a new discovery series from the output showing that the sea water would rise 1 meter in the coming 100 years, they discovered a small bug in the initializing data.
>>>
>>>They had initialized the sea water 1 meter too high when starting the test half a year earlier.
>>>
>>>This was the reason Diep ran buggy in the first 7 rounds of the World Championship 2003.
>>>
>>>Vincent