Computer Chess Club Archives



Subject: Re: Chess pc program on super computer

Author: Vincent Diepeveen

Date: 05:21:23 08/05/05



On August 04, 2005 at 10:48:05, Robert Hyatt wrote:

>On August 04, 2005 at 08:16:24, Vincent Diepeveen wrote:
>
>>On August 04, 2005 at 02:50:32, Mimic wrote:
>>
>>>On August 04, 2005 at 02:37:20, Mark jones wrote:
>>>
>>>>Can you imagine how Junior, Shredder, or Fritz would have played if they
>>>>were deployed on a supercomputer like this:
>>>>http://www.top500.org/sublist/System.php?id=7605
>>>>
>>>>If this were possible, not only would it kill all the humans, I think it
>>>>would have crushed Hydra too...
>>>>What do you think about it? And has there been an attempt to deploy a PC
>>>>program on a supercomputer?
>>>
>>>
>>>How many Rpeak (GFlops) or Rmax (GFlops) does a normal personal computer deliver?
>>
>>Opteron delivers 2 flops per cycle per core.
>>
>>Without using a calculator:
>>So a 2.2 GHz dual-core Opteron delivers 2.2 * 2 * 2 = 8.8 gflop.
>>So a quad (four-socket) Opteron delivers 4 * 8.8 = 35.2 gflop.
>>
>>However, the comparison is not fair. IBM always quotes single-precision
>>calculations, whereas the majority of researchers use double-precision
>>floating point.
>
>I do not believe that is true.  I'm not aware of _any_ floating point hardware

Please check online sources: the 256 gflop quoted for the Cell processor is
based on single-precision floating point.

>today that does 32 bit IEEE math.  Every processor I have seen does internal
>calculations in 64 (actually 80) bits of precision.  From the early IBM RS 6000

On x87 floating-point units it is 80 bits, for example. That will not be
around forever, though. The majority of the gflops of chips such as the
Itanium2 do not even follow the ANSI C specification.

See my postings in the hardware groups that demonstrate this.

For example, the Intel C++ compiler has a special flag which you have to give
it in order to calculate floating point more precisely than the default
'shortcuts' it takes.
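
To make that concrete, here is a minimal C sketch (my own example, not from
any manual): the same sum can give different answers depending on whether the
intermediate stays in an 80-bit x87 register or gets rounded to a 64-bit
double after every operation.

  #include <stdio.h>

  int main(void)
  {
      double a = 1e16, b = 1.0, c = -1e16;

      /* volatile forces the intermediate sum out to a 64-bit double */
      volatile double t = a + b;   /* 1e16 + 1 rounds back to 1e16 */

      printf("rounded each step: %g\n", t + c);     /* prints 0 */
      printf("one expression   : %g\n", a + b + c); /* may print 1 with
                                                       80-bit x87 temps,
                                                       0 with SSE2 math */
      return 0;
  }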

>this has been true.  I once spent a couple of days trying to understand why
>using REAL*4 vs REAL*8 in a FORTRAN program made _zero_ difference in how fast

Yeah, you're still living in the '80s.

To wake you up: nowadays we have things like SSE and SSE2, and also 3DNow!.

For example, your favourite Opteron hardware has 3DNow!, which splits each
64-bit MMX register into 2 single-precision floating-point values.

See for example the "AMD64 Architecture Programmer's Manual, Volume 1:
Application Programming", page 270, section 5.3.6, "Floating-Point 3DNow!".

It has a clear diagram illustrating how vector operations work on the PC
nowadays, and why, even executing just 1 such instruction per cycle with a
2-cycle latency, they can still reach 2 gflop single precision.
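
For those without the manual at hand, here is a minimal sketch of the same
vector idea written with SSE intrinsics (SSE rather than 3DNow!, since the
intrinsics are more portable: 4 singles per 128-bit register instead of 2 per
64-bit MMX register; this is my illustration, not AMD's diagram):

  #include <stdio.h>
  #include <xmmintrin.h>

  int main(void)
  {
      float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
      float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
      float r[4];

      __m128 va = _mm_loadu_ps(a);     /* load 4 packed singles */
      __m128 vb = _mm_loadu_ps(b);
      __m128 vr = _mm_mul_ps(va, vb);  /* 4 multiplies, 1 instruction */
      _mm_storeu_ps(r, vr);

      printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
      return 0;
  }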

>it ran, where on an older IBM /370 it made a significant difference since that
>box actually had 32 bit and 64 bit hardware.

A 64x64-bit multiplication on the Opteron eats 2 cycles and gives a 128-bit
result. With SSE2 you cannot reach that accuracy, but you can do 2 flops per
cycle.

For some prime-searching software of mine, the 64x64-bit multiplication
delivering 128 bits is more useful than 2 multiplications of 64x64 bits
delivering 64.

So there is a mismatch: more bits would be more useful, but SSE2 just doesn't
deliver them.
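
A sketch of that full-width multiply in C (unsigned __int128 is a GCC/Clang
extension; on the Opteron the compiler turns this into the single MUL that
leaves the high 64 bits in RDX):

  #include <stdint.h>
  #include <stdio.h>

  /* full 64x64 -> 128-bit product, split into high and low halves */
  static void mul_64x64_128(uint64_t a, uint64_t b,
                            uint64_t *hi, uint64_t *lo)
  {
      unsigned __int128 p = (unsigned __int128)a * b;
      *lo = (uint64_t)p;
      *hi = (uint64_t)(p >> 64);
  }

  int main(void)
  {
      uint64_t hi, lo;
      mul_64x64_128(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL, &hi, &lo);
      printf("hi=%016llx lo=%016llx\n",
             (unsigned long long)hi, (unsigned long long)lo);
      return 0;
  }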

>>
>>If you want to know the exact definition of a double-precision floating
>>point, look in the ANSI C definitions.
>>
>>In reality the researchers assume a 64-bit double times a 64-bit double
>>delivering a 64-bit double.
>>
>>In reality single precision is less than 32 bits times 32 bits delivering
>>less than 32 bits worth of information.
>
>Why less than?  You lose exponent bits in either length, but the exponent is not
>"lost information"...

For the ANSI C specification of how precise floating point must be, see the
ANSI C definitions, page 455, '7. Library, Annex E'.

A double just has to represent something close to 1.000000001 * 10^37, which
fits in fewer than 64 bits: a mantissa good for 10 decimal digits and a
power-of-10 exponent up to 37.

I might be a few bits off, but not many:
  10 decimal digits  = ~33 bits
  exponent -37..37   = ~9 bits

With 42-43 bits you already get very far.

So double (the 8-byte type) has no need for 80 bits of precision.
Note that long double has the same minimum definition.
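
You can check those minimum guarantees yourself from <float.h>; C89 only
demands DBL_DIG >= 10 and DBL_MAX_10_EXP >= +37, exactly the numbers above:

  #include <float.h>
  #include <stdio.h>

  int main(void)
  {
      printf("DBL_DIG        = %d\n", DBL_DIG);        /* >= 10 required */
      printf("DBL_MAX_10_EXP = %d\n", DBL_MAX_10_EXP); /* >= +37 required */
      printf("DBL_MIN_10_EXP = %d\n", DBL_MIN_10_EXP); /* <= -37 required */
      printf("LDBL_DIG       = %d\n", LDBL_DIG);       /* same minimum */
      return 0;
  }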

To give one trivial example of how you can be faster with fewer bits:

Most of those high-end chips do not have a division instruction (for example
the Itanium2) and therefore, for divisions, they try to get away with less
accuracy than ANSI C requires.

Of course that speeds up software big time, because otherwise such a division
approximation eats something like 46 cycles.

The Itanium2 has a hardware instruction that produces an approximation of 1/x
and keeps the data within registers. With an iterative algorithm they can then
refine that into a division: the longer the algorithm runs, the more accurate
the approximation gets. A sketch of the idea is below.
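
Here is that refinement idea in C (the magic-constant seed below is my rough
software stand-in for the hardware 1/x approximation and works for ordinary
positive inputs; it is not Intel's actual frcpa). Each Newton-Raphson step
x' = x * (2 - d*x) roughly doubles the number of correct bits:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* crude seed for 1/d: negate the exponent via the bit pattern */
  static double recip_seed(double d)
  {
      uint64_t bits;
      memcpy(&bits, &d, sizeof bits);
      bits = 0x7FDE000000000000ULL - bits;
      memcpy(&d, &bits, sizeof d);
      return d;
  }

  static double recip(double d, int steps)
  {
      double x = recip_seed(d);
      int i;
      for (i = 0; i < steps; i++)
          x = x * (2.0 - d * x);  /* Newton-Raphson refinement */
      return x;
  }

  int main(void)
  {
      int s;
      for (s = 0; s <= 4; s++)
          printf("%d steps: 1/7 ~= %.17g\n", s, recip(7.0, s));
      return 0;
  }

After 4 steps the result is correct to nearly full double precision; stopping
earlier is exactly the accuracy-for-speed trade discussed above.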

If you realize that in theory it can execute 6 instructions per cycle (though
only 2 integer instructions per cycle, which explains why the Itanium2 is not
so fast for computer chess; Montecito should change that, but of course it is
too expensive for us to take seriously), then losing around 50 cycles is a
potential loss of 300 instructions. So in such codes it really matters a lot
to make the division far cheaper by giving up accuracy.

There are many such examples.

I refer for example to Professor Aad van der Steen (high-performance
supercomputing) and his report of 1 July 2003, where he reports loss of
accuracy with default compiler settings on the Itanium2. See www.sara.nl

>
>>
>>Major cheating happens in those areas of course; for example, on high-end
>>processors like the Itanium2, Intel forgot to put a divide instruction.

>Design decision.  Cray computers had no divide either.  Never caused them to be
>less than the fastest floating point processor of their time...

For your information, Cray is selling Opteron clusters nowadays. Their main
processor therefore is the Opteron.

>>
>>So they can do divisions faster in certain test programs by using some
>>approximation algorithm delivering fewer decimals.
>>
>>So all those gflops mentioned are basically multiply-add combinations.
>>
>>The Cell processor is supposed to deliver 256 gflop single precision; this
>>is however less than 30 gflop double precision.
>>
>>In reality software isn't optimal, so it will be less than 30 gflop.
>>
>>Still, it is impressive for a processor that is supposed to get cheap.
>>
>>The expensive 1.5 GHz Itanium2 delivers for example 7 gflop on paper. That
>>is also just paper. SGI, when presenting results at the 1 July 2003
>>presentation of the 416-processor 1.3 GHz Itanium2 machine, made public
>>there that effectively it is 2 times faster in gflops for most applications
>>than the previous 500 MHz MIPS R14000.
>>
>>On paper the MIPS delivers 1 gflop at 500 MHz and on paper the 1.3 GHz
>>Itanium2 delivers 5.2 gflop.
>>
>>In practice: 2 times faster, according to SGI.
>>
>>NASA initially had a similar report for their own software when running on
>>a 512-processor partition.
>>
>>So all those gflops you have to take with some reservation. The reality is
>>those supercomputers usually idle 70% in the first year, 50% in the second
>>and third year, and when they are outdated in the fourth year they idle 30%.
>>That is with all reserved time counted as used and all 'system processors'
>>not taken into account. In reality they idle more.
>>
>>So many of those supercomputers are paper heroes which the researchers
>>literally use to "run their application faster than it would run on a PC".
>>
>>There are very few applications that are optimized to the utmost. Certain
>>matrix-calculation libraries are pretty good and pretty close to optimal.
>>
>>For those researchers, those gflops *really* matter.
>>
>>You can count them on one hand.
>
>Wrong.  You need to get out more.  Labs like Livermore and Los Alamos have
>_thousands_ of carefully hand-optimized programs for specific computer
>architectures.  They care whether a program runs in 2 weeks or 6 weeks.  Or
>longer.

A handful of applications on the planet is well optimized; 99.9% is not.

Los Alamos is basically using some matrix libraries which are well optimized;
see above.

Your 'wrong' doesn't apply at all.

>
>>
>>What matters is that they have the POSSIBILITY to run their application
>>really, really fast if they want to, and that is really important.
>>
>>These big 12288-processor IBM 'Blue Gene' supercomputers (6 racks of 2048
>>processors) have a cost price of just 6 million euro.
>>
>>That is very little if you consider the huge calculation power it delivers
>>for those researchers it matters to.
>>
>>The best usages of those supercomputers are nuclear explosions (I did not
>>say the University of Groningen is running nuclear simulations) and
>>calculating, for example, where the electrons in materials are.
>>
>>Usually the homepage advertises 'biological' supercomputing. In reality only
>>very little system time goes to medicine and biological research, about 0.5%
>>of system time, according to the European supercomputer report (which covers
>>all scientific supercomputers in all of Europe).
>>
>>An amazing amount of system time goes to all kinds of weather and extreme
>>climate simulations. Worldwide they have already calculated so very much
>>about what the height of the sea water will become. I got really sick of
>>that, as I could not test Diep until the world championship itself, because
>>some weather simulation was running nonstop.
>>
>>After they had run on 350+ processors for months (450,000 CPU hours or so,
>>according to the official project papers), and after they had created a new
>>discovery series from the output saying the sea water would rise 1 meter in
>>the coming 100 years, they discovered a small bug in the initialization
>>data.
>>
>>They had initialized the sea water 1 meter too high when starting the test
>>half a year earlier.
>>
>>This was the reason Diep ran buggy in the first 7 rounds of the 2003 world
>>championship.
>>
>>Vincent


