Computer Chess Club Archives

Subject: Re: TERAS sgi 3800 origin at 1024 cpu

Author: Vincent Diepeveen
Date: 19:29:28 04/12/03

On April 12, 2003 at 02:34:22, Tony Werten wrote:

>On April 11, 2003 at 14:26:57, Vincent Diepeveen wrote:
>
>>On April 11, 2003 at 14:06:23, Keith Evans wrote:
>>
>>>On April 11, 2003 at 07:56:02, Vincent Diepeveen wrote:
>>>
>>>>On April 10, 2003 at 17:14:21, Keith Evans wrote:
>>>>
>>>>>On April 10, 2003 at 15:26:24, Johan Hutting wrote:
>>>>>
>>>>>>On April 10, 2003 at 13:56:29, Keith Evans wrote:
>>>>>>
>>>>>>> Is the 500
>>>>>>>processor box running Windows? Linux?
>>>>>>
>>>>>>Operating system: Irix
>>>>>>(http://www.sara.nl/userinfo/teras/usage/progavail/irix/index.html)
>>>>>>
>>>>>>other stats: http://www.sara.nl/userinfo/teras/description/index.html
>>>>>
>>>>>Vincent are you able to run the same code on an SMP PC and on a 500 CPU CC-NUMA
>>>>>machine?
>>>>
>>>>yes 100% the same code compiles. i do not modify a byte in fact.
>>>>
>>>>the makefile on irix literally has:
>>>>
>>>>-DIRIX
>>>>
>>>>and the windows file doesn't have that one.
>>>>
>>>>in my diep.h:
>>>>
>>>>#if IRIX
>>>>  #define MAXPROCESSES 512
>>>>#else
>>>>  #define MAXPROCESSES 16  // right now, used to be 8 ;)
>>>>#endif
>>>>
>>>>>Curious,
>>>>>Keith
>>>
>>>Cool.
>>>
>>>Pardon my ignorance, but when is the big competition again?
>>>
>>>Will you publish some technical info after the comp? Like what sort of speedup
>>>you're getting from 512 processors?
>>>
>>>Regards,
>>>Keith
>>
>>Everything will be published including logfiles with numbers in it.
>>
>>i never play unfair. everything about the diep speedups will be posted, also in
>>the ICGA journal, with a url where you can find the logfiles (or i'll email them
>>on request).
>>
>>maximum number of addressable cpu's is 508; then some need to be used by other
>>dudes who continuously do stuff. so i hope that isn't 300 cpu's during the
>>tournament, as then only 200 are left to get used.
>>
>>I can only do those 500 processor tests probably during the rounds at world
>>champs 2003.
>>
>>The only thing i cannot guarantee is when the logfiles get posted. It will
>>probably be around or after the world champs. Not before, but directly after for
>>sure.
>>
>>Most likely the logfiles get posted live while it plays. Except perhaps for the
>>game against the junior team, where the logfiles will not be posted live, because
>>i do know what they do if they see a brilliant line from DIEP in the logfile...
>>
>>I would be very happy with around 10% speedup.
>
>I needed to think a little about your numbers. Am I correct in assuming
>10% means: 10% of the max speedup of 500 processors ?

In this example the definition is used that
   1 processor = 1.0

and 500 processors = 50.0 indeed.

Do not forget that this, though very impressive, is perhaps less impressive in
actual tournament practice.

Example: on a single 500MHz R14000 cpu, diep with shared memory gets 20K nps.

So no matter how many millions of nodes a second i get, if that comes down to
10% then effectively it compares to a single cpu doing 1 million nodes a second
(500 x 20K nps x 10%).

DIEP on a dual K7 1.6GHz however already easily gets 130-150K nps.
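
The arithmetic above can be checked with a small C sketch. It is only an illustration under the definition used here (a speedup of 1.0 for one processor, measured in time); the function names are made up for this example and are not taken from DIEP.

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the processor count.
   Hypothetical helper, named for this example only. */
double parallel_efficiency(double speedup, int processors)
{
    assert(processors > 0);
    return speedup / (double)processors;
}

/* Throughput of an imaginary single cpu that would finish the search in
   the same time: per-cpu nodes per second times the time-based speedup. */
double effective_single_cpu_nps(double nps_per_cpu, double speedup)
{
    assert(nps_per_cpu >= 0.0);
    return nps_per_cpu * speedup;
}
```

With the numbers from this post, parallel_efficiency(50.0, 500) comes out at 0.10, and effective_single_cpu_nps(20000.0, 50.0) at 1,000,000 nodes a second.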

So the quality of the supercomputer processor is heavily dominating the actual
impressiveness of this speedup. In general supercomputer processors suck when
you compare them with x86 processors for computerchess. They suck bigtime. There
are good reasons why they suck.

x86 sells for tens of billions. you can design pretty nice cpu's for that money.
Way better ones than for a couple of hundred million.

On the other hand performance of supercomputer cpu's is pretty poor for well
optimized software when we take into account that each highend cpu *may* cost a
lot of money whereas lowend cpu's may not.

Then the vast majority of specint programs (crafty is the happy exception) are
heavily dominated by how fast and big the caches are. For chessprograms it
doesn't matter how limited the cache is on CPUs; the chessprogrammers simply work
long and hard on their software to get close to the IPC of the processor, thereby
taking into account the limited cache. Supercomputer processors seem to be in a
vicious circle here of having big caches. Usually that's needed to avoid getting
penalized by the slow access times to memory (getting a cache line from memory
on a supercomputer, even if it is local memory, usually has a way worse
latency than on an x86 machine, because of all kinds of fancy stuff that's in
between the cpu and the memory). However the big L2 cache that takes care of
that problem also creates a new problem itself, and that is that the cpu cannot
be clocked high.

Now we can discuss till the end of our life, but whatever you toy at home, if
your supercomputer thing is clocked 1Ghz then that will never be as fast as a
2.5ghz opteron of course.

Opteron is the ideal processor in that respect:
  a) it is high clocked (2.5ghz start of 2004) when compared to its competition
  b) it has a very BIG L1 cache (regrettably itanium2 has not)
  c) it has a huge L2 cache (1 MB)

So in short the weak link of all these supercomputers is and always will be
their effective speed when compared to cheap x86 SMP stuff. 2-way opteron is
just like 680$ a piece.


>So Diep searches about 50 x faster on this machine ? Nice ! Is that time to
>solution ?
>
>BTW, how is your branching factor holding ? I assume it's not possible to stay
>below 3, those last few (3-4) plies ?
>
>Tony
>
>>
>>But never forget that's actual speedup. that's with a program which also runs
>>well at a single cpu machine. zugzwang claimed for example 50% speedup.
>>
>>but that was with 512 processors and in total 5 ply searches fullwidth were
>>performed. he calculated speedup based upon number of nodes at a 5 ply fullwidth
>>search.
>>
>>Imagine that i create a bug in diep so that only 1 cpu searches and the rest
>>idle. Then my speedup in number of nodes is of course 100%. But my speedup in
>>time is 1.0 out of 512.
>>
>>They concluded 50% from that, which is very poor if you consider how few
>>nodes a second a single cpu also got with them. I do not see how they could do
>>that. Their way of measuring sucked. Deep Blue guessed with a wet finger that it
>>was 8% without any evidence at around 30 nodes. Not even their output shows
>>number of searches searched.
>>
>>Hyatt on the other hand claims on a shared memory machine 11.x out of 16 cpu's
>>or so. He did not slow down his program 30 times like the zugzwang team did, but
>>did some modification to his search results as i have proven in august 2002. you
>>still can get the data in a table and look for yourself. All he tried to cover
>>up for is probably a factor 2 which he lost somewhere. Still considering the
>>fact that he didn't slow down his thing 30 times, it is a very good speed.
>>
>>Then we have many poor research efforts. Without exception they all managed to
>>get great results, without showing any logfile.
>>
>>DIEP's output however shows everything. In the logfiles all statistics are
>>available: local hashhits, global hashhits, number of nodes needed for each ply
>>and for each mainline, number of searches done, number of failhighaborts (1 cpu
>>aborting possibly other cpu's as well), how many nodes were done during
>>nullmove, etcetera. In short nothing amateurish where by not showing anything
>>you can cover up for having a sucking program.
>>
>>I have no statistics to hide in that respect. To my sponsor NWO i have already
>>written down in the 400 pages or so i wrote to get system time (that doesn't
>>only involve nwo but also 6 other organisations), that it is my intention to
>>publish all logfiles open and fair. No bad science there.
>>
>>The NWO of course is an organisation that appreciates this.
>>
>>Best regards,
>>Vincent
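
The single-busy-cpu example quoted above contrasts a speedup figure derived from node counts with one derived from time. A minimal C sketch of that contrast follows; the function names and the sample numbers in the note below are assumptions for illustration, not measurements or code from DIEP or Zugzwang.

```c
#include <assert.h>

/* Node-based figure: nodes a serial search needs divided by the nodes the
   parallel search needed. 1.0 means no extra nodes were searched. */
double node_ratio(double nodes_serial, double nodes_parallel)
{
    assert(nodes_parallel > 0.0);
    return nodes_serial / nodes_parallel;
}

/* Time-based speedup: serial wall-clock time divided by parallel time. */
double time_speedup(double seconds_serial, double seconds_parallel)
{
    assert(seconds_parallel > 0.0);
    return seconds_serial / seconds_parallel;
}

/* Time-based efficiency: time speedup divided by the processor count. */
double time_efficiency(double seconds_serial, double seconds_parallel,
                       int processors)
{
    assert(processors > 0);
    return time_speedup(seconds_serial, seconds_parallel) / (double)processors;
}
```

If a serial search and a broken 512-cpu run both need the same number of nodes and the same time, node_ratio gives 1.0 (the "100%" above) while time_speedup gives 1.0 out of 512, an efficiency of under 0.2%.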