Computer Chess Club Archives


Subject: Re: Cray and supercomputers (kinda long)

Author: Vincent Diepeveen

Date: 09:11:50 09/20/05


On September 16, 2005 at 11:40:03, Joshua Shriver wrote:

>A friend of mine uses myrinet on his opteron cluster. I've heard many good
>things about it with respect to having a relatively low latency.

Actually Quadrics is a much faster network. Myrinet is the most popular
because the manufacturers earn more on it.

Dolphin is also nice, but like Myrinet it has little cache on the card and
does not allow distributed shared memory.

Quadrics has 64MB on chip, which can be used for distributed shared memory.
Neither Myri nor Dolphin allows this.

Another interconnect that is slowly getting more popular is InfiniBand. It has
really ugly latency though.

To compare latencies: on paper, Myri quotes a one way pingpong latency of
2.7 us for its latest cards. This budget card has a few disadvantages, like ugly
switch latencies. They also don't freely give out information about how slow it
is when switching between threads/processes.

Dolphin shows a one way pingpong latency of around 1.66 us. That's real good.

So for MPI workloads, Dolphin is the fastest card.
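
To put the two quoted figures side by side, here is a small sketch. It uses only the 2.7 us and 1.66 us one way numbers from above; the helper `round_trips_per_second` is a hypothetical name, and it assumes a request/reply costs exactly two one way trips with no switch or software overhead, which real clusters will not achieve.

```python
# One way pingpong latencies quoted in this post, in microseconds.
latency_us = {"myrinet": 2.7, "dolphin": 1.66}

def round_trips_per_second(one_way_us):
    """Upper bound on request/reply pairs per second for one sender.

    Assumption: a round trip is exactly two one way trips, nothing else.
    """
    return 1_000_000 / (2 * one_way_us)

# How many times slower Myrinet is than Dolphin on these paper numbers.
ratio = latency_us["myrinet"] / latency_us["dolphin"]

for name, us in sorted(latency_us.items(), key=lambda kv: kv[1]):
    print(f"{name}: {us} us one way, "
          f"at most {round_trips_per_second(us):,.0f} round trips/s")
print(f"myrinet/dolphin latency ratio: {ratio:.2f}")
```

For a search that exchanges many tiny messages, this ratio matters far more than bandwidth does.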

The advantages of Quadrics are basically when the cluster gets bigger or when
you want to take advantage of its shmem library.

Quadrics scales better to large supercomputers. This is why, for example, a big
supercomputer like the French one, with 8000+ Itanium2 CPUs, uses Quadrics.

However, because it's about $500 more expensive per node than Myrinet, of course
manufacturers deliver the cheap low budget Myri cards.

Please note that relative to the total cost of the cluster, the customer
doesn't feel the network price difference.

If you buy a big cluster, you can choose which network you want.

The manufacturer just earns more on Myri, that's all.

The more expensive Myri cards with a bit more cache (8MB) are $1500 a card,
and still worse than those Dolphin and Quadrics cards.

But when you've got thousands of nodes, then obviously it's better to go
Quadrics, as their level2 and level3 routers are way faster for one way
pingpong latency and in practice also for bandwidth.

The distributed shared memory is very nice, because for good programmers it
speeds up communication a lot. You can simply put an array in RAM where results
get stored. If an entry has a 1, you know there is a new job.

There is no need anymore for MPI calls that check for overflows and so on
each time.
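
A minimal sketch of that flag array idea. It is a single process stand-in, not real distributed shared memory and not the Quadrics shmem API; the names `flags`, `jobs`, `post_job` and `poll` are hypothetical. On real hardware the array would live in memory that remote nodes can write to directly, and the ordering of the payload write and the flag write would need care.

```python
# Hypothetical in-process stand-in for a region of distributed shared memory.
NUM_SLOTS = 4
flags = bytearray(NUM_SLOTS)   # flags[i] == 1 means slot i holds a new job
jobs = [None] * NUM_SLOTS      # payload for each slot

def post_job(slot, job):
    """Writer side: store the payload first, then raise the flag."""
    jobs[slot] = job
    flags[slot] = 1

def poll(slot):
    """Reader side: a plain memory read, no message passing call.

    Returns the job and clears the flag if one is waiting, else None.
    """
    if flags[slot] != 1:
        return None
    job = jobs[slot]
    flags[slot] = 0
    return job

post_job(2, "search e2e4 to depth 12")
first = poll(2)    # picks up the posted job
second = poll(2)   # nothing new, the flag was cleared
idle = poll(0)     # nothing was ever posted to slot 0
```

The point of the design is that checking for work is just a read of local memory, so an idle processor can poll very cheaply.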

>Kind of OT, but I remember reading that you once used crafty or another engine
>on a Cray before. Mind talking about it?

The last Cray with Cray's own processors was a long time ago.

Nowadays Cray mainly sells clusters of Opterons. 1 node == 12 Opteron
processors. The nodes are 1.5 us apart in one way pingpong latency.

>From what I understand about Cray's, they're like a cluster in that they have a
>lot of cpu's all working together instead of one massive processor. Though it
>used some kind of special logic board for connecting them all together so no
>network like ethernet/myrinet/etc between nodes was needed (or something to that
>effect).

Cray's old approach was really nice for its day, but today it's just not relevant.
Today's supercomputers literally use thousands of processors.

>With today's extremely fast processors, it seems the super computing market has
>died down, or at least switched. A couple years ago I went to a seminar at the

Not died down; machines are bigger than ever before. The total volume of money
in that market keeps growing. Yet custom supercomputer chips can no longer
compete against 'pc processors'. Making a really good chip is nearly impossible
if you can only produce about half a million of them in total.

A good example is the Itanium cpu. Though a good cpu when announced, it had a
3 year delay before it was there, and it took another few years before Itanium
became Itanium2.

For Diep, Itanium2 was effectively 3 times faster than Itanium1. A pc processor
already outgunned Itanium bigtime. By the time Itanium2 was there, at 1.5GHz at
most, Opteron was nearly there too. A 1.5GHz Opteron == 1.5GHz Itanium2 for Diep.

>Pittsburgh Super Computing Center and at the time they were almost entirely
>cluster based (Win NT or Linux). I talked with one of the gentlemen in charge
>after one day and was shocked to hear they had a Cray (1? 2?) just sitting in
>storage unused. Apparently it cost over a $1M just for the electricity bill, and
>it required some kind of special coolant that was expensive even for a small
>amount.

Supercomputers in the USA are notorious for being idle.

Europe is a different thing. There are too few of them here.

For example, a single university in the USA has several thousand cpu supercomputers.

In my country we currently have 1 university with a good supercomputer
(www.lofar.nl), though it's a bit unclear which of the two owns the supercomputer
and where it's located. It's a 12288 cpu IBM box delivering nearly 40 Tflop.

Yet that box is only good for nuclear explosions. The IBM stuff has ugly
latencies and 1 cpu is just 700MHz. So it's great for nuclear researchers and a
few others in the north who need matrix calculations. Big shit for the rest.
This computer is shared with researchers from a few other countries for LOFAR.
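
A quick back of the envelope check on those figures, using only the numbers quoted here (12288 cpus, nearly 40 Tflop, 700MHz). Treating "nearly 40" as exactly 40 is an assumption, so the results are upper bounds.

```python
cpus = 12288
total_flops = 40e12   # "nearly 40 Tflop", taken as an upper bound (assumption)
clock_hz = 700e6      # 700MHz per cpu

# Share of the aggregate throughput contributed by one cpu.
flops_per_cpu = total_flops / cpus
# Floating point operations each cpu must complete per clock cycle.
flops_per_cycle = flops_per_cpu / clock_hz

print(f"per cpu: {flops_per_cpu / 1e9:.2f} Gflop")
print(f"per cycle: {flops_per_cycle:.2f} flop")
```

The total only comes from running many floating point operations per clock across thousands of slow cpus, which suits matrix work and does little for a latency bound search like chess.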

The country's supercomputer organisations together have a 416 cpu Altix3000, a
1024 cpu 500MHz Origin3800 (which gets shredded at the start of 2006), and a
stupid P4 Xeon cluster with InfiniBand of a couple of hundred nodes (bought at
the end of 2004, talking about bad choices by my government).

So basically that P4 cluster and a 416 cpu Altix3000 must serve the entire
nation of 16 million inhabitants. The many organisations that guard these
supercomputers are very well paid.

The Netherlands is doing very well in the supercomputer area when compared to
other small countries in Europe.

However, if you compare this to what the average university in the USA has, it's
major shit.

The reason for all this is budget. The entire university/college system in the
Netherlands serves a LOT of students, yet the total budget it does that with is
less than what 1 big university in the USA has as its budget.

If you look at it from that viewpoint, things suddenly make sense.

>So compared to super fast mainframes or supercomputers... it seems clusters
>give you the more bang for the buck. No special hardware needed, and cheaper
>costs.
>But what does this mean for computer chess?

A lot of challenges to get software to work on a cluster. However, you will
understand that I simply cannot get enough system time to test on a cluster and
use it regularly.

At home I just have a 2 node cluster.

>If money wasn't an issue, what would really be ideal hardware for the best
>computer machine?

A Quadrics network just like the French supercomputer has, with each node an
8 cpu dual core Opteron, and plenty of test time.

The real problem is testing at those boxes.

I could get a cluster for the world champs 2005, but without enough test time
that is not relevant.

However, I could test on the quad Opteron dual core from www.hotels.nl, so I
took their offer and was very happy with it.

>Top of the line IBM mainframe?  200 node Dual core-dual opteron cluster?
>Assuming that the code could utilize the hardware.

IBM is mainly selling Myrinet, as they earn more on it. Their top boxes, the ones
visible in the number 1 position at www.top500.org, are gflop boxes with 700MHz
cpus, and their latencies are ugly for computer chess. So those boxes are
worthless for chess.

Please note the ugliest box is what Hydra is using: a P4 dual Xeon box with
Myrinet. Probably with $500 cards inside it, as the sheikh didn't know the
difference between the cheapo cards and a good network.

Much has been written about the Altix box, of course. Up to about 32 cpus it is
a great box. At 64 cpus and above the one way pingpong latency is 3-4 us.

At 64 cpus or more its one way pingpong latency is actually a lot slower than
that of Quadrics. There is something really wrong in the design of those boxes,
and there always was with SGI. There is no difference in one way pingpong
latency between numalink3 and numalink4. Just a bandwidth difference on paper.

Of course, like the French supercomputer, you'll need 2 Quadrics cards in each
node to really perform well.

I didn't say much about Cray yet, as today's boxes from them are built upon 12
cpu Opteron nodes, and those 12 Opterons are already configured as a cluster.

So that's worthless for chess because 1 node is too slow.

So you see the ideal box for computer chess doesn't exist, simply because it's
cheaper to make clusters from dual Xeons. Scientists know shit about which cpu
is faster, a Xeon or an Opteron. They have no clue, really.

Actually, a supercomputer report I have from my own government, which
systematically lists all good cpus, suggests that the Opteron has MMX technology
(and not SSE2).

That's why they took a P4 for clustering of course, as it has SSE2.

Vincent

>Sorry for the ramble :) just something on my mind.
>
>Josh
>
>>
>>Yes.  gigabit ethernet is high bandwidth, but still long latency.  We have a new
>>128 node dual xeon cluster in the department using myrinet, which is lower
>>latency.  Our old cLAN switch was the lowest latency I have ever seen, but it
>>was pricey as all hell...  It was also not TCP/IP based...



Last modified: Thu, 15 Apr 21 08:11:13 -0700
