Computer Chess Club Archives


Subject: Re: Correction hydra hardware

Author: Robert Hyatt

Date: 18:53:33 02/01/05

On February 01, 2005 at 17:04:42, Matthew Hull wrote:

>On February 01, 2005 at 16:28:22, Robert Hyatt wrote:
>
>>On February 01, 2005 at 13:26:04, Vincent Diepeveen wrote:
>>
>>>On February 01, 2005 at 12:59:31, Robert Hyatt wrote:
>>>
>>>>On February 01, 2005 at 10:59:18, Vincent Diepeveen wrote:
>>>>
>>>>>On February 01, 2005 at 00:56:17, Robert Hyatt wrote:
>>>>>
>>>>>>On January 31, 2005 at 13:35:02, Vincent Diepeveen wrote:
>>>>>>
>>>>>>>On January 31, 2005 at 13:03:43, Robert Hyatt wrote:
>>>>>>>
>>>>>>>>On January 31, 2005 at 10:14:28, Vincent Diepeveen wrote:
>>>>>>>>
>>>>>>>>>On January 31, 2005 at 10:01:16, Vincent Lejeune wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>news from 28/01/05 (more to come)
>>>>>>>>>>
>>>>>>>>>>http://hydrachess.com/hydra-scylla.html
>>>>>>>>>>
>>>>>>>>>>32 nodes (the previous version had 16), no information about CPU power or FPGA
>>>>>>>>>>cards yet...
>>>>>>>>>
>>>>>>>>>It's 4 nodes.
>>>>>>>>>
>>>>>>>>>1 node = 8 processor Xeon.
>>>>>>>>>
>>>>>>>>>The FPGA cards would get double the speed, so they must run somewhere
>>>>>>>>>between 30 MHz and 60 MHz. They only use development FPGA cards, so they
>>>>>>>>>never use the real power of FPGAs (which is printing your own processor,
>>>>>>>>>which can easily run at 600 MHz or more). They stick to development cards
>>>>>>>>>for reasons unknown to me.
>>>>>>>>>
>>>>>>>>>CPU power is not interesting at all, of course; the cards do the work.
>>>>>>>>>
>>>>>>>>>Vincent
>>>>>>>>
>>>>>>>>I hope not.  The old machine used 8 boxes with 2 cpus per box.  Going to
>>>>>>>>8-way xeons is a performance killer.  The PCI bus just can't keep up.
>>>>>>>
>>>>>>>OK, neither the sheikh nor his right hand knew the architectural details very
>>>>>>>well, which we will forgive them.
>>>>>>>
>>>>>>>I have accurate information now.
>>>>>>>
>>>>>>>It is a 32-node system, with each node a dual. However, they have only 32 FPGA
>>>>>>>cards, because when ordering, the person filling in the form (and I am NOT
>>>>>>>going to post who it was, but it was NOT the sheikh) confused nodes with cpu's.
>>>>>>>
>>>>>>>So they now have a mighty 32-node Myrinet cluster with 64 processors, but only
>>>>>>>32 cards, so they effectively run 64 processors while the 32 cards do the job.
>>>>>>>The cards run at 55 MHz.
>>>>>>>
>>>>>>>Please note that the PCI bus isn't the problem. They are using PCI-X.
>>>>>>
>>>>>>PCI-X falls flat if you have 8 cpus in a single box.  I have run on such
>>>>>
>>>>>It's not the PCI-X which is the problem at all.
>>>>>
>>>>>It's simply that it is tough programming to get it to work.
>>>>
>>>>No it isn't.  One bus to memory, with 8 processors hanging on that bus trying
>>>>to get to memory.  They get in each other's way, and that is why bus
>>>>architectures don't scale very well beyond 4 cpus.  Even going to 4 requires
>>>>4-way memory interleaving to keep up, and they don't go to 8-way interleaving
>>>>on the Dell-type boxes.  Others do, but they have a price that shows it...
>>>>
>>>>For 8 and up, a crossbar is really the right way to go, if the price can be
>>>>handled.  Otherwise a NUMA-type approach like the AMD solution is most
>>>>affordable.
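>>>>
>>>>A minimal sketch of that contention effect (assuming Linux and pthreads; the
>>>>buffer size and pass count are arbitrary): each thread streams through its own
>>>>out-of-cache buffer, and on a shared-bus box the aggregate rate stops growing
>>>>well before 8 threads, while a NUMA box keeps scaling because each node brings
>>>>its own memory controller.
>>>>
>>>>  /* cc -O2 -pthread stream.c; run with the thread count as the argument */
>>>>  #include <pthread.h>
>>>>  #include <stdio.h>
>>>>  #include <stdlib.h>
>>>>  #include <string.h>
>>>>  #include <time.h>
>>>>
>>>>  #define WORDS (1 << 24)          /* 128 MB per thread, far out of cache */
>>>>
>>>>  static void *stream(void *arg) {
>>>>      long *buf = malloc(WORDS * sizeof(long));
>>>>      volatile long sum = 0;
>>>>      memset(buf, 1, WORDS * sizeof(long));   /* fault the pages in */
>>>>      for (int pass = 0; pass < 4; pass++)    /* pure read traffic  */
>>>>          for (long i = 0; i < WORDS; i++)
>>>>              sum += buf[i];
>>>>      free(buf);
>>>>      return NULL;
>>>>  }
>>>>
>>>>  int main(int argc, char **argv) {
>>>>      int n = argc > 1 ? atoi(argv[1]) : 1;
>>>>      pthread_t t[16];
>>>>      struct timespec a, b;
>>>>      if (n < 1) n = 1;
>>>>      if (n > 16) n = 16;
>>>>      clock_gettime(CLOCK_MONOTONIC, &a);
>>>>      for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, stream, NULL);
>>>>      for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
>>>>      clock_gettime(CLOCK_MONOTONIC, &b);
>>>>      double s = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
>>>>      /* approximate aggregate read bandwidth, ignoring the memset pass */
>>>>      printf("%d threads: %.0f MB/s aggregate\n",
>>>>             n, n * 4.0 * WORDS * sizeof(long) / s / 1e6);
>>>>      return 0;
>>>>  }
>>>>
>>>>If the 8-thread number is only ~1.5x the 4-thread number, that is the bus, not
>>>>the program.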
>>>>
>>>>>
>>>>>You simply must limit remote reads and writes to the absolute minimum.
>>>>>
>>>>>Multithreading, forget it.
>>>>
>>>>Multi-threading or multiple processes is not the issue here.  What you can do
>>>>with one, you can do with the other.  One just offers easier-to-use features for
>>>>certain applications.
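>>>>
>>>>For instance (a minimal sketch, assuming Linux; the "hash table" is just a
>>>>stand-in): the same shared table you would get implicitly with threads can be
>>>>had across processes with a single mmap() call before the fork().
>>>>
>>>>  #include <stdio.h>
>>>>  #include <string.h>
>>>>  #include <sys/mman.h>
>>>>  #include <sys/wait.h>
>>>>  #include <unistd.h>
>>>>
>>>>  #define ENTRIES (1 << 20)
>>>>
>>>>  int main(void) {
>>>>      /* one region shared by the parent and every forked child */
>>>>      unsigned long *hash = mmap(NULL, ENTRIES * sizeof(unsigned long),
>>>>                                 PROT_READ | PROT_WRITE,
>>>>                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>>      memset(hash, 0, ENTRIES * sizeof(unsigned long));
>>>>
>>>>      if (fork() == 0) {             /* child: one "search process" */
>>>>          hash[42] = 0xdeadbeefUL;   /* store a (fake) hash entry   */
>>>>          _exit(0);
>>>>      }
>>>>      wait(NULL);
>>>>      /* the parent sees the child's store, exactly as a thread would */
>>>>      printf("entry 42 = %lx\n", hash[42]);
>>>>      return 0;
>>>>  }
>>>>
>>>>With pthreads the same sharing comes from an ordinary malloc(); the probing
>>>>and locking code is identical either way, which is the point.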
>>>>
>>>>
>>>>>
>>>>>>machines.  PCI does pretty well on 4-way systems, but on 8-way, the overall gain
>>>>>
>>>>>PCI has at least 4 times the latency of PCI-X.
>>>>>
>>>>>PCI-X can easily give 1 us if the network card is fast enough.
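>>>>>
>>>>>The standard way to measure that number (a minimal sketch, assuming MPI runs
>>>>>over the card; the repeat count is arbitrary) is a ping-pong between two
>>>>>ranks, taking half the round-trip time:
>>>>>
>>>>>  /* mpicc pingpong.c && mpirun -np 2 ./a.out */
>>>>>  #include <mpi.h>
>>>>>  #include <stdio.h>
>>>>>
>>>>>  int main(int argc, char **argv) {
>>>>>      int rank;
>>>>>      char byte = 0;
>>>>>      const int reps = 100000;
>>>>>
>>>>>      MPI_Init(&argc, &argv);
>>>>>      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>      double t0 = MPI_Wtime();
>>>>>      for (int i = 0; i < reps; i++) {
>>>>>          if (rank == 0) {          /* send, then wait for the echo */
>>>>>              MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>>>>>              MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
>>>>>                       MPI_STATUS_IGNORE);
>>>>>          } else if (rank == 1) {   /* echo it straight back */
>>>>>              MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
>>>>>                       MPI_STATUS_IGNORE);
>>>>>              MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>>>>>          }
>>>>>      }
>>>>>      if (rank == 0)   /* one-way latency = half the round trip */
>>>>>          printf("%.2f us one-way\n",
>>>>>                 (MPI_Wtime() - t0) / reps / 2.0 * 1e6);
>>>>>      MPI_Finalize();
>>>>>      return 0;
>>>>>  }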
>>>>>
>>>>
>>>>We are not talking about networks.  We are talking (at least I am talking) about
>>>>SMP-type boxes with 8 cpus in a single chassis, like the 8-cpu xeon Dell sells,
>>>>or like the 4-cpu xeon boxes I have here...
>>>
>>>I'm not talking about 8-way smp boxes. Those are more expensive than a
>>>32-processor cluster once you put the latest processors inside.
>>>
>>
>>Quote on:
>>------------------------------------------
>>
>>It's 4 nodes.
>>
>>1 node = 8 processor Xeon.
>>------------------------------------------
>>quote off
>>
>>"one node = 8 processor xeon".  That is _clearly_ SMP.  And _that_ is the box I
>>was talking about.  I've run on more than one of them.  For chess they are bad,
>>running 2 copies of crafty produces 2x NPS of one copy.  Running 4 copies
>>produces nearly 4x the NPS of one copy.  Running 8 copes produces barely 6x the
>>1 copy speed.
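>>
>>(Worked out: 6/8 = 75% per-processor throughput at 8 cpus, versus essentially
>>100% at 2 and 4; that is the shared memory bus saturating.)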
>
>
>Would this be considered a NUMA relationship between the two 4-way nodes?  Does
>your NUMA crafty handle this better than your older SMP-only crafty?
>


It wasn't NUMA.  It used the Intel "Profusion" chipset to tie two 4-way
processor groups together to provide a total of 8, but there was a bottleneck.
It was one of those "let's see what it will do" sort of things.  Processors
with big L2 caches could provide good performance if memory requirements were
limited.  I'll take a NUMA opteron system any day.



>
>
>>
>>_that_ was the PCI bottleneck I was talking about.
>>
>>
>>>>
>>>>
>>>>>In practice even cheapo Myrinet gives 2.7 us.
>>>>>
>>>>>>seems to be 1.5x a 4-way which is not that great.  If you run a program that
>>>>>>runs out of cache quickly, this drops even further.
>>>>>
>>>>>You can't run SMP programs over networks, obviously.
>>>>
>>>>That's why I wasn't talking about networks.  You originally said this machine
>>>>(New Hydra) has a node with 8 processors.  That is what I am talking about.
>>>
>>>Perhaps in the future, read the subject.
>>>
>>
>>
>>Perhaps in the future you should remember what you wrote and what I responded
>>to?  I haven't changed the subject a single time in my posts in this thread.
>>
>>
>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>The latency of a one-way ping-pong is around 2.7 us with Myrinet. That
>>>>>>>excludes the router cost, which I guess will also be about 1 us for random
>>>>>>>data traffic (up to 35 ns for bandwidth traffic).
>>>>>>>
>>>>>>>Vincent
>>>>>>
>>>>>>
>>>>>>All depends.  We have myrinet here and are probably going to use that in our new
>>>>>>opteron cluster when we buy it after the dual-core opterons start shipping in
>>>>>>quantity...
>>>>>
>>>>>For chess, Myrinet sucks, to say it very politely, because it doesn't allow
>>>>>DSM (distributed shared memory).
>>>>>
>>>>>For just a few dollars more you can get Quadrics or Dolphin, which have better
>>>>>latencies (Dolphin: 1 us) and allow distributed shared memory.
>>>>>
>>>>>The real major problem with Myrinet is that the receiving process must receive
>>>>>and process the messages non-stop. So you must hand-place some kind of polling
>>>>>inside the search process to do just that.
>>>>>
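>>>>>What that hand-placed polling looks like in practice (a minimal sketch,
>>>>>assuming MPI-style message passing; handle_message() is a hypothetical
>>>>>dispatch routine): every few thousand nodes the search breaks off and checks
>>>>>for incoming messages.
>>>>>
>>>>>  #include <mpi.h>
>>>>>
>>>>>  extern void handle_message(MPI_Status *st);   /* hypothetical */
>>>>>
>>>>>  /* called every few thousand nodes from inside the search */
>>>>>  void poll_network(void) {
>>>>>      int pending;
>>>>>      MPI_Status st;
>>>>>
>>>>>      /* non-blocking check: has anything arrived for us? */
>>>>>      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &pending, &st);
>>>>>      while (pending) {
>>>>>          handle_message(&st);   /* consume it, then look again */
>>>>>          MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &pending, &st);
>>>>>      }
>>>>>  }
>>>>>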
>>>>>With DSM your processes don't feel any of that.
>>>>>
>>>>>An 8-node Quadrics network is 13,095 dollars. That includes everything.
>>>>>
>>>>>Quadrics is used in the fastest supercomputers, like the nuclear supercomputer
>>>>>France ordered a while ago. It scales far better than Myrinet once you go
>>>>>above those 8 nodes.
>>>>>
>>>>>For chess, using the DSM features in a program is not so trivial, but it is
>>>>>pretty easy compared to the task of parallelizing a product.
>>>>>
>>>>>Vincent
>>>>
>>>>
>>>>Just remember that most supercomputer applications don't care about latency,
>>>
>>>Just remember that I don't build a cluster for a matrix calculation, but for
>>>DIEP :)
>>>
>>>In which case myrinet sucks :)
>>
>>
>>That's fine.  But "supercomputers" are not built to your specifications.  They
>>are built to address the programming requirements of the large numerical
>>systems being run on them, and they do that well.  Your car wasn't designed to
>>fly either; you'd have a hell of a time getting it to do so, although a lowly
>>Piper Cub can, since it was designed to do so...
>>
>>
>>>
>>>>they care about bandwidth.  Large applications are all about streaming data;
>>>>the latency for the first word is not important when several million words are
>>>>going to follow back-to-back.  All that matters to the big applications is how
>>>>frequently the next word arrives; the latency for the first word gets buried
>>>>in the cost of transferring the remaining millions of words.  That's what
>>>>makes vector computers so powerful for the right kinds of applications, as
>>>>opposed to these "toy supercomputers" that just use lots of general-purpose
>>>>processors and sloppy interconnections.
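>>>>
>>>>To put rough numbers on that (using the 2.7 us figure quoted earlier and
>>>>assuming, say, 200 MB/s of sustained bandwidth, a made-up but plausible
>>>>figure): streaming 10^6 eight-byte words moves 8 MB, which takes
>>>>8e6 / 2e8 = 40 ms of transfer time against 2.7 us of startup latency, so the
>>>>latency is well under 0.01% of the total.  A chess program sending tiny split
>>>>and result messages has exactly the opposite proportions.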
>>>
>>>The majority of supercomputers are not vector computers. Latency is important
>>>on supercomputers too. The majority of jobs run on supercomputers use 4-8
>>>processors and eat half the system time each year. The other half is matrix
>>>calculations that could trivially run on cheapo clusters.
>>>
>>>Only a few applications are really optimized, and I wonder why they don't run
>>>those on clusters but instead use very expensive SGI-type hardware for them,
>>>which delivers very few flops per dollar.
>>>
>>>Vincent
>>
>>Because a cluster can't offer 1/100th the total memory bandwidth of a big Cray
>>vector box.


