Computer Chess Club Archives



Subject: Re: Correction hydra hardware

Author: Matthew Hull

Date: 14:04:42 02/01/05



On February 01, 2005 at 16:28:22, Robert Hyatt wrote:

>On February 01, 2005 at 13:26:04, Vincent Diepeveen wrote:
>
>>On February 01, 2005 at 12:59:31, Robert Hyatt wrote:
>>
>>>On February 01, 2005 at 10:59:18, Vincent Diepeveen wrote:
>>>
>>>>On February 01, 2005 at 00:56:17, Robert Hyatt wrote:
>>>>
>>>>>On January 31, 2005 at 13:35:02, Vincent Diepeveen wrote:
>>>>>
>>>>>>On January 31, 2005 at 13:03:43, Robert Hyatt wrote:
>>>>>>
>>>>>>>On January 31, 2005 at 10:14:28, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>On January 31, 2005 at 10:01:16, Vincent Lejeune wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>news from 28/01/05 (more to come)
>>>>>>>>>
>>>>>>>>>http://hydrachess.com/hydra-scylla.html
>>>>>>>>>
>>>>>>>>>32 nodes (the previous version had 16), no information about CPU power and FPGA
>>>>>>>>>cards yet ...
>>>>>>>>
>>>>>>>>It's 4 nodes.
>>>>>>>>
>>>>>>>>1 node = 8 processor Xeon.
>>>>>>>>
>>>>>>>>The FPGA cards would get double the speed, so they must run between 30 MHz and
>>>>>>>>60 MHz. They only use FPGA development cards, so they never use the real power
>>>>>>>>of FPGAs (which is printing your own processor, which can easily run at 600 MHz
>>>>>>>>or more). Why they stick to development cards is unknown to me.
>>>>>>>>
>>>>>>>>CPU power is not interesting at all, of course; the cards do the work.
>>>>>>>>
>>>>>>>>Vincent
>>>>>>>
>>>>>>>I hope not.  The old machine used 8 boxes with 2 CPUs per box.  Going to 8-way
>>>>>>>Xeons is a performance killer.  The PCI bus just can't keep up.
>>>>>>
>>>>>>OK, neither the sheikh nor his right hand knew the architectural details very
>>>>>>well, which we will forgive them.
>>>>>>
>>>>>>I have accurate information now.
>>>>>>
>>>>>>It is a 32-node system, with each node a dual. They have only 32 FPGA cards,
>>>>>>however, because when ordering, the person filling in the form (and I am NOT
>>>>>>going to post who it was, but it was NOT the sheikh) confused nodes with CPUs.
>>>>>>
>>>>>>So they now have a mighty 32-node Myrinet cluster with 64 processors, but only
>>>>>>32 cards; effectively they run 64 processors served by the 32 cards, which do
>>>>>>the job. The cards run at 55 MHz.
>>>>>>
>>>>>>Please note that the PCI bus isn't the problem. They are using PCI-X.
>>>>>
>>>>>PCI-X falls flat if you have 8 cpus in a single box.  I have run on such
>>>>
>>>>It's not PCI-X that is the problem at all.
>>>>
>>>>It's simply tough programming to get it to work.
>>>
>>>No it isn't.  One bus to memory, 8 processors hanging on the bus trying to get
>>>to memory.  They get in the way of each other, and that is why bus architectures
>>>don't scale very well beyond 4.  And even going to 4 requires 4-way interleaving
>>>to keep up.  But they don't go to 8-way interleaving on the Dell-type boxes.
>>>Others do but they have a price that shows it...
>>>
>>>For 8 and up, a crossbar is really the right way to go, if the price can be
>>>handled.  Otherwise a NUMA-type approach like the AMD solution is most
>>>affordable.
>>>
>>>>
>>>>You simply must limit remote reads and writes to the absolute minimum.
>>>>
>>>>Multithreading, forget it.
>>>
>>>multi-threading or multiple processes is not the issue here.  What you can do
>>>with one, you can do with the other.  One just offers easier-to-use features for
>>>certain applications.
>>>
>>>
>>>>
>>>>>machines.  PCI does pretty well on 4-way systems, but on 8-way, the overall gain
>>>>
>>>>PCI latency is at least 4 times worse than PCI-X.
>>>>
>>>>PCI-X can easily give 1 us if the network card is fast enough.
>>>>
>>>
>>>We are not talking about network.  We are talking (at least I am talking) about
>>>SMP-type boxes with 8 cpus in a single chassis, like the 8-cpu xeon Dell sells,
>>>or like the 4-cpu xeon boxes I have here...
>>
>>I'm not talking about 8-way SMP boxes. Those are more expensive than a
>>32-processor cluster when you put the latest processors inside.
>>
>
>Quote on:
>------------------------------------------
>
>It's 4 nodes.
>
>1 node = 8 processor Xeon.
>------------------------------------------
>quote off
>
>"one node = 8 processor xeon".  That is _clearly_ SMP.  And _that_ is the box I
>was talking about.  I've run on more than one of them.  For chess they are bad:
>running 2 copies of Crafty produces 2x the NPS of one copy, running 4 copies
>produces nearly 4x the NPS of one copy, and running 8 copies produces barely 6x
>the 1-copy speed.
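[Editor's note: not part of the original post, but the quoted NPS figures can be turned into parallel efficiency (speedup divided by copy count) with a few lines, which makes the 8-way drop-off explicit:]

```python
# Parallel efficiency for the NPS figures quoted above:
# 2 copies -> 2x, 4 copies -> ~4x, 8 copies -> barely 6x.
nps_speedup = {2: 2.0, 4: 4.0, 8: 6.0}

for copies, speedup in nps_speedup.items():
    efficiency = speedup / copies
    print(f"{copies} copies: speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

At 8 copies, efficiency falls to 75%: each processor is spending a quarter of its time waiting on the shared bus rather than searching.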


Would this be considered a NUMA relationship between the two 4-way nodes?  Does
your NUMA Crafty handle this better than your older SMP-only Crafty?



>
>_that_ was the PCI bottleneck I was talking about.
>
>
>>>
>>>
>>>>In practice even cheapo Myrinet gives 2.7 us.
>>>>
>>>>>seems to be 1.5x a 4-way which is not that great.  If you run a program that
>>>>>runs out of cache quickly, this drops even further.
>>>>
>>>>You obviously can't run SMP programs over networks.
>>>
>>>That's why I wasn't talking about networks.  You originally said this machine
>>>(New Hydra) has a node with 8 processors.  That is what I am talking about.
>>
>>Perhaps in the future, read the subject.
>>
>
>
>Perhaps in the future you should remember what you wrote and what I responded
>to?  I haven't changed the subject a single time in my posts in this thread.
>
>
>
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>The latency of a one-way pingpong is around 2.7 us with Myrinet. That excludes
>>>>>>the router costs, which I guess will also be about 1 us for random data traffic
>>>>>>(down to about 35 ns for bandwidth traffic).
>>>>>>
>>>>>>Vincent
>>>>>
>>>>>
>>>>>All depends.  We have myrinet here and are probably going to use that in our new
>>>>>opteron cluster when we buy it after the dual-core opterons start shipping in
>>>>>quantity...
>>>>
>>>>For chess, Myrinet sucks, to say it very politely, because it doesn't allow DSM
>>>>(distributed shared memory).
>>>>
>>>>For just a few dollars more you can get Quadrics or Dolphin, which have better
>>>>latencies (Dolphin: 1 us) and allow distributed shared memory.
>>>>
>>>>The real major problem with Myrinet is that the receiving process must non-stop
>>>>receive the messages and process them. So you must build some kind of hand-timed
>>>>polling into the search process to do just that.
>>>>
>>>>With DSM your processes don't feel any of that.
>>>>
>>>>An 8-node Quadrics network is 13,095 dollars. That includes everything.
>>>>
>>>>Quadrics is used in the fastest supercomputers, like the nuclear supercomputer
>>>>France ordered a while ago. It scales far better than Myrinet once you start
>>>>scaling above those 8 nodes.
>>>>
>>>>For chess, using the DSM features in a program is not so trivial, but it is
>>>>pretty easy compared to the task of parallelizing a program in the first place.
>>>>
>>>>Vincent
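[Editor's note: the "hand timing" Vincent describes, where the search itself must periodically stop and drain incoming messages because there is no DSM, might be sketched as below. `poll()` and `handle_message()` are hypothetical stand-ins, not real Myrinet API calls.]

```python
# Hypothetical sketch of hand-timed message polling inside a search loop.
# Without DSM, no one delivers messages for us, so the search must
# explicitly check the receive queue every POLL_INTERVAL nodes.

POLL_INTERVAL = 1000  # tuned by hand: too small wastes time, too large starves peers

class Searcher:
    def __init__(self, network):
        self.network = network  # anything with a poll() returning a message or None
        self.nodes = 0

    def search_node(self):
        self.nodes += 1
        # ... normal alpha-beta work would go here ...
        if self.nodes % POLL_INTERVAL == 0:
            # Drain every pending message before resuming the search.
            while (msg := self.network.poll()) is not None:
                self.handle_message(msg)

    def handle_message(self, msg):
        pass  # e.g. split requests, bound updates, abort signals
```

With DSM the remote side simply reads or writes shared memory, and none of this polling machinery is needed, which is Vincent's point.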
>>>
>>>
>>>Just remember that most supercomputer applications don't care about latency,
>>
>>Just remember that I'm not building a cluster for matrix calculations, but for
>>DIEP :)
>>
>>In which case Myrinet sucks :)
>
>
>That's fine.  But "supercomputers" are not built to your specifications.  They
>are built to address the programming requirements of the large numerical systems
>being run.  They do that well.  Your car wasn't designed to fly either.  You'd
>have a hell of a time getting it to do so, although a lowly Piper Cub can, since
>it was designed to do so...
>
>
>>
>>>they care about bandwidth.  Large applications are all about streaming data, the
>>>latency for the first word is not important when several million are going to
>>>follow back-to-back.  All that matters to the big applications is how frequently
>>>do I get the next word, the latency for the first word gets buried in the cost
>>>to transfer the remaining millions of words.  That's what makes vector computers
>>>so powerful for the right kinds of applications, as opposed to these "toy
>>>supercomputers" that just use lots of general purpose processors and sloppy
>>>interconnections.
>>
>>The majority of supercomputers are not vector computers. Latency is also
>>important on supercomputers. The majority of jobs run on supercomputers use 4-8
>>processors and eat half the system time each year. The other half is matrix
>>calculations that could trivially run on cheapo clusters.
>>
>>Only a few applications are really optimized, and I wonder why they don't run
>>those on clusters instead of using very expensive SGI-type hardware, which
>>delivers very few flops per dollar.
>>
>>Vincent
>
>Because a cluster can't offer 1/100th the total memory bandwidth of a big Cray
>vector box.
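[Editor's note: the latency-vs-bandwidth argument above reduces to the simple model time = latency + words/bandwidth. A rough sketch, using the thread's 2.7 us Myrinet latency and an assumed (illustrative, not measured) sustained word rate, shows why latency dominates a one-word chess message but vanishes in a million-word stream:]

```python
# Transfer-time model: time = latency + words / bandwidth.
LATENCY_S = 2.7e-6    # one-way latency, the 2.7 us Myrinet figure from the thread
WORDS_PER_S = 250e6   # assumed sustained rate; purely illustrative

def transfer_time(words):
    return LATENCY_S + words / WORDS_PER_S

for n in (1, 1_000_000):
    total = transfer_time(n)
    share = LATENCY_S / total
    print(f"{n:>9} words: {total:.6f} s, latency is {share:.2%} of the cost")
```

For a single word, latency is essentially the whole cost; for a million-word stream it is well under 0.1%, which is why big numerical codes care about bandwidth while a chess search, sending many tiny messages, lives and dies by latency.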




Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.