Author: Matthew Hull
Date: 14:04:42 02/01/05
On February 01, 2005 at 16:28:22, Robert Hyatt wrote:

>On February 01, 2005 at 13:26:04, Vincent Diepeveen wrote:
>
>>On February 01, 2005 at 12:59:31, Robert Hyatt wrote:
>>
>>>On February 01, 2005 at 10:59:18, Vincent Diepeveen wrote:
>>>
>>>>On February 01, 2005 at 00:56:17, Robert Hyatt wrote:
>>>>
>>>>>On January 31, 2005 at 13:35:02, Vincent Diepeveen wrote:
>>>>>
>>>>>>On January 31, 2005 at 13:03:43, Robert Hyatt wrote:
>>>>>>
>>>>>>>On January 31, 2005 at 10:14:28, Vincent Diepeveen wrote:
>>>>>>>
>>>>>>>>On January 31, 2005 at 10:01:16, Vincent Lejeune wrote:
>>>>>>>>
>>>>>>>>>news from 28/01/05 (more to come)
>>>>>>>>>
>>>>>>>>>http://hydrachess.com/hydra-scylla.html
>>>>>>>>>
>>>>>>>>>32 nodes (the previous version had 16); no information about CPU power or the FPGA cards yet ...
>>>>>>>>
>>>>>>>>It's 4 nodes.
>>>>>>>>
>>>>>>>>1 node = 8-processor Xeon.
>>>>>>>>
>>>>>>>>The FPGA cards would get double the speed, so they must run at between 30 MHz and 60 MHz. They only use development FPGA cards, so they never use the real power of FPGAs (which is printing your own processor, which can easily run at 600 MHz or more). They stick to development cards for some reason unknown to me.
>>>>>>>>
>>>>>>>>CPU power is not interesting at all, of course; the cards do the work.
>>>>>>>>
>>>>>>>>Vincent
>>>>>>>
>>>>>>>I hope not. The old machine used 8 boxes with 2 CPUs per box. Going to 8-way Xeons is a performance killer. The PCI bus just can't keep up.
>>>>>>
>>>>>>OK, neither the sheikh nor his right-hand man knew the architectural details very well, which we will forgive them.
>>>>>>
>>>>>>I have accurate information now.
>>>>>>
>>>>>>It is a 32-node system, with each node a dual. They have only 32 FPGA cards, however, because when ordering, the person filling in the form (and I am NOT going to post who it was, but it was NOT the sheikh) confused nodes with CPUs.
>>>>>>
>>>>>>So they have a mighty 32-node Myrinet now with 64 processors.
>>>>>>However, with 32 cards, they effectively run 64 processors while being served by the 32 cards, which do the job. The cards run at 55 MHz.
>>>>>>
>>>>>>Please note that the PCI bus isn't the problem. They are using PCI-X.
>>>>>
>>>>>PCI-X falls flat if you have 8 CPUs in a single box. I have run on such
>>>>
>>>>It's not the PCI-X that is the problem at all.
>>>>
>>>>It's that it is simply tough programming to get it to work.
>>>
>>>No it isn't. One bus to memory, 8 processors hanging on the bus trying to get to memory. They get in each other's way, and that is why bus architectures don't scale very well beyond 4. Even going to 4 requires 4-way interleaving to keep up, but they don't go to 8-way interleaving on the Dell-type boxes. Others do, but they have a price that shows it...
>>>
>>>For 8 and up, a crossbar is really the right way to go, if the price can be handled. Otherwise a NUMA-type approach like the AMD solution is most affordable.
>>>
>>>>You simply must limit any remote read or write to the absolute minimum.
>>>>
>>>>Multithreading, forget it.
>>>
>>>Multi-threading versus multiple processes is not the issue here. What you can do with one, you can do with the other. One just offers easier-to-use features for certain applications.
>>>
>>>>>machines. PCI does pretty well on 4-way systems, but on 8-way, the overall gain
>>>>
>>>>PCI is at least 4 times slower than PCI-X in latency.
>>>>
>>>>PCI-X can easily give 1 us if the network card is fast enough.
>>>
>>>We are not talking about networks. We are talking (at least I am talking) about SMP-type boxes with 8 CPUs in a single chassis, like the 8-CPU Xeon Dell sells, or like the 4-CPU Xeon boxes I have here...
>>
>>I'm not talking about 8-way SMP boxes. Those are more expensive than a 32-processor cluster when you put the latest processors inside.
>
>Quote on:
>------------------------------------------
>
>It's 4 nodes.
>
>1 node = 8-processor Xeon.
>------------------------------------------
>Quote off.
>
>"One node = 8-processor Xeon." That is _clearly_ SMP, and _that_ is the box I was talking about. I've run on more than one of them. For chess they are bad: running 2 copies of Crafty produces 2x the NPS of one copy, running 4 copies produces nearly 4x the NPS of one copy, but running 8 copies produces barely 6x the 1-copy speed.

Would this be considered a NUMA relationship between the two 4-way nodes? Does your NUMA Crafty handle this better than your older SMP-only Crafty?

>_that_ was the PCI bottleneck I was talking about.
>
>>>>In practice, even cheapo Myrinet gives 2.7 us.
>>>>
>>>>>seems to be 1.5x a 4-way, which is not that great. If you run a program that runs out of cache quickly, this drops even further.
>>>>
>>>>You can't run SMP programs over networks, obviously.
>>>
>>>That's why I wasn't talking about networks. You originally said this machine (the new Hydra) has a node with 8 processors. That is what I am talking about.
>>
>>Perhaps in future, read the subject.
>
>Perhaps in the future you should remember what you wrote and what I responded to? I haven't changed the subject a single time in my posts in this thread.
>
>>>>>>Latency of a one-way ping-pong is around 2.7 us with Myrinet. That excludes the router costs, which I guess will also be about 1 us for random data traffic (up to 35 ns for bandwidth traffic).
>>>>>>
>>>>>>Vincent
>>>>>
>>>>>It all depends. We have Myrinet here and will probably use it in our new Opteron cluster when we buy it, after the dual-core Opterons start shipping in quantity...
>>>>
>>>>For chess, Myrinet sucks, to say it very politely, because it doesn't allow DSM (distributed shared memory).
>>>>
>>>>For just a few dollars more you can get Quadrics or Dolphin, which have better latencies (Dolphin: 1 us) and allow distributed shared memory.
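The Crafty figures Bob quotes above (2x NPS with 2 copies, nearly 4x with 4, barely 6x with 8) translate directly into a bus-contention efficiency number. A minimal sketch of that arithmetic; the measurements are the ones quoted in this post, rounded to whole multiples, and nothing else here comes from the thread:

```python
# Parallel efficiency of independent Crafty copies on an 8-way Xeon box,
# from the quoted aggregate-NPS multiples: efficiency = speedup / copies.
# The drop at 8 copies is the shared memory bus saturating.
measurements = {2: 2.0, 4: 4.0, 8: 6.0}  # copies -> aggregate NPS multiple

for copies, speedup in measurements.items():
    efficiency = speedup / copies
    print(f"{copies} copies: {speedup:.1f}x aggregate NPS, "
          f"efficiency {efficiency:.0%}")
```

At 8 copies the efficiency falls to 75% even with zero communication between the copies, which is why the complaint is about the bus rather than about the parallel search algorithm.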
>>>>The real major problem with Myrinet is that the receiving process must receive and process the messages non-stop. So you must do some kind of hand-timed polling within the search process just to handle that.
>>>>
>>>>With DSM your processes feel none of that.
>>>>
>>>>An 8-node Quadrics network is 13,095 dollars. That includes everything.
>>>>
>>>>Quadrics is used in the fastest supercomputers, like the nuclear supercomputer France ordered a while ago. It scales far better than Myrinet once you start scaling above those 8 nodes.
>>>>
>>>>For chess, using the DSM features in a program is not so trivial, but it is pretty easy compared to the task of parallelizing a product.
>>>>
>>>>Vincent
>>>
>>>Just remember that most supercomputer applications don't care about latency,
>>
>>Just remember that I don't build a cluster for a matrix calculation, but for DIEP :)
>>
>>In which case Myrinet sucks :)
>
>That's fine. But "supercomputers" are not built to your specifications. They are built to address the programming requirements of the large numerical systems being run, and they do that well. Your car wasn't designed to fly either; you'd have hell getting it to do so, although a lowly Piper Cub can do it, since it was designed to do so...
>
>>>they care about bandwidth. Large applications are all about streaming data; the latency for the first word is not important when several million more are going to follow back-to-back. All that matters to the big applications is how frequently the next word arrives; the latency for the first word gets buried in the cost of transferring the remaining millions of words. That's what makes vector computers so powerful for the right kinds of applications, as opposed to these "toy supercomputers" that just use lots of general-purpose processors and sloppy interconnects.
>>
>>The majority of supercomputers are not vector computers. Latency is important on supercomputers too.
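Bob's point about latency getting buried under bandwidth for streaming workloads can be made concrete with the usual first-order transfer model, time = latency + size / bandwidth. A sketch with illustrative numbers: the 2.7 us Myrinet latency is the figure quoted in this thread, but the 250 MB/s bandwidth is an assumption for illustration only, not a measured value:

```python
def transfer_time_us(size_bytes, latency_us=2.7, bandwidth_mb_s=250.0):
    """First-order message cost: startup latency plus payload time.
    250 MB/s is an assumed bandwidth; note MB/s numerically equals
    bytes per microsecond (taking MB = 10**6 bytes)."""
    return latency_us + size_bytes / bandwidth_mb_s

# A chess-style probe (a few bytes) is dominated by the startup latency;
# a 10 MB numerical stream barely notices it.
small = transfer_time_us(8)
large = transfer_time_us(10 * 1024 * 1024)
print(f"8-byte message: {small:.2f} us, latency share {2.7 / small:.0%}")
print(f"10 MB message: {large:.0f} us, latency share {2.7 / large:.4%}")
```

The same 2.7 us is almost the entire cost of the small message and a rounding error on the large one, which is why chess-style searches and matrix codes rank interconnects so differently.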
>>The majority of jobs run on supercomputers use 4-8 processors and eat half of the system time in a year. The other half is matrix calculations that could trivially run on cheap clusters.
>>
>>Only a few applications are really optimized, and I wonder why they don't run those on clusters instead of on very expensive SGI-type hardware, which delivers very few flops per dollar.
>>
>>Vincent
>
>Because a cluster can't offer 1/100th of the total memory bandwidth of a big Cray vector box.
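Vincent's complaint about Myrinet earlier in the thread (the receiving process must keep draining messages by hand, so you need "hand timing" inside the search) usually comes down to polling a receive queue every N nodes in the search loop. A minimal sketch of that pattern; the counter, the interval, and the queue standing in for the interconnect's receive buffer are all hypothetical illustrations, not code from DIEP, Crafty, or any Myrinet API:

```python
from collections import deque

POLL_INTERVAL = 1024   # hypothetical: check for messages every N nodes searched
incoming = deque()     # stands in for the interconnect's receive queue
nodes_searched = 0

def handle(msg):
    """Placeholder for reacting to a split/abort/result message."""
    print("handled:", msg)

def poll_messages():
    """Drain pending messages. With DSM this hand-timed step
    would not be needed at all; remote reads just happen."""
    while incoming:
        handle(incoming.popleft())

def search_node():
    """One node of a (stubbed-out) search, with the periodic poll woven in."""
    global nodes_searched
    nodes_searched += 1
    if nodes_searched % POLL_INTERVAL == 0:
        poll_messages()

# Simulate a burst of search work with one message arriving mid-search.
incoming.append("abort-subtree")
for _ in range(2048):
    search_node()
```

The cost is that an urgent message (say, an abort for a refuted subtree) waits up to POLL_INTERVAL nodes before anyone looks at it, which is exactly the intrusion into the search that DSM hardware is claimed to avoid.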