Author: Vincent Diepeveen
Date: 10:26:04 02/01/05
On February 01, 2005 at 12:59:31, Robert Hyatt wrote:

>On February 01, 2005 at 10:59:18, Vincent Diepeveen wrote:
>
>>On February 01, 2005 at 00:56:17, Robert Hyatt wrote:
>>
>>>On January 31, 2005 at 13:35:02, Vincent Diepeveen wrote:
>>>
>>>>On January 31, 2005 at 13:03:43, Robert Hyatt wrote:
>>>>
>>>>>On January 31, 2005 at 10:14:28, Vincent Diepeveen wrote:
>>>>>
>>>>>>On January 31, 2005 at 10:01:16, Vincent Lejeune wrote:
>>>>>>
>>>>>>>
>>>>>>>news from 28/01/05 (more to come)
>>>>>>>
>>>>>>>http://hydrachess.com/hydra-scylla.html
>>>>>>>
>>>>>>>32 nodes (the previous version had 16), no information about CPU power and FPGA
>>>>>>>cards yet ...
>>>>>>
>>>>>>It's 4 nodes.
>>>>>>
>>>>>>1 node = 8 processor Xeon.
>>>>>>
>>>>>>FPGA cards would get double speed. So must be between 30Mhz and 60Mhz. They only
>>>>>>use development fpga cards. So they never use the real power of fpga (which is
>>>>>>printing your own processor, which can run hands down at 600Mhz or more). They
>>>>>>stick to development cards for some reason unknown to me.
>>>>>>
>>>>>>CPU power is not interesting at all of course, the cards do the work.
>>>>>>
>>>>>>Vincent
>>>>>
>>>>>I hope not. Old machine used 8 boxes with 2 cpus per box. Going to 8-way xeons
>>>>>is a performance killer. The PCI bus just can't keep up.
>>>>
>>>>Ok, neither the sheikh nor his right hand knew the architectural details very
>>>>well themselves, which we will forgive them.
>>>>
>>>>I have accurate information now.
>>>>
>>>>It is a 32 node system, with each node a dual. They have however only 32 FPGA
>>>>cards, because when ordering, the person filling in the form (and i am NOT going
>>>>to post who it was, but it was NOT the sheikh) confused nodes for cpus.
>>>>
>>>>So they have a mighty 32 node myrinet now with 64 processors. However, with 32
>>>>cards, they run 64 processors effectively while being served by 32 cards which do
>>>>the job. Cards at 55Mhz.
>>>>
>>>>Please note that the PCI bus isn't the problem. They are using pci-x.
>>>
>>>PCI-X falls flat if you have 8 cpus in a single box. I have run on such
>>
>>It's not the pci-x which is the problem at all.
>>
>>It's that it is simply tough programming to get it to work.
>
>No it isn't. One bus to memory, 8 processors hanging on the bus trying to get
>to memory. They get in the way of each other, and that is why bus architectures
>don't scale very well beyond 4. And even going to 4 requires 4-way interleaving
>to keep up. But they don't go to 8-way interleaving on the Dell-type boxes.
>Others do but they have a price that shows it...
>
>For 8 and up, a crossbar is really the right way to go, if the price can be
>handled. Otherwise a NUMA-type approach like the AMD solution is most
>affordable.
>
>>
>>You simply must limit any remote read or write to the ultimate maximum.
>>
>>Multithreading, forget it.
>
>multi-threading or multiple processes is not the issue here. What you can do
>with one, you can do with the other. One just offers easier-to-use features for
>certain applications.
>
>
>>
>>>machines. PCI does pretty well on 4way systems, but on 8-way, overall gain
>>
>>PCI is at least 4 times slower than pci-x in latency.
>>
>>pci-x can easily give 1 us if the network card is fast enough.
>>
>
>We are not talking about network. We are talking (at least I am talking) about
>SMP-type boxes with 8 cpus in a single chassis, like the 8-cpu xeon Dell sells,
>or like the 4-cpu xeon boxes I have here...

I'm not talking about 8 way smp boxes. Those are more expensive than a 32
processor cluster when you put the latest processors inside.

>
>
>>practical even cheapo myrinet gives 2.7 us.
>>
>>>seems to be 1.5x a 4-way which is not that great. If you run a program that
>>>runs out of cache quickly, this drops even further.
>>
>>you can't run smp programs over networks obviously.
>
>That's why I wasn't talking about networks. You originally said this machine
>(New Hydra) has a node with 8 processors. That is what I am talking about.

Perhaps in the future read the subject.

>
>
>>
>>>
>>>>
>>>>Latency of a one way pingpong is around 2.7 us with myrinet. That excludes the
>>>>router costs, which i guess will also be at about 1 us for random data traffic
>>>>(up to 35 ns for bandwidth traffic).
>>>>
>>>>Vincent
>>>
>>>
>>>All depends. We have myrinet here and are probably going to use that in our new
>>>opteron cluster when we buy it after the dual-core opterons start shipping in
>>>quantity...
>>
>>For chess myrinet sucks ass, to say it very politely, because it doesn't allow
>>DSM (distributed shared memory).
>>
>>For just a few dollars more you can get quadrics or dolphin, which have better
>>latencies (dolphin 1 us) and allow distributed shared memory.
>>
>>The real major problem with myrinet is that the receiving process must non stop
>>receive the messages and process them. So you must do some kind of hand-timed
>>polling within the search process to handle just that.
>>
>>With DSM your processes don't feel any of that.
>>
>>An 8 node quadrics network is 13095 dollar. That includes everything.
>>
>>Quadrics is used in the fastest supercomputers, like the nuclear supercomputer
>>France ordered a while ago. It scales far better than myrinet when you
>>start scaling above those 8 nodes.
>>
>>For chess, using the DSM features in a program is not so trivial, but it is
>>pretty easy compared to the task of parallelizing the product in the first place.
>>
>>Vincent
>
>
>Just remember that most supercomputer applications don't care about latency,

Just remember that I don't build a cluster for a matrix calculation, but for
DIEP :) In which case myrinet sucks :)

>they care about bandwidth. Large applications are all about streaming data, the
>latency for the first word is not important when several million are going to
>follow back-to-back. All that matters to the big applications is how frequently
>do I get the next word, the latency for the first word gets buried in the cost
>to transfer the remaining millions of words. That's what makes vector computers
>so powerful for the right kinds of applications, as opposed to these "toy
>supercomputers" that just use lots of general purpose processors and sloppy
>interconnections.

The majority of supercomputers are not vector computers. Latency is important
at supercomputers too. The majority of jobs run at supercomputers use 4-8
processors and eat half the system time a year. The other half is matrix
calculations that could trivially run at cheapo clusters.

Only a few applications are really optimized, and I wonder why they don't run
those on clusters but instead use very expensive SGI type hardware for it,
which delivers very little flops per dollar.

Vincent
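The 2.7 us one-way figure quoted above is the kind of number a standard pingpong
microbenchmark reports: two ranks bounce a tiny message back and forth and half
the average round-trip time is taken as the one-way latency. Below is a minimal
sketch of such a benchmark in C with MPI; it is illustrative only (not the code
either poster ran), and the iteration count and 8-byte payload are arbitrary
choices.

    /* pingpong.c - hedged sketch of a one-way latency microbenchmark.
       Run with two ranks, e.g.: mpirun -np 2 ./pingpong            */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        const int iters = 100000;
        char buf[8] = {0};          /* tiny payload: latency, not bandwidth */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* half the round trip = one-way latency, in us */
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }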
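The "hand-timed polling within the search process" mentioned above is the cost
of a pure message-passing interconnect: without DSM, the engine itself has to
check the network for incoming messages from inside the search, often enough
that other nodes never wait long. A minimal sketch of that idea, assuming MPI
and a hypothetical handle_message() dispatcher (this is not DIEP's actual code;
the polling interval and buffer size are illustrative assumptions):

    /* poll.c - hedged sketch of network polling inside a chess search. */
    #include <mpi.h>

    #define POLL_INTERVAL 1000      /* check the network every N nodes searched */
    #define MAX_MSG 4096            /* assumption: all messages fit in this buffer */

    static long nodes_searched = 0;

    /* Receive one pending message and dispatch it.  A real engine would
       switch on the tag: split request, search result, abort, etc. */
    static void handle_message(const MPI_Status *st)
    {
        int count;
        char buf[MAX_MSG];
        MPI_Get_count(st, MPI_BYTE, &count);
        MPI_Recv(buf, count, MPI_BYTE, st->MPI_SOURCE, st->MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... act on buf here ... */
    }

    /* Called at every node of the recursive search. */
    static void poll_network(void)
    {
        if (++nodes_searched % POLL_INTERVAL != 0)
            return;                 /* keep polling overhead off the hot path */

        int flag;
        MPI_Status st;
        /* non-blocking probe: is anything waiting for us? */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        while (flag) {
            handle_message(&st);
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        }
    }

The point of the DSM comparison is that with an interconnect offering
distributed shared memory, a remote read or write completes in the network
hardware, so the search loop carries no such polling hook and no tuning of the
polling interval.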